# Computational Models of Behavior, part 1

How to fit reinforcement learning models to behavioral data using Maximum Likelihood Estimation

Cover image credit: Gerd Altmann from Pixabay

## Introduction

It is pretty clear that neuroscience is shaping the artificial intelligence (AI) field by means of new research questions and ideas. But at the same time AI also has a huge impact on the neuroscience field. Over the last decades reinforcement learning (RL) algorithms have been particularly useful in attempts to understand how humans and animals make decisions and they’ve provided methods to quantify the parameters of behavioral data, which could be also correlated to neuronal data. For example, temporal difference (TD) learning model has highlighted the role of dopamine in decision-making in a classical study of Schultz (1998)1.

Even though RL methods have been applied in a neuroscience field for quite some time, some may still be unsure how these methods can help with answering research questions or how to actually apply modeling in practice. While there are already several comprehensive articles on how to apply RL models to decision-making data, the purpose of this post is to provide an actual example of model fitting with coding guidelines and also to compare two main approaches, maximum likelihood estimation (MLE) and Bayesian method. Part 1 will focus on the MLE approach with the example code in Python, whereas part 2 will focus on the Bayesian approach written in R with the help of hBayesDM package. Both parts should be viewed as an overview, and not as means of selecting the one-size-fits-all approach. The explanations and basics of theory will be provided, however for a more in-depth overview reader might want to refer to Lockwood et al. (2021)2, Daw (2011)3 and Wilson et al. (2019)4.

Objectives:

1. How to answer research questions using modelling.
2. How to fit a RL model to behavioral data using MLE.
3. How to compare models and choose the best fitting one.
4. How to generate the data using RL model.
Code

import itertools
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import pingouin as pg
from tqdm import tqdm
from scipy.special import softmax, log_softmax
from scipy.optimize import minimize
from scipy.stats import chisquare

plt.style.use('bmh')
plt.rcParams['figure.facecolor'] = 'white'


Research question:

We hypothesize that treatment (could be anything, dopamine manipulations, exposure to stress, etc.) affects humans’ reward sensitivity and the learning rate for a positive outcome (the speed of learning where the optimal choice is).

We could test this using so-called multi-armed bandit (MAB) paradigm, with a slight modification. In our case, each “arm” can bring reward and punishment at the same time for each trial. The probabilities of outcomes are stationary over time. Subject has to make a decision (choose an “arm”) at each trial. We will use a 4-armed bandit.

We will also have two groups, control (CTR) and treatment (TRT), and compare the parameters between them.

## Data generation

For this article, we will generate artificial data, but we will treat it as real-life data (at least most of the time). In other words, we will generate data with a set of parameter values, but then we will try to recover them using the modeling. On the one hand, this will allow to perform a quantitative assessment of model fitting techniques (since we know the ground truth). On the other hand, we can see how to perform a model fitting in a “real-life”.

We will use the model adapted from Seymour et al. (2012)5. Parameters of the model are:

• learning rate for reward outcomes $\alpha^{\text{reward}}$ (how fast agent learns from the positive feedback)
• learning rate for punishment outcomes $\alpha^{\text{punishment}}$ (how fast agent learns from the negative feedback)
• reward sensitivity $R$ (how agent perceives positive feedback)
• punishment sensitivity $P$ (how agent perceives negative feedback)

Model algorithm:

• for each step $t$ in episode do
• $\scriptsize Q(a_j)_t \leftarrow Q(a_j)^{\text{reward}}_t + Q(a_j)^{\text{punishment}}_t$
• Select action $a_t$ using softmax policy $\scriptsize P(a_j = a_t) = \frac{e^{Q_j}}{\sum_{j}e^{Q_j}}$
• Observe reward $r_t$ and punishment $p_t$ values
• for each action $a_j$ in possible actions do
• if $a_j$ == $a_t$ do
• $\scriptsize Q(a_j)^{\text{reward}}_{t+1} - Q(a_j)^{\text{reward}}_t + \alpha^{\text{reward}} \left( R \cdot r_t - Q(a_j)^{\text{reward}}_t \right)$
• $\scriptsize Q(a_j)^{\text{punishment}}_{t+1} \leftarrow Q(a_j)^{\text{punishment}}_t + \alpha^{\text{punishment}} \left( P \cdot p_t - Q(a_j)^{\text{punishment}}_t \right)$
• else
• $\scriptsize Q(a_j)^{\text{reward}}_{t+1} \leftarrow Q(a_j)^{\text{reward}}_t + \alpha^{\text{reward}} \left( - Q(a_j)^{\text{reward}}_t \right)$
• $\scriptsize Q(a_j)^{\text{punishment}}_{t+1} \leftarrow Q(a_j)^{\text{punishment}}_t + \alpha^{\text{punishment}} \left( - Q(a_j)^{\text{punishment}}_t \right)$
• end
• end
• end
A note on terminology. In neuroscience settings, we usually call participants “subjects”, whereas in RL we usually refer to them as “agents” (which may also have more abstract meaning). Also, a set of consecutive trials is usually called a “session” in neuroscience, but an “episode” in RL. In this article we may use these terms interchangeably.
Code

def generate_agent_data(
alpha_rew, alpha_pun, rew_sens,
pun_sens, rew_prob, pun_prob, rew_vals,
n_arms=4, n_trials=100):
"""
Simulate agent data for N-armed bandit task.

Arguments
----------
alpha_rew : float
Learning rate for rewarding trials. Should be in a (0, 1) range.

alpha_pun : float
Learning rate for punishing trials. Should be in a (0, 1) range.

rew_sens : float
Reward sensitivity (R)

pun_sens : float
Punishment sensitivity (P)

rew_prob : array_like
Probability of reward for each arm.

pun_prob : array_like
Probability of punishment for each arm.

rew_vals : array_like
Values of reward (first element) and punishment (second element).
For example [100, -50] means that agent receives 100 a.u. during
the reward and looses 50 a.u. during the punishment.

n_arms : int, default=4
Number of arms in N-armed bandit task.

n_trials : int, default=100
Number of simulated trials for an agent

Returns
----------
actions : array of shape (n_trials,)
Selected action by the agent at each trial.

gains : array of shape (n_trials,)
Agent's reward value at each trial.

losses : array of shape (n_trials,)
Agent's punishment value at each trial.

Qs : array of shape (n_trials, n_arms)
Value function for each arm at each trial.
"""

actions = np.zeros(shape=(n_trials,), dtype=np.int32)
gains = np.zeros(shape=(n_trials,), dtype=np.int32)
losses = np.zeros(shape=(n_trials,), dtype=np.int32)
Qs = np.zeros(shape=(n_trials, n_arms))

# initial uniform values for both reward and punishment value function
Q_rew = np.ones(shape=(n_arms,))
Q_rew /= n_arms

Q_pun = np.ones(shape=(n_arms,))
Q_pun /= n_arms

Qs[0] = Q_rew + Q_pun

for i in range(n_trials):

# choose the action based of softmax function
prob_a = softmax(Qs[i])
a = np.random.choice(a=range(n_arms), p=prob_a) # select the action
# list of actions that were not selected
a_left = list(range(n_arms))
a_left.remove(a)

# reward
if np.random.rand() < rew_prob[a]: # if arm brings reward
r = rew_vals[0]
else:
r = 0

gains[i] = r
# value function update for a chosen arm
Q_rew[a] += alpha_rew * (rew_sens*r - Q_rew[a])
# value function update for non-chosen arms
for a_l in a_left:
Q_rew[a_l] += alpha_rew*(-Q_rew[a_l])

# punishment
if np.random.rand() < pun_prob[a]:  # if arm brings punishment
r = rew_vals[1]
else:
r = 0

losses[i] = r
Q_pun[a] += alpha_pun * (pun_sens*r - Q_pun[a])
for a_l in a_left:
Q_pun[a_l] += alpha_pun*(-Q_pun[a_l])

# save the records
actions[i] = a

if i < n_trials-1:
Qs[i+1] = Q_rew + Q_pun

return actions, gains, losses, Qs


Initial parameters:

• Probabilities of reward for each option are: 10%, 30%, 60% and 90%.
• Probabilities of punishment for each option are: 90%, 60%, 30% and 10%.
• Value of the reward is $+1$, value of the punishment is $-1$.
• Each group (CTR and TRT) will consist of 10 subjects.
• Each subject will perform 200 trials.

We can see, that an “ideal” agent would learn that 4-th option is the best choice to maximize the total reward, since it has the highest probability of reward and lowest probability of punishment.

Code

# initial parameters
rew_probs = np.array([.1, .3, .6, .9]) # probability of reward of each arm
pun_probs = np.array([.9, .6, .3, .1]) # probability of punishment of each arm
rew_vals = np.array([1, -1]) # reward = 1, punishment = -1
n_subj = 10 # n subjects in each group
n_arms = 4
n_trials = 200

# ids of agents
ids = {'Control': list(range(1, 11)),
'Treatment': list(range(11, 21))}

params = ['alpha_rew', 'alpha_pun', 'R', 'P']
param_names = {
'alpha_rew': 'Learning rate (reward)',
'alpha_pun': 'Learning rate (punishment)',
'R': 'Reward sensitivity',
'P': 'Punishment sensitivity'}

# data generation
np.random.seed(1)
agent_data = pd.DataFrame()
true_params = {}

for subj in itertools.chain(*ids.values()):

true_params[subj] = {}

if subj in ids['Control']:
# control group
group = 'Control'
alpha_rew = np.random.randint(low=30, high=50) / 100 # RV in [0.3, 0.5]
alpha_pun = np.random.randint(low=20, high=40) / 100 # RV in [0.2, 0.4]
rew_sens = np.random.randint(low=20, high=50) / 10 # RV in [2., 5.]
pun_sens = np.random.randint(low=30, high=70) / 10 # RV in [3., 7.]
elif subj in ids['Treatment']:
# treatment group
group = 'Treatment'
alpha_rew = np.random.randint(low=10, high=25) / 100 # RV in [0.1, 0.25]
alpha_pun = np.random.randint(low=20, high=40) / 100 # RV in [0.2, 0.4]
rew_sens = np.random.randint(low=15, high=50) / 10 # RV in [1., 5.]
pun_sens = np.random.randint(low=30, high=70) / 10 # RV in [3., 7.]
else:
raise ValueError(f'No subject with id {subj} found!')

actions, gain, loss, Qs = generate_agent_data(
alpha_rew=alpha_rew,
alpha_pun=alpha_pun,
rew_sens=rew_sens,
pun_sens=pun_sens,
rew_prob=rew_probs,
pun_prob=pun_probs,
rew_vals=rew_vals,
n_trials=n_trials)

# add generated data to temporary data frame
temp_df = pd.DataFrame({
'group': group, 'subjID': subj, 'trial': range(1, n_trials+1),
'choice': actions+1, 'gain': gain, 'loss': loss})

agent_data = agent_data.append(temp_df)

# update dictionary with true parameter values
true_params[subj]['alpha_rew'] = alpha_rew
true_params[subj]['alpha_pun'] = alpha_pun
true_params[subj]['R'] = rew_sens
true_params[subj]['P'] = pun_sens


Sample of generated data:

Table 1: Sample of the generated data
groupsubjIDtrialchoicegainloss
0Control1110-1
1Control1210-1
2Control13310
3Control14310
4Control1510-1
5Control1631-1
6Control17400
7Control18300
8Control1911-1
9Control110410

Even though this data is generated, that’s the exact data you can get from a real behavioral experiment. Each row in a dataset represents a trial for a particular subject. choice column represents the option chosen, gain and loss represent the reward and punishment values for a given trial. Full generated data set can be found here.

Code

true_params_df = pd.DataFrame(true_params).T.reset_index(drop=False)
true_params_df.rename(columns={'index': 'subjID'}, inplace=True)
true_params_df['group'] = true_params_df['subjID'].apply(
lambda x: 'Control' if x in ids['Control'] else (
'Treatment' if x in ids['Treatment'] else 'Undefined'
)
)

# plot boxplots with true parameter values for each group
plt.figure(figsize=(11,6))
for (i, param) in enumerate(params):
plt.subplot(2,2,i+1)
sns.boxplot(
y=param, x='group',
data=true_params_df, linewidth=3,
palette='Set3')
sns.stripplot(
y=param, x='group', data=true_params_df,
jitter=0.1, linewidth=2, palette='Set3')
if i < 2:
plt.xlabel('')
plt.ylim([0, .55]) # use the same scale for each parameter type
else:
plt.xlabel('Group', fontweight='bold')
plt.ylim([1, 8])

plt.ylabel('')
plt.title(param_names[param], fontweight='bold')

plt.suptitle('True parameters of agents', fontsize=15, fontweight='bold')
plt.tight_layout()
plt.show()


That’s how true parameters are distributed. Visually, it looks like TRT group on average has a lower learning rate for positive feedback compared to CTR.

Reminder, that we don’t have information about the true parameter values in real life. We have access to them here since we are working with the simulated data.

## Maximum Likelihood Estimation

Now we want to recover the parameters of subjects and basically check if our hypothesis is true and treatment affects the learning rate for positive feedback and sensitivity to a reward. As told before, there are two main approaches to do this, fitting the model using MLE and Bayesian way. Here we will focus on the MLE.

The idea of MLE is pretty simple, however, math under the hood is a bit more sophisticated. We want to find such set of parameters $\hat{\Theta} = { \{ \alpha^{\text{reward}}, \alpha^{\text{punishment}}, R, P \} }$, that is more likely to be true. In a simplified version, one can start it by choosing 4 random values for $\hat{\Theta}$, putting them into the algorithm described in data generation part, and then answering the question “What is the probability that actions were selected given these four parameters” (or in a notation way, $P(\text{data}|\Theta)$). After this, one would have to change the values and answer the same question again for a new set of values. After $n$ of such runs, the winning set of parameters is the one that gives the highest probability of the data occurrence. Or in other words, it is the maximum likelihood estimator.

But of course, no one would do that by hand and we can use function optimization techniques to deal with it. In Python we can use minimize function from SciPy package to find the estimated values with the help of limited-memory Broyden–Fletcher–Goldfarb–Shanno algorithm. This algorithm is a variation of a gradient descent optimization method.

Since the function is looking for a minimum value, we will change the sign of likelihood in our function (and now function minimize will find the minimum value, while it is actually the highest). Also, we are going to use the logarithm of likelihood, not the likelihood value itself.

Code

def log_lik(x, *args):
"""
Sum of negative log likelihoods over all trials.

Arguments
----------
x : tuple, list
Parameters to estimate by MLE fit.

args: tuple, list
Input parameters for the model.

Returns
----------
Sum of negative log likelihoods.
"""

alpha_rew, alpha_pun, rew_sens, pun_sens = x # parameters to estimate
n_arms, actions, gain, loss = args # input values

Q_rew = np.ones(shape=(n_arms,))
Q_rew /= n_arms

Q_pun = np.ones(shape=(n_arms,))
Q_pun /= n_arms

log_prob_actions = np.zeros(shape=(len(actions), n_arms))

for i, (a, g, l) in enumerate(zip(actions, gain, loss)):

a_left = list(range(n_arms)) # actions that were not selected
a_left.remove(a)

log_prob_action = log_softmax(Q_rew + Q_pun)
log_prob_actions[i] = log_prob_action[a]

Q_rew[a] += alpha_rew * (rew_sens*g - Q_rew[a])
for a_l in a_left:
Q_rew[a_l] += alpha_rew*(-Q_rew[a_l])

Q_pun[a] += alpha_pun * (pun_sens*l - Q_pun[a])
for a_l in a_left:
Q_pun[a_l] += alpha_pun*(-Q_pun[a_l])

return -np.sum(log_prob_actions)

def mle_fit(x0, fun, params, bounds):
"""
Fit model parameters using MLE approach.

Arguments
----------
x0 : array-like
Initial value for each of the parameter.

fun : function
Function name to use for optimization.

params : list
List of parameter names.

bounds : tuple
Tuple of the lower and upper bounds for parameters.

Returns
----------
output : dict
Dictionary with the estimated parameters for each agent.

Example call
----------
mle_fit(
x0=[0.5, 2], fun=logLike,
params=['alpha', 'beta'],
bounds=((0.1, 1), (0, 10))
"""

output = {}

for subj in tqdm(agent_data['subjID'].unique(), desc=f'Fit with x0: {x0}'):

output[subj] = {}
cond = agent_data['subjID'] == subj
actions = (agent_data[cond]['choice'] - 1).to_list()
gain = agent_data[cond]['gain'].to_list()
loss = agent_data[cond]['loss'].to_list()

result = minimize(
fun=fun, x0=x0,
method='L-BFGS-B', # Limited-memory Broyden–Fletcher–Goldfarb–Shanno algorithm
args=(n_arms, actions, gain, loss),
bounds=bounds)

for (j,param) in enumerate(params):
output[subj][param] = result.x[j]
output[subj]['likelihood'] = result.fun

return output

def best_params(all_params_list):
"""
Find the best fitting parameters from a set of parameters.

Arguments
----------
all_params_list : list
List with the dictionaries of fitted value.
Each dictionary should look as following:
{<subject>: {<parameter1>: <value>, ..., <likelihood>: <value>}}

Returns
----------
selected_params : dict
Dictionary with the best fitted values.
"""

# dictionary {<subject>: [<likelihood1>, <likelihood2>, ...]}
likelihoods = {}
for subj in itertools.chain(*ids.values()):
likelihoods[subj] = []
for param_set in all_params_list:
likelihoods[subj].append(param_set[subj]['likelihood'])

# select the lowest log likelihood among given models
# {<subject>: <index of a set with the lowest log likelihood>}
selected_set = {subj: np.argmin(lik) for (subj, lik) in likelihoods.items()}

# new set of parameters for each subject with best fitted values
selected_params = {}
for subj in selected_set.keys():
selected_params[subj] = all_params_list[selected_set[subj]][subj]

return selected_params


### Different starting values

Before we start fitting, there is one big issue with MLE that we should be aware of. The algorithm can be stuck in a local minimum, meaning that the found set of parameters is not optimal in the entire parameter space. One of the ways to deal with this problem is to use several starting points. Here we will run MLE for 10 random starting values and choose the set that gives the minimum negative log likelihood for each subject.

We will also use bounds for the parameters (learning rates should be in a range $[ 0, 1]$ and sensitivities should be in a range $[ 0, 8 ]$), however Daw (2009)3 argues against that.

Code

all_mle_params = []

for i in range(10):
# generate random initial values
np.random.seed(i)
alpha0 = list(np.random.random(size=(2,)).round(2))
sens0 = list(np.random.randint(low=0, high=80, size=(2,)) / 10)
x0 = alpha0 + sens0

mle_output = mle_fit(
x0=x0, fun=log_lik,
params=params,
bounds=((0, 1), (0, 1), (0, 8), (0, 8)))

all_mle_params.append(mle_output)

selected_params = best_params(all_mle_params)

Fit with x0: [0.55, 0.72, 6.7, 6.7]: 100%|██████████| 20/20 [00:38<00:00,  1.91s/it]
Fit with x0: [0.42, 0.72, 0.9, 7.5]: 100%|██████████| 20/20 [00:36<00:00,  1.81s/it]
Fit with x0: [0.44, 0.03, 2.2, 4.3]: 100%|██████████| 20/20 [00:35<00:00,  1.78s/it]
Fit with x0: [0.55, 0.71, 5.6, 7.2]: 100%|██████████| 20/20 [00:27<00:00,  1.38s/it]
Fit with x0: [0.97, 0.55, 0.1, 7.2]: 100%|██████████| 20/20 [00:45<00:00,  2.25s/it]
Fit with x0: [0.22, 0.87, 1.6, 7.3]: 100%|██████████| 20/20 [00:34<00:00,  1.74s/it]
Fit with x0: [0.89, 0.33, 7.9, 6.2]: 100%|██████████| 20/20 [00:39<00:00,  1.98s/it]
Fit with x0: [0.08, 0.78, 6.7, 2.3]: 100%|██████████| 20/20 [00:34<00:00,  1.75s/it]
Fit with x0: [0.87, 0.97, 0.5, 0.8]: 100%|██████████| 20/20 [00:38<00:00,  1.91s/it]
Fit with x0: [0.01, 0.5, 5.4, 5.6]: 100%|██████████| 20/20 [00:41<00:00,  2.09s/it]


Once we have selected the best fitting values, the following questions may arise:

• Are we sure that the selected values represent the global minimum? We are not sure.
• Were 10 initial starting points enough? We don’t know.
• Would another optimization method find a better set of parameters? It could, but the overview of different optimization methods is beyond the scope of this article.

It may sound vague, but that’s the sad truth about MLE, we are never sure that we’ve found the best parameters. Although, by trying different methods and initial values we can make it as good as possible.

Code

# convert dict to dataframe
mle_params_df = pd.DataFrame(selected_params).T.reset_index(drop=False)
mle_params_df.rename(columns={'index': 'subjID'}, inplace=True)
mle_params_df['group'] = mle_params_df['subjID'].apply(
lambda x: 'Control' if x in ids['Control'] else (
'Treatment' if x in ids['Treatment'] else 'Undefined'
)
)

# plot estimated parameters
plt.figure(figsize=(11,6))
for (i, param) in enumerate(params):
plt.subplot(2,2,i+1)
sns.boxplot(
y=param, x='group',
data=mle_params_df, linewidth=3,
palette='Set3')
sns.stripplot(
y=param, x='group', data=mle_params_df,
jitter=0.1, linewidth=2, palette='Set3')
if i < 2:
plt.xlabel('')
plt.ylim([-0.05, 1.05]) # use the same scale for each parameter type
else:
plt.xlabel('Group', fontweight='bold')
plt.ylim([0, 12])

plt.ylabel('')
plt.title(param_names[param], fontweight='bold')

plt.suptitle('Estimated parameters of agents (MLE)', fontsize=15, fontweight='bold')
plt.tight_layout()
plt.show()


Now, we can actually estimate how good the selected parameters are.

Code

# plot estimated parameters vs true parameters
plt.figure(figsize=(8,8))

for (i,param) in enumerate(params):

plt.subplot(2,2,i+1)
plt.axis('square')
sns.scatterplot(x=mle_params_df[param], y=true_params_df[param])

r_coef = pg.corr(true_params_df[param], mle_params_df[param])['r'][0]
p_val = pg.corr(true_params_df[param], mle_params_df[param])['p-val'][0]

if i < 2:
lim = [0,1]
plt.plot(lim, lim, ls="--", c=".3")
plt.xlim(lim)
plt.ylim(lim)
else:
lim = [0,10]
plt.plot(lim, lim, ls="--", c=".3")
plt.xlim(lim)
plt.ylim(lim)

plt.xlabel('MLE', fontweight='bold')
plt.ylabel('True', fontweight='bold')
plt.title(
f'{param_names[param]}\nr = {r_coef:.2f} (p-val = {p_val:.2f})',
fontweight='bold')

plt.tight_layout()
plt.show()


In the perfect world, all dots would lie on a diagonal line, meaning that fitted values (on the $x$ axis) match the real values (on the $y$ axis). In our case, we can see that MLE fit recovered values for reward sensitivity quite accurately, whereas punishment sensitivity values are not that aligned.

Yet another reminder, that we are able to make this type of comparison since we are using the generated data. In real life, we don’t have access to the true parameters.

Another caution is that we have 2 estimated values for punishment sensitivity that are equal to or are really close to 8, which was the upper threshold for the optimizing function. This could mean that a better fitting value lies beyond the threshold value.

## Model Comparison

By this point we have found the set of parameters for a model that fit data quite well, but what if our model is wrong? MLE will always return a set of best-fitting parameters, but if the chosen model doesn’t represent the actual mechanism of subjects’ decision making, the results are unreliable. One can deal with such an issue by trying out several models and comparing them.

What if our subjects don’t really have the “competing” sides of positive and negative feedback and the learning rate and sensitivity are the same for both types of an outcome? We will also fit model to data and call this model “No Competition”. Another possible scenario is that subjects don’t update the action values for non-selected actions. We will call this model “No update”. The model that we have fit before will be called “Full Model”.

Now, in order to check these models, we just have to change the log likelihood function that we want to optimize. Once we find the likelihoods of the model, we can compare the models. There are several ways of doing it, but the most common ones are with the help of Akaike information criterion (AIC) or Bayesian information criterion (BIC):

$$AIC = 2k - 2 \text{ln} (\mathcal{\hat{L}})$$

$$BIC = k \ \text{ln}(n) - 2 \text{ln} (\mathcal{\hat{L}})$$

• $k$ - number of estimated parameters in the model
• $\mathcal{\hat{L}}$ - maximum value of the likelihood function for the model
• $n$ - the number of observations (number of trials in our case)

The first part for both equations (before the minus sign) represents the model complexity, whereas the second part (after the minus sign) represents the model performance. Basically, both criteria are looking for a better performing and less complex model. In practice, the lower the values, the better.

Model algorithm for “No Competition” model:

• for each step $t$ in episode do
• Select action $a_t$ using softmax policy $\scriptsize P(a_j = a_t) = \frac{e^{Q_j}}{\sum_{j}e^{Q_j}}$
• Observe reward $r_t$ and punishment $p_t$ values. Outcome $\scriptsize o_t = r_t + p_t$
• for each action $a_j$ in possible actions do
• if $a_j$ == $a_t$ do
• $\scriptsize Q(a_j)_{t+1} \leftarrow Q(a_j)_t + \alpha \left( S \cdot o_t - Q(a_j)_t \right)$
• else
• $\scriptsize Q(a_j)_{t+1} \leftarrow Q(a_j)_t + \alpha \left( - Q(a_j)_t \right)$
• end
• end
• end

Model algorithm for “No Update” model:

• for each step $t$ in episode do
• $\scriptsize Q(a_j)_t \leftarrow Q(a_j)^{\text{reward}}_t + Q(a_j)^{\text{punishment}}_t$
• Select action $a_t$ using softmax policy $\scriptsize P(a_j = a_t) = \frac{e^{Q_j}}{\sum_{j}e^{Q_j}}$
• Observe reward $r_t$ and punishment $p_t$ values
• $\scriptsize Q(a_t)^{\text{reward}}_{t+1} - Q(a_t)^{\text{reward}}_t + \alpha^{\text{reward}} \left( R \cdot r_t - Q(a_t)^{\text{reward}}_t \right)$
• $\scriptsize Q(a_t)^{\text{punishment}}_{t+1} \leftarrow Q(a_t)^{\text{punishment}}_t + \alpha^{\text{punishment}} \left( P \cdot p_t - Q(a_t)^{\text{punishment}}_t \right)$
• end
Code

def log_lik_no_compet(x, *args):
"""
Sum of negative log likelihoods over all trials for "No Competition" model.

Arguments
----------
x : tuple, list
Parameters to estimate by MLE fit.

args: tuple, list
Input parameters for the model.

Returns
----------
Sum of negative log likelihoods.
"""

alpha, outcome_sens = x # parameters to estimate
n_arms, actions, gain, loss = args # input values

Qs = np.ones(shape=(n_arms,))
Qs /= n_arms

log_prob_actions = np.zeros(shape=(len(actions), n_arms))

for i, (a, g, l) in enumerate(zip(actions, gain, loss)):

r = g + l

a_left = list(range(n_arms)) # actions that were not selected
a_left.remove(a)

log_prob_action = log_softmax(Qs)
log_prob_actions[i] = log_prob_action[a]

Qs[a] += alpha * (outcome_sens*r - Qs[a])
for a_l in a_left:
Qs[a_l] += alpha*(-Qs[a_l])

return -np.sum(log_prob_actions)

def log_lik_no_update(x, *args):
"""
Sum of negative log likelihoods over all trials for "No Update" model.

Arguments
----------
x : tuple, list
Parameters to estimate by MLE fit.

args: tuple, list
Input parameters for the model.

Returns
----------
Sum of negative log likelihoods.
"""

alpha_rew, alpha_pun, rew_sens, pun_sens = x # parameters to estimate
n_arms, actions, gain, loss = args # input values

Q_rew = np.ones(shape=(n_arms,))
Q_rew /= n_arms

Q_pun = np.ones(shape=(n_arms,))
Q_pun /= n_arms

log_prob_actions = np.zeros(shape=(len(actions), n_arms))

for i, (a, g, l) in enumerate(zip(actions, gain, loss)):

log_prob_action = log_softmax(Q_rew + Q_pun)
log_prob_actions[i] = log_prob_action[a]

Q_rew[a] += alpha_rew * (rew_sens*g - Q_rew[a])
Q_pun[a] += alpha_pun * (pun_sens*l - Q_pun[a])

return -np.sum(log_prob_actions)

def aic(k, L):
"""
Get the value of Akaike information criterion (AIC).

Arguments
----------
k : int
Number of estimated parameters in the model

L : float
Logarithm of a maximum value of the likelihood function for the model

Returns
----------
AIC value
"""

return 2*k - 2*L

def bic(k, n, L):
"""
Get the value of Bayesian information criterion (BIC).

Arguments
----------
k : int
Number of estimated parameters in the model.

n : int
The number of observations.

L : float
Logarithm of a maximum value of the likelihood function for the model

Returns
----------
BIC value
"""

return k*np.log(n) - 2*L


For both new models, we will also fit 10 different initial parameters and select the optimal ones before comparing models between each other.

Code

all_mle_params_no_compet = []

for i in range(10):
# generate random initial values
np.random.seed(i)
alpha0 = round(np.random.random(), 2)
sens0 = np.random.randint(low=0, high=80) / 10
x0 = [alpha0, sens0]

mle_output = mle_fit(
x0=x0, fun=log_lik_no_compet,
params=['alpha', 'outcome_sens'],
bounds=((0, 1), (0, 8)))

all_mle_params_no_compet.append(mle_output)

selected_params_no_compet = best_params(all_mle_params_no_compet)

Fit with x0: [0.55, 6.4]: 100%|██████████| 20/20 [00:12<00:00,  1.61it/s]
Fit with x0: [0.42, 1.2]: 100%|██████████| 20/20 [00:16<00:00,  1.18it/s]
Fit with x0: [0.44, 7.2]: 100%|██████████| 20/20 [00:09<00:00,  2.08it/s]
Fit with x0: [0.55, 0.3]: 100%|██████████| 20/20 [00:20<00:00,  1.01s/it]
Fit with x0: [0.97, 5.5]: 100%|██████████| 20/20 [00:09<00:00,  2.16it/s]
Fit with x0: [0.22, 6.1]: 100%|██████████| 20/20 [00:12<00:00,  1.63it/s]
Fit with x0: [0.89, 7.9]: 100%|██████████| 20/20 [00:08<00:00,  2.29it/s]
Fit with x0: [0.08, 2.5]: 100%|██████████| 20/20 [00:14<00:00,  1.39it/s]
Fit with x0: [0.87, 0.5]: 100%|██████████| 20/20 [00:19<00:00,  1.01it/s]
Fit with x0: [0.01, 5.4]: 100%|██████████| 20/20 [00:24<00:00,  1.24s/it]

all_mle_params_no_update = []

for i in range(10):
# generate random initial values
np.random.seed(i*10)
alpha0 = list(np.random.random(size=(2,)).round(2))
sens0 = list(np.random.randint(low=0, high=80, size=(2,)) / 10)
x0 = alpha0 + sens0

mle_output = mle_fit(
x0=x0, fun=log_lik_no_update,
params=params,
bounds=((0, 1), (0, 1), (0, 8), (0, 8)))

all_mle_params_no_update.append(mle_output)

selected_params_no_update = best_params(all_mle_params_no_update)

Fit with x0: [0.55, 0.72, 6.7, 6.7]: 100%|██████████| 20/20 [00:18<00:00,  1.10it/s]
Fit with x0: [0.77, 0.02, 6.4, 2.8]: 100%|██████████| 20/20 [00:15<00:00,  1.28it/s]
Fit with x0: [0.59, 0.9, 2.8, 0.9]: 100%|██████████| 20/20 [00:25<00:00,  1.27s/it]
Fit with x0: [0.64, 0.38, 1.2, 2.3]: 100%|██████████| 20/20 [00:35<00:00,  1.79s/it]
Fit with x0: [0.41, 0.06, 5.6, 5.0]: 100%|██████████| 20/20 [00:22<00:00,  1.10s/it]
Fit with x0: [0.49, 0.23, 3.3, 0.4]: 100%|██████████| 20/20 [00:32<00:00,  1.60s/it]
Fit with x0: [0.3, 0.19, 1.5, 7.2]: 100%|██████████| 20/20 [00:44<00:00,  2.21s/it]
Fit with x0: [0.93, 0.87, 6.0, 5.9]: 100%|██████████| 20/20 [00:17<00:00,  1.15it/s]
Fit with x0: [0.52, 0.7, 1.0, 5.0]: 100%|██████████| 20/20 [00:45<00:00,  2.26s/it]
Fit with x0: [0.15, 0.16, 6.7, 3.9]: 100%|██████████| 20/20 [00:46<00:00,  2.34s/it]

# combine selected value for each model in one list
all_models = [selected_params, selected_params_no_update, selected_params_no_compet]
model_names = ['Full Model', 'No Update', 'No Competition']

metrics = {}
# dictionary will look like
# {<subject>:
#     {<Model Name>:
#         {'AIC': <value>},
#         {'BIC': <value>}
#     }
# }

for subj in itertools.chain(*ids.values()):

metrics[subj] = {}

for (i,m) in enumerate(all_models):

metrics[subj][model_names[i]] = {}

metrics[subj][model_names[i]]['AIC'] = int(aic(
k=len(m[subj])-1,
L=-m[subj]['likelihood']))

metrics[subj][model_names[i]]['BIC'] = int(bic(
k=len(m[subj])-1,
n=n_trials,
L=-m[subj]['likelihood']))

# convert dictionary to data frame for better visualization
metrics_df = pd.DataFrame({})

for (i, subj) in enumerate(itertools.chain(*ids.values())):
temp = pd.DataFrame(metrics[subj]).reset_index(drop=False)
temp.rename(columns={'index': 'metric'}, inplace=True)
temp['subjID'] = subj
metrics_df = metrics_df.append(temp)

metrics_df.set_index(['subjID', 'metric']).style.highlight_min(
axis=1,
props='color:black; font-weight:bold; background-color:lightgreen;')


Table 2: Model comparison for all subjects
Full ModelNo UpdateNo Competition
subjIDmetric
1AIC169618071692
BIC170918211699
2AIC125714791281
BIC127014921288
3AIC104412621054
BIC105712751060
4AIC180219221806
BIC181619351812
5AIC8631046882
BIC8761059889
6AIC9241058930
BIC9371072936
7AIC147017331516
BIC148317471523
8AIC109712531098
BIC111012661105
9AIC106614721089
BIC107914851095
10AIC171018211749
BIC172418341756
11AIC198721472064
BIC200021602070
12AIC180518741857
BIC181818871864
13AIC196721001988
BIC198021131994
14AIC145515721489
BIC146815851495
15AIC99811351042
BIC101111481049
16AIC197520502035
BIC198820642042
17AIC127214101307
BIC128514241314
18AIC189720391927
BIC191120521933
19AIC188919321894
BIC190219461901
20AIC162717241658
BIC164017371664

As one can see from the results table, BIC and AIC values don’t always agree. And even though “No Competition” model showed lower BIC and AIC values for some of the subjects, we will select the “Full Model” as a winning one, since it fits better for the majority of subjects.

### Comparison to the Random Choice Model

One can also check whether the subject performed actions at random with the help of “classical” statistical approaches, such as Chi-square test.

Our null hypothesis is that actions were selected at random. If this was true in our case, then for 200 trials, we would see 50 selections of each option. An alternative hypothesis is that number of selections of each option is different from 50. Significance level $\alpha$ is going to be 0.05. We will perform the Chi-square test for each subject separately.

Code

alpha = 0.05
chisq_results = {}

for subj in itertools.chain(*ids.values()):
# get the list of observed counts of choices
obs = agent_data[agent_data['subjID'] == subj]['choice']\
.value_counts()\
.sort_index()\
.to_list()
chisq_results[subj] = chisquare(f_obs=obs)[1]

if np.sum(np.array(list(chisq_results.values())) < alpha) == n_subj*2:
print('Rejecting the null hypothesis for all subjects.')
else:
print('Some subject may have chosen actions at random.')

Rejecting the null hypothesis for all subjects.


Results above show that we were able to reject the null hypothesis for all the subjects, meaning that the counts of choices are not equal (or in other words, choices were not selected at random) at the significance level of 0.05.

## Model Validation

After we have selected the model, we could proceed to the model validation. In this step, we try to replicate the original data with the selected parameters and check whether they fit. We will use the same function generate_agent_data that we used to generate the initial set using the estimated values from MLE results and see how artificial data differs from the original one. As a measure of comparison, we can check the ratio of selection of 4-th window (the one with the highest reward probability and lowest punishment probability). This will allow us to indirectly compare the action value functions (if estimated parameters were close to the real parameters, the action value function over the trials would be approximately the same and thus the probability of choosing each option would also be roughly the same).

Code

# data generation
sim_data = pd.DataFrame()

for subj in itertools.chain(*ids.values()):

actions, gain, loss, Qs = generate_agent_data(
alpha_rew=selected_params[subj]['alpha_rew'],
alpha_pun=selected_params[subj]['alpha_pun'],
rew_sens=selected_params[subj]['R'],
pun_sens=selected_params[subj]['P'],
rew_prob=rew_probs,
pun_prob=pun_probs,
rew_vals=rew_vals,
n_trials=n_trials)

temp_df = pd.DataFrame({
'subjID': subj, 'choice': actions+1, 'gain': gain, 'loss': loss})

sim_data = sim_data.append(temp_df)

plt.figure(figsize=(10,20))
for (j, subj) in enumerate(itertools.chain(*ids.values())):
cumavg_true = []
cumavg_sim = []

plt.subplot(10, 2, j+1)
for i in range(n_trials):
cumavg_true.append(np.mean(agent_data[agent_data['subjID'] == subj]['choice'][:i+1] == 4))
cumavg_sim.append(np.mean(sim_data[sim_data['subjID'] == subj]['choice'][:i+1] == 4))

plt.plot(cumavg_true, color='blue', label='True')
plt.plot(cumavg_sim, color='red', label='Simulated')
plt.legend()
plt.title(f'Subject {subj}', fontweight='bold')
plt.ylim([0,1])
if (j+2) % 2 == 0: # for odd indeces
plt.ylabel('$P(a_t = a_4)$')

if j in [18, 19]:
plt.xlabel('Trial', fontweight='bold')

plt.suptitle('Probability of choosing the 4th arm', y = 1.01, fontsize=15, fontweight='bold')
plt.tight_layout()
plt.show()


Probabilities (of ratio) of selection of the 4th option don’t always perfectly match between true and simulated values, but it’s fairly close. We can make a conclusion, that model captured the real behavior pretty well.

## Groups Comparison

Once we have selected the model and confirmed, that it can reproduce the real behavioral data, we can come to the main research question we had in the beginning: does treatment change the reward sensitivity and learning rate for a positive outcome?

We have our selected parameters for each subject, now it comes again to the statistical analysis. We will use a two-sample $t$-test to see whether there is a significant difference between the samples of parameters at the significance level $\alpha$ of 0.05.

Results

Code

for param in params:
print(f"==Results for {param_names[param]}==")
ttest_res = pg.ttest(
x=mle_params_df[mle_params_df['group'] == 'Control'][param],
y=mle_params_df[mle_params_df['group'] == 'Treatment'][param])
display(ttest_res)
print()


Results for learning rate (reward):

Tdofalternativep-valCI95%cohen-dBF10power
T-test4.18835218two-sided0.000552[0.1, 0.3]1.87308848.2590.977033

Results for learning rate (punishment):

Tdofalternativep-valCI95%cohen-dBF10power
T-test0.16941818two-sided0.867357[-0.1, 0.12]0.0757660.4010.052959

Results for reward sensitivity:

Tdofalternativep-valCI95%cohen-dBF10power
T-test1.98103318two-sided0.063076[-0.05, 1.79]0.8859451.4630.466243

Results for punishment sensitivity:

Tdofalternativep-valCI95%cohen-dBF10power
T-test-0.75108918two-sided0.462306[-1.77, 0.84]0.3358970.4850.109741

$t$-test results showed that there is only a significant difference between the learning rate for a reward with the large effect size. One could argue that difference for reward sensitivity is also statistically different with the medium effect size.

Another general issue with this approach is that each estimated value of the parameter has its own value of uncertainty (which can be estimated using the inverse of Hessian matrix), but by doing the $t$-test (or another alternative test) we are losing this uncertainty and use the values as if they were fixed. Bayesian approach doesn’t have this issue and we can see it in the 2nd part of the article.

## Summary and Limitations

We have looked at how one can fit RL model to the behavioral data using MLE approach and select the best fitting values. Then we have introduced the idea of models comparison and model validation. Lastly, we have used statistical analysis to answer the research question using selected model parameters.

However, there are several limitations of this review. First of all, data was artificially generated and parameters were selected for simplicity from a given range. Real behavioral data can be much noisier.

Secondly, it is worth mentioning, that the initial data was generated with the research question in mind (cheating) to make it more exciting, that’s why the results came up significantly different. Although it doesn’t change the modeling and analysis steps.

►Computational Models of Behavior, part 2: Bayesian Inference

## References

1. Schultz W. (1998). Predictive reward signal of dopamine neurons. Journal of neurophysiology, 80(1), 1–27. https://doi.org/10.1152/jn.1998.80.1.1 ↩︎

2. Lockwood, P. L., & Klein-Flügge, M. C. (2021). Computational modelling of social cognition and behaviour-a reinforcement learning primer. Social cognitive and affective neuroscience, 16(8), 761–771. https://doi.org/10.1093/scan/nsaa040 ↩︎

3. Daw, N. (2011). Trial-by-trial data analysis using computational models. Affect, Learning and Decision Making, Attention and Performance XXIII. 23. https://doi.org/10.1093/acprof:oso/9780199600434.003.0001 ↩︎

4. Wilson, R. C., & Collins, A. G. (2019). Ten simple rules for the computational modeling of behavioral data. eLife, 8, e49547. https://doi.org/10.7554/eLife.49547 ↩︎

5. Seymour, B., Daw, N. D., Roiser, J. P., Dayan, P., & Dolan, R. (2012). Serotonin selectively modulates reward value in human decision-making. The Journal of neuroscience : the official journal of the Society for Neuroscience, 32(17), 5833–5842. https://doi.org/10.1523/JNEUROSCI.0053-12.2012 ↩︎

##### Ruslan Klymentiev
###### Data-something-scientist

Trying to do useful things with the help of data and math