Computational Models of Behavior, part 1
Introduction
It is clear that neuroscience is shaping the artificial intelligence (AI) field by supplying new research questions and ideas. At the same time, AI has a huge impact on neuroscience. Over the last decades, reinforcement learning (RL) algorithms have been particularly useful in attempts to understand how humans and animals make decisions: they provide a way to quantify behavioral data with a small set of parameters, which can then be correlated with neuronal data. For example, the temporal difference (TD) learning model highlighted the role of dopamine in decision-making in the classic work of Schultz (1998)1.
Even though RL methods have been applied in neuroscience for quite some time, some researchers may still be unsure how these methods can help answer research questions or how to actually apply modeling in practice. While there are already several comprehensive articles on how to apply RL models to decision-making data, the purpose of this post is to provide a worked example of model fitting with coding guidelines, and to compare the two main approaches: maximum likelihood estimation (MLE) and the Bayesian approach. Part 1 focuses on the MLE approach with example code in Python, whereas part 2 focuses on the Bayesian approach, written in R with the help of the hBayesDM package. Both parts should be viewed as an overview, not as a recipe for selecting a one-size-fits-all approach. The basics of the theory are explained along the way; for a more in-depth treatment the reader may want to refer to Lockwood et al. (2021)2, Daw (2011)3 and Wilson et al. (2019)4.
Objectives:
- How to answer research questions using modeling.
- How to fit an RL model to behavioral data using MLE.
- How to compare models and choose the best fitting one.
- How to generate data using an RL model.
Code
import itertools
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import pingouin as pg
from tqdm import tqdm
from scipy.special import softmax, log_softmax
from scipy.optimize import minimize
from scipy.stats import chisquare


plt.style.use('bmh')
plt.rcParams['figure.facecolor'] = 'white'
Research question:
We hypothesize that a treatment (it could be anything: a dopamine manipulation, exposure to stress, etc.) affects humans’ reward sensitivity and the learning rate for positive outcomes (the speed of learning which choice is optimal).
Paradigm:
We could test this using the so-called multi-armed bandit (MAB) paradigm, with a slight modification: in our case, each “arm” can deliver both a reward and a punishment on the same trial. The outcome probabilities are stationary over time. The subject has to make a decision (choose an “arm”) on each trial. We will use a 4-armed bandit.
We will also have two groups, control (CTR) and treatment (TRT), and compare the parameters between them.
Data generation
For this article we will generate artificial data, but we will treat it as real-life data (at least most of the time). In other words, we will generate data with a known set of parameter values and then try to recover those values through modeling. On the one hand, this allows a quantitative assessment of the model fitting techniques (since we know the ground truth); on the other hand, it shows how model fitting would be performed on real data.
We will use a model adapted from Seymour et al. (2012)5. The parameters of the model are:
- learning rate for reward outcomes $\alpha^{\text{reward}}$ (how fast the agent learns from positive feedback)
- learning rate for punishment outcomes $\alpha^{\text{punishment}}$ (how fast the agent learns from negative feedback)
- reward sensitivity $R$ (how strongly the agent perceives positive feedback)
- punishment sensitivity $P$ (how strongly the agent perceives negative feedback)
Model algorithm:
- for each step $t$ in episode do
- $\scriptsize Q(a_j)_t \leftarrow Q(a_j)^{\text{reward}}_t + Q(a_j)^{\text{punishment}}_t$
- Select action $a_t$ using softmax policy $\scriptsize P(a_j = a_t) = \frac{e^{Q_j}}{\sum_{j}e^{Q_j}}$
- Observe reward $r_t$ and punishment $p_t$ values
- for each action $a_j$ in possible actions do
- if $a_j$ == $a_t$ do
- $\scriptsize Q(a_j)^{\text{reward}}_{t+1} \leftarrow Q(a_j)^{\text{reward}}_t + \alpha^{\text{reward}} \left( R \cdot r_t - Q(a_j)^{\text{reward}}_t \right)$
- $\scriptsize Q(a_j)^{\text{punishment}}_{t+1} \leftarrow Q(a_j)^{\text{punishment}}_t + \alpha^{\text{punishment}} \left( P \cdot p_t - Q(a_j)^{\text{punishment}}_t \right)$
- else
- $\scriptsize Q(a_j)^{\text{reward}}_{t+1} \leftarrow Q(a_j)^{\text{reward}}_t + \alpha^{\text{reward}} \left( - Q(a_j)^{\text{reward}}_t \right)$
- $\scriptsize Q(a_j)^{\text{punishment}}_{t+1} \leftarrow Q(a_j)^{\text{punishment}}_t + \alpha^{\text{punishment}} \left( - Q(a_j)^{\text{punishment}}_t \right)$
- end
- end
- end
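Before simulating full experiments, it can help to see a single trial of this update rule in isolation. The snippet below is a minimal worked example with made-up Q-values and parameter values; the full simulation function in the next code block is what we actually use.

```python
import numpy as np
from scipy.special import softmax

# made-up values, for illustration only
Q_rew = np.array([0.25, 0.25, 0.25, 0.25])  # reward value function
Q_pun = np.array([0.25, 0.25, 0.25, 0.25])  # punishment value function
alpha_rew, alpha_pun, R, P = 0.4, 0.3, 3.0, 5.0

prob_a = softmax(Q_rew + Q_pun)  # softmax policy over the combined Q-values
a = 3                            # suppose the 4th arm was chosen...
r, p = 1, -1                     # ...and it delivered both a reward and a punishment

# move the chosen arm towards the (scaled) outcomes
Q_rew[a] += alpha_rew * (R * r - Q_rew[a])
Q_pun[a] += alpha_pun * (P * p - Q_pun[a])

# decay the non-chosen arms towards zero
for a_l in [j for j in range(4) if j != a]:
    Q_rew[a_l] += alpha_rew * (-Q_rew[a_l])
    Q_pun[a_l] += alpha_pun * (-Q_pun[a_l])

print(prob_a)          # action probabilities before the update
print(Q_rew + Q_pun)   # combined value function after the update
```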
Code
def generate_agent_data(
        alpha_rew, alpha_pun, rew_sens,
        pun_sens, rew_prob, pun_prob, rew_vals,
        n_arms=4, n_trials=100):
    """
    Simulate agent data for N-armed bandit task.

    Arguments
    ----------
    alpha_rew : float
        Learning rate for rewarding trials. Should be in a (0, 1) range.

    alpha_pun : float
        Learning rate for punishing trials. Should be in a (0, 1) range.

    rew_sens : float
        Reward sensitivity (R)

    pun_sens : float
        Punishment sensitivity (P)

    rew_prob : array_like
        Probability of reward for each arm.

    pun_prob : array_like
        Probability of punishment for each arm.

    rew_vals : array_like
        Values of reward (first element) and punishment (second element).
        For example [100, -50] means that agent receives 100 a.u. during
        the reward and loses 50 a.u. during the punishment.

    n_arms : int, default=4
        Number of arms in N-armed bandit task.

    n_trials : int, default=100
        Number of simulated trials for an agent

    Returns
    ----------
    actions : array of shape (n_trials,)
        Selected action by the agent at each trial.

    gains : array of shape (n_trials,)
        Agent's reward value at each trial.

    losses : array of shape (n_trials,)
        Agent's punishment value at each trial.

    Qs : array of shape (n_trials, n_arms)
        Value function for each arm at each trial.
    """

    actions = np.zeros(shape=(n_trials,), dtype=np.int32)
    gains = np.zeros(shape=(n_trials,), dtype=np.int32)
    losses = np.zeros(shape=(n_trials,), dtype=np.int32)
    Qs = np.zeros(shape=(n_trials, n_arms))

    # initial uniform values for both reward and punishment value function
    Q_rew = np.ones(shape=(n_arms,))
    Q_rew /= n_arms

    Q_pun = np.ones(shape=(n_arms,))
    Q_pun /= n_arms

    Qs[0] = Q_rew + Q_pun

    for i in range(n_trials):

        # choose the action based on the softmax function
        prob_a = softmax(Qs[i])
        a = np.random.choice(a=range(n_arms), p=prob_a)  # select the action
        # list of actions that were not selected
        a_left = list(range(n_arms))
        a_left.remove(a)

        # reward
        if np.random.rand() < rew_prob[a]:  # if arm brings reward
            r = rew_vals[0]
        else:
            r = 0

        gains[i] = r
        # value function update for a chosen arm
        Q_rew[a] += alpha_rew * (rew_sens*r - Q_rew[a])
        # value function update for non-chosen arms
        for a_l in a_left:
            Q_rew[a_l] += alpha_rew*(-Q_rew[a_l])

        # punishment
        if np.random.rand() < pun_prob[a]:  # if arm brings punishment
            r = rew_vals[1]
        else:
            r = 0

        losses[i] = r
        Q_pun[a] += alpha_pun * (pun_sens*r - Q_pun[a])
        for a_l in a_left:
            Q_pun[a_l] += alpha_pun*(-Q_pun[a_l])

        # save the records
        actions[i] = a

        if i < n_trials-1:
            Qs[i+1] = Q_rew + Q_pun

    return actions, gains, losses, Qs
Initial parameters:
- Probabilities of reward for each option are: 10%, 30%, 60% and 90%.
- Probabilities of punishment for each option are: 90%, 60%, 30% and 10%.
- Value of the reward is $+1$, value of the punishment is $-1$.
- Each group (CTR and TRT) will consist of 10 subjects.
- Each subject will perform 200 trials.
We can see that an “ideal” agent would learn that the 4th option is the best choice for maximizing the total reward, since it has the highest probability of reward and the lowest probability of punishment.
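As a quick sanity check of that claim, we can compute the expected net outcome of each arm directly from these probabilities (a small illustrative snippet; the same arrays are defined again in the code block below):

```python
import numpy as np

rew_probs = np.array([.1, .3, .6, .9])
pun_probs = np.array([.9, .6, .3, .1])
rew_vals = np.array([1, -1])

# expected net outcome per arm: P(reward) * (+1) + P(punishment) * (-1)
expected_outcome = rew_probs * rew_vals[0] + pun_probs * rew_vals[1]
print(expected_outcome)  # [-0.8 -0.3  0.3  0.8] -> the 4th arm is the best on average
```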
Code
# initial parameters
rew_probs = np.array([.1, .3, .6, .9])  # probability of reward of each arm
pun_probs = np.array([.9, .6, .3, .1])  # probability of punishment of each arm
rew_vals = np.array([1, -1])  # reward = 1, punishment = -1
n_subj = 10  # n subjects in each group
n_arms = 4
n_trials = 200

# ids of agents
ids = {'Control': list(range(1, 11)),
       'Treatment': list(range(11, 21))}

params = ['alpha_rew', 'alpha_pun', 'R', 'P']
param_names = {
    'alpha_rew': 'Learning rate (reward)',
    'alpha_pun': 'Learning rate (punishment)',
    'R': 'Reward sensitivity',
    'P': 'Punishment sensitivity'}

# data generation
np.random.seed(1)
agent_data = pd.DataFrame()
true_params = {}

for subj in itertools.chain(*ids.values()):

    true_params[subj] = {}

    if subj in ids['Control']:
        # control group
        group = 'Control'
        alpha_rew = np.random.randint(low=30, high=50) / 100  # RV in [0.3, 0.5]
        alpha_pun = np.random.randint(low=20, high=40) / 100  # RV in [0.2, 0.4]
        rew_sens = np.random.randint(low=20, high=50) / 10  # RV in [2., 5.]
        pun_sens = np.random.randint(low=30, high=70) / 10  # RV in [3., 7.]
    elif subj in ids['Treatment']:
        # treatment group
        group = 'Treatment'
        alpha_rew = np.random.randint(low=10, high=25) / 100  # RV in [0.1, 0.25]
        alpha_pun = np.random.randint(low=20, high=40) / 100  # RV in [0.2, 0.4]
        rew_sens = np.random.randint(low=15, high=50) / 10  # RV in [1.5, 5.]
        pun_sens = np.random.randint(low=30, high=70) / 10  # RV in [3., 7.]
    else:
        raise ValueError(f'No subject with id {subj} found!')

    actions, gain, loss, Qs = generate_agent_data(
        alpha_rew=alpha_rew,
        alpha_pun=alpha_pun,
        rew_sens=rew_sens,
        pun_sens=pun_sens,
        rew_prob=rew_probs,
        pun_prob=pun_probs,
        rew_vals=rew_vals,
        n_trials=n_trials)

    # add generated data to a temporary data frame
    temp_df = pd.DataFrame({
        'group': group, 'subjID': subj, 'trial': range(1, n_trials+1),
        'choice': actions+1, 'gain': gain, 'loss': loss})

    # DataFrame.append was removed in recent pandas versions, so use pd.concat
    agent_data = pd.concat([agent_data, temp_df])

    # update dictionary with true parameter values
    true_params[subj]['alpha_rew'] = alpha_rew
    true_params[subj]['alpha_pun'] = alpha_pun
    true_params[subj]['R'] = rew_sens
    true_params[subj]['P'] = pun_sens
Sample of generated data:
| | group | subjID | trial | choice | gain | loss |
|---|---|---|---|---|---|---|
| 0 | Control | 1 | 1 | 1 | 0 | -1 |
| 1 | Control | 1 | 2 | 1 | 0 | -1 |
| 2 | Control | 1 | 3 | 3 | 1 | 0 |
| 3 | Control | 1 | 4 | 3 | 1 | 0 |
| 4 | Control | 1 | 5 | 1 | 0 | -1 |
| 5 | Control | 1 | 6 | 3 | 1 | -1 |
| 6 | Control | 1 | 7 | 4 | 0 | 0 |
| 7 | Control | 1 | 8 | 3 | 0 | 0 |
| 8 | Control | 1 | 9 | 1 | 1 | -1 |
| 9 | Control | 1 | 10 | 4 | 1 | 0 |
Even though these data are simulated, they have exactly the same structure as data from a real behavioral experiment. Each row in the dataset represents one trial for a particular subject: the `choice` column holds the option chosen, while `gain` and `loss` hold the reward and punishment values for that trial. The full generated data set can be found here.
Code
true_params_df = pd.DataFrame(true_params).T.reset_index(drop=False)
true_params_df.rename(columns={'index': 'subjID'}, inplace=True)
true_params_df['group'] = true_params_df['subjID'].apply(
    lambda x: 'Control' if x in ids['Control'] else (
        'Treatment' if x in ids['Treatment'] else 'Undefined'
    )
)

# plot boxplots with true parameter values for each group
plt.figure(figsize=(11,6))
for (i, param) in enumerate(params):
    plt.subplot(2,2,i+1)
    sns.boxplot(
        y=param, x='group',
        data=true_params_df, linewidth=3,
        palette='Set3')
    sns.stripplot(
        y=param, x='group', data=true_params_df,
        jitter=0.1, linewidth=2, palette='Set3')
    if i < 2:
        plt.xlabel('')
        plt.ylim([0, .55])  # use the same scale for each parameter type
    else:
        plt.xlabel('Group', fontweight='bold')
        plt.ylim([1, 8])

    plt.ylabel('')
    plt.title(param_names[param], fontweight='bold')

plt.suptitle('True parameters of agents', fontsize=15, fontweight='bold')
plt.tight_layout()
plt.show()
This is how the true parameters are distributed. Visually, it looks like the TRT group has, on average, a lower learning rate for positive feedback than the CTR group.
Maximum Likelihood Estimation
Now we want to recover the subjects’ parameters and check whether our hypothesis holds, i.e., whether the treatment affects the learning rate for positive feedback and the sensitivity to reward. As mentioned before, there are two main ways to do this: fitting the model with MLE or with a Bayesian approach. Here we will focus on MLE.
The idea of MLE is fairly simple, although the math under the hood is a bit more sophisticated. We want to find the set of parameters $\hat{\Theta} = \{ \alpha^{\text{reward}}, \alpha^{\text{punishment}}, R, P \}$ that is most likely to have produced the data. In a simplified version, one could start by choosing 4 random values for $\hat{\Theta}$, plugging them into the algorithm described in the data generation part, and answering the question “What is the probability that these actions were selected given these four parameters?” (in notation, $P(\text{data}|\hat{\Theta})$). One would then change the values and answer the same question for the new set. After $n$ such runs, the winning set of parameters is the one that gives the highest probability of the observed data; in other words, it is the maximum likelihood estimate.
Of course, no one would do this by hand; we can use function optimization techniques instead. In Python we can use the `minimize` function from the SciPy package to find the estimates with the help of the limited-memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS-B) algorithm, a gradient-based quasi-Newton optimization method.
Since `minimize` searches for a minimum, we will flip the sign of the likelihood in our function, so that minimizing the returned value is equivalent to maximizing the likelihood. We will also work with the logarithm of the likelihood rather than the likelihood itself, which is numerically more stable.
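As a toy illustration of this recipe (unrelated to the bandit model itself), here is how one could recover the bias of a coin by minimizing a negative log likelihood with `minimize`; the very same logic is applied to the full RL model in the code below:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
flips = rng.random(1000) < 0.7  # 1000 tosses of a coin with P(heads) = 0.7


def coin_neg_log_lik(x, data):
    p = x[0]
    # negative log likelihood of a Bernoulli sample
    return -np.sum(data * np.log(p) + (1 - data) * np.log(1 - p))


res = minimize(
    fun=coin_neg_log_lik, x0=[0.5],
    method='L-BFGS-B',
    args=(flips,),
    bounds=[(1e-6, 1 - 1e-6)])

print(res.x)  # should be close to 0.7
```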
Code
def log_lik(x, *args):
    """
    Sum of negative log likelihoods over all trials.

    Arguments
    ----------
    x : tuple, list
        Parameters to estimate by MLE fit.

    args: tuple, list
        Input parameters for the model.

    Returns
    ----------
    Sum of negative log likelihoods.
    """

    alpha_rew, alpha_pun, rew_sens, pun_sens = x  # parameters to estimate
    n_arms, actions, gain, loss = args  # input values

    Q_rew = np.ones(shape=(n_arms,))
    Q_rew /= n_arms

    Q_pun = np.ones(shape=(n_arms,))
    Q_pun /= n_arms

    log_prob_actions = np.zeros(shape=(len(actions), n_arms))

    for i, (a, g, l) in enumerate(zip(actions, gain, loss)):

        a_left = list(range(n_arms))  # actions that were not selected
        a_left.remove(a)

        log_prob_action = log_softmax(Q_rew + Q_pun)
        log_prob_actions[i] = log_prob_action[a]

        Q_rew[a] += alpha_rew * (rew_sens*g - Q_rew[a])
        for a_l in a_left:
            Q_rew[a_l] += alpha_rew*(-Q_rew[a_l])

        Q_pun[a] += alpha_pun * (pun_sens*l - Q_pun[a])
        for a_l in a_left:
            Q_pun[a_l] += alpha_pun*(-Q_pun[a_l])

    return -np.sum(log_prob_actions)


def mle_fit(x0, fun, params, bounds):
    """
    Fit model parameters using MLE approach.

    Arguments
    ----------
    x0 : array-like
        Initial value for each of the parameters.

    fun : function
        Function name to use for optimization.

    params : list
        List of parameter names.

    bounds : tuple
        Tuple of the lower and upper bounds for parameters.

    Returns
    ----------
    output : dict
        Dictionary with the estimated parameters for each agent.

    Example call
    ----------
    mle_fit(
        x0=[0.5, 2], fun=log_lik,
        params=['alpha', 'beta'],
        bounds=((0.1, 1), (0, 10)))
    """

    output = {}

    # note: agent_data and n_arms are taken from the global scope
    for subj in tqdm(agent_data['subjID'].unique(), desc=f'Fit with x0: {x0}'):

        output[subj] = {}
        cond = agent_data['subjID'] == subj
        actions = (agent_data[cond]['choice'] - 1).to_list()
        gain = agent_data[cond]['gain'].to_list()
        loss = agent_data[cond]['loss'].to_list()

        result = minimize(
            fun=fun, x0=x0,
            method='L-BFGS-B',  # limited-memory Broyden–Fletcher–Goldfarb–Shanno algorithm
            args=(n_arms, actions, gain, loss),
            bounds=bounds)

        for (j, param) in enumerate(params):
            output[subj][param] = result.x[j]
        output[subj]['likelihood'] = result.fun

    return output


def best_params(all_params_list):
    """
    Find the best fitting parameters from a set of parameters.

    Arguments
    ----------
    all_params_list : list
        List with the dictionaries of fitted values.
        Each dictionary should look as follows:
        {<subject>: {<parameter1>: <value>, ..., <likelihood>: <value>}}

    Returns
    ----------
    selected_params : dict
        Dictionary with the best fitted values.
    """

    # dictionary {<subject>: [<likelihood1>, <likelihood2>, ...]}
    likelihoods = {}
    for subj in itertools.chain(*ids.values()):
        likelihoods[subj] = []
        for param_set in all_params_list:
            likelihoods[subj].append(param_set[subj]['likelihood'])

    # select the lowest negative log likelihood among the given fits
    # {<subject>: <index of a set with the lowest negative log likelihood>}
    selected_set = {subj: np.argmin(lik) for (subj, lik) in likelihoods.items()}

    # new set of parameters for each subject with best fitted values
    selected_params = {}
    for subj in selected_set.keys():
        selected_params[subj] = all_params_list[selected_set[subj]][subj]

    return selected_params
Different starting values
Before we start fitting, there is one big issue with MLE that we should be aware of: the algorithm can get stuck in a local minimum, meaning that the returned set of parameters is not optimal over the entire parameter space. One way to deal with this problem is to use several starting points. Here we will run MLE from 10 random starting values and, for each subject, choose the set that gives the lowest negative log likelihood.
We will also use bounds for the parameters (learning rates should lie in the range $[0, 1]$ and sensitivities in the range $[0, 8]$), although Daw (2011)3 argues against doing so.
Code
all_mle_params = []

for i in range(10):
    # generate random initial values
    np.random.seed(i)
    alpha0 = list(np.random.random(size=(2,)).round(2))
    sens0 = list(np.random.randint(low=0, high=80, size=(2,)) / 10)
    x0 = alpha0 + sens0

    mle_output = mle_fit(
        x0=x0, fun=log_lik,
        params=params,
        bounds=((0, 1), (0, 1), (0, 8), (0, 8)))

    all_mle_params.append(mle_output)

selected_params = best_params(all_mle_params)
Fit with x0: [0.55, 0.72, 6.7, 6.7]: 100%|██████████| 20/20 [00:38<00:00, 1.91s/it]
Fit with x0: [0.42, 0.72, 0.9, 7.5]: 100%|██████████| 20/20 [00:36<00:00, 1.81s/it]
Fit with x0: [0.44, 0.03, 2.2, 4.3]: 100%|██████████| 20/20 [00:35<00:00, 1.78s/it]
Fit with x0: [0.55, 0.71, 5.6, 7.2]: 100%|██████████| 20/20 [00:27<00:00, 1.38s/it]
Fit with x0: [0.97, 0.55, 0.1, 7.2]: 100%|██████████| 20/20 [00:45<00:00, 2.25s/it]
Fit with x0: [0.22, 0.87, 1.6, 7.3]: 100%|██████████| 20/20 [00:34<00:00, 1.74s/it]
Fit with x0: [0.89, 0.33, 7.9, 6.2]: 100%|██████████| 20/20 [00:39<00:00, 1.98s/it]
Fit with x0: [0.08, 0.78, 6.7, 2.3]: 100%|██████████| 20/20 [00:34<00:00, 1.75s/it]
Fit with x0: [0.87, 0.97, 0.5, 0.8]: 100%|██████████| 20/20 [00:38<00:00, 1.91s/it]
Fit with x0: [0.01, 0.5, 5.4, 5.6]: 100%|██████████| 20/20 [00:41<00:00, 2.09s/it]
Once we have selected the best fitting values, the following questions may arise:
- Are we sure that the selected values represent the global minimum? We are not sure.
- Were 10 initial starting points enough? We don’t know.
- Would another optimization method find a better set of parameters? It could, but the overview of different optimization methods is beyond the scope of this article.
This may sound vague, but it is the sad truth about MLE: we are never sure that we have found the best parameters. However, by trying different methods and initial values we can get as close to them as possible.
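One quick diagnostic, sketched below using the `all_mle_params` list collected above, is to look at how much the fitted negative log likelihood varies across the 10 starting points for each subject; if the runs largely agree, the solution is at least stable:

```python
# spread of the fitted negative log likelihoods across the 10 starting points
for subj in itertools.chain(*ids.values()):
    liks = [run[subj]['likelihood'] for run in all_mle_params]
    print(f'subject {subj}: best = {min(liks):.1f}, '
          f'worst = {max(liks):.1f}, spread = {max(liks) - min(liks):.1f}')
```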
Code
# convert dict to dataframe
mle_params_df = pd.DataFrame(selected_params).T.reset_index(drop=False)
mle_params_df.rename(columns={'index': 'subjID'}, inplace=True)
mle_params_df['group'] = mle_params_df['subjID'].apply(
    lambda x: 'Control' if x in ids['Control'] else (
        'Treatment' if x in ids['Treatment'] else 'Undefined'
    )
)

# plot estimated parameters
plt.figure(figsize=(11,6))
for (i, param) in enumerate(params):
    plt.subplot(2,2,i+1)
    sns.boxplot(
        y=param, x='group',
        data=mle_params_df, linewidth=3,
        palette='Set3')
    sns.stripplot(
        y=param, x='group', data=mle_params_df,
        jitter=0.1, linewidth=2, palette='Set3')
    if i < 2:
        plt.xlabel('')
        plt.ylim([-0.05, 1.05])  # use the same scale for each parameter type
    else:
        plt.xlabel('Group', fontweight='bold')
        plt.ylim([0, 12])

    plt.ylabel('')
    plt.title(param_names[param], fontweight='bold')

plt.suptitle('Estimated parameters of agents (MLE)', fontsize=15, fontweight='bold')
plt.tight_layout()
plt.show()
Now, we can actually estimate how good the selected parameters are.
Code
# plot estimated parameters vs true parameters
plt.figure(figsize=(8,8))

for (i, param) in enumerate(params):

    plt.subplot(2,2,i+1)
    plt.axis('square')
    sns.scatterplot(x=mle_params_df[param], y=true_params_df[param])

    r_coef = pg.corr(true_params_df[param], mle_params_df[param])['r'][0]
    p_val = pg.corr(true_params_df[param], mle_params_df[param])['p-val'][0]

    if i < 2:
        lim = [0,1]
        plt.plot(lim, lim, ls="--", c=".3")
        plt.xlim(lim)
        plt.ylim(lim)
    else:
        lim = [0,10]
        plt.plot(lim, lim, ls="--", c=".3")
        plt.xlim(lim)
        plt.ylim(lim)

    plt.xlabel('MLE', fontweight='bold')
    plt.ylabel('True', fontweight='bold')
    plt.title(
        f'{param_names[param]}\nr = {r_coef:.2f} (p-val = {p_val:.2f})',
        fontweight='bold')

plt.tight_layout()
plt.show()
In a perfect world, all dots would lie on the diagonal line, meaning that the fitted values (on the $x$ axis) match the true values (on the $y$ axis). In our case, the MLE fit recovered the reward sensitivity values quite accurately, whereas the punishment sensitivity values are not as well aligned.
Another caveat is that two of the estimated punishment sensitivity values are equal to, or very close to, 8, which was the upper bound given to the optimizer. This could mean that a better fitting value lies beyond that bound.
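A quick way to flag such cases (a sketch based on the `mle_params_df` data frame defined above, with an arbitrary cut-off of 7.9) is to list the subjects whose sensitivity estimates sit at or very near the upper bound:

```python
# subjects whose estimated sensitivities ended up at (or very near) the upper bound of 8
at_bound = mle_params_df[(mle_params_df['R'] > 7.9) | (mle_params_df['P'] > 7.9)]
print(at_bound[['subjID', 'group', 'R', 'P']])
```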
Model Comparison
By this point we have found a set of parameters that fits the data quite well, but what if the model itself is wrong? MLE will always return a set of best-fitting parameters, but if the chosen model doesn’t represent the actual mechanism behind the subjects’ decision making, the results are unreliable. One way to deal with this issue is to try out several models and compare them.
What if our subjects don’t really have “competing” representations of positive and negative feedback, and the learning rate and sensitivity are the same for both types of outcome? We will fit this model to the data as well and call it “No Competition”. Another possible scenario is that subjects don’t update the action values of non-selected actions. We will call this model “No Update”. The model that we fit before will be called the “Full Model”.
Now, in order to check these models, we just have to change the log likelihood function that we optimize. Once we have the maximum likelihood of each model, we can compare them. There are several ways of doing this, but the most common ones use the Akaike information criterion (AIC) or the Bayesian information criterion (BIC):
$$AIC = 2k - 2 \ln (\mathcal{\hat{L}})$$

$$BIC = k \, \ln(n) - 2 \ln (\mathcal{\hat{L}})$$

- $k$ - number of estimated parameters in the model
- $\mathcal{\hat{L}}$ - maximum value of the likelihood function for the model
- $n$ - the number of observations (number of trials in our case)
The first term in both equations (before the minus sign) penalizes model complexity, whereas the second term (after the minus sign) reflects how well the model fits the data. Both criteria therefore favor a model that performs well while staying simple; in practice, the lower the value, the better.
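To get a feeling for the numbers, here is a purely illustrative calculation (the values are made up, not taken from our fits) for a model with $k = 4$ parameters, $n = 200$ trials and a maximum log likelihood of $-850$:

```python
import numpy as np

k, n, log_L = 4, 200, -850.0
aic_value = 2*k - 2*log_L          # 8 + 1700 = 1708
bic_value = k*np.log(n) - 2*log_L  # ~21.2 + 1700 = ~1721.2
print(aic_value, bic_value)        # BIC penalizes the extra parameters more heavily here
```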
Model algorithm for “No Competition” model:
- for each step $t$ in episode do
- Select action $a_t$ using softmax policy $\scriptsize P(a_j = a_t) = \frac{e^{Q_j}}{\sum_{j}e^{Q_j}}$
- Observe reward $r_t$ and punishment $p_t$ values. Outcome $\scriptsize o_t = r_t + p_t$
- for each action $a_j$ in possible actions do
- if $a_j$ == $a_t$ do
- $\scriptsize Q(a_j)_{t+1} \leftarrow Q(a_j)_t + \alpha \left( S \cdot o_t - Q(a_j)_t \right)$
- else
- $\scriptsize Q(a_j)_{t+1} \leftarrow Q(a_j)_t + \alpha \left( - Q(a_j)_t \right)$
- end
- end
- end
Model algorithm for “No Update” model:
- for each step $t$ in episode do
- $\scriptsize Q(a_j)_t \leftarrow Q(a_j)^{\text{reward}}_t + Q(a_j)^{\text{punishment}}_t$
- Select action $a_t$ using softmax policy $\scriptsize P(a_j = a_t) = \frac{e^{Q_j}}{\sum_{j}e^{Q_j}}$
- Observe reward $r_t$ and punishment $p_t$ values
- $\scriptsize Q(a_t)^{\text{reward}}_{t+1} \leftarrow Q(a_t)^{\text{reward}}_t + \alpha^{\text{reward}} \left( R \cdot r_t - Q(a_t)^{\text{reward}}_t \right)$
- $\scriptsize Q(a_t)^{\text{punishment}}_{t+1} \leftarrow Q(a_t)^{\text{punishment}}_t + \alpha^{\text{punishment}} \left( P \cdot p_t - Q(a_t)^{\text{punishment}}_t \right)$
- end
Code
def log_lik_no_compet(x, *args):
    """
    Sum of negative log likelihoods over all trials for "No Competition" model.

    Arguments
    ----------
    x : tuple, list
        Parameters to estimate by MLE fit.

    args: tuple, list
        Input parameters for the model.

    Returns
    ----------
    Sum of negative log likelihoods.
    """

    alpha, outcome_sens = x  # parameters to estimate
    n_arms, actions, gain, loss = args  # input values

    Qs = np.ones(shape=(n_arms,))
    Qs /= n_arms

    log_prob_actions = np.zeros(shape=(len(actions), n_arms))

    for i, (a, g, l) in enumerate(zip(actions, gain, loss)):

        r = g + l

        a_left = list(range(n_arms))  # actions that were not selected
        a_left.remove(a)

        log_prob_action = log_softmax(Qs)
        log_prob_actions[i] = log_prob_action[a]

        Qs[a] += alpha * (outcome_sens*r - Qs[a])
        for a_l in a_left:
            Qs[a_l] += alpha*(-Qs[a_l])

    return -np.sum(log_prob_actions)


def log_lik_no_update(x, *args):
    """
    Sum of negative log likelihoods over all trials for "No Update" model.

    Arguments
    ----------
    x : tuple, list
        Parameters to estimate by MLE fit.

    args: tuple, list
        Input parameters for the model.

    Returns
    ----------
    Sum of negative log likelihoods.
    """

    alpha_rew, alpha_pun, rew_sens, pun_sens = x  # parameters to estimate
    n_arms, actions, gain, loss = args  # input values

    Q_rew = np.ones(shape=(n_arms,))
    Q_rew /= n_arms

    Q_pun = np.ones(shape=(n_arms,))
    Q_pun /= n_arms

    log_prob_actions = np.zeros(shape=(len(actions), n_arms))

    for i, (a, g, l) in enumerate(zip(actions, gain, loss)):

        log_prob_action = log_softmax(Q_rew + Q_pun)
        log_prob_actions[i] = log_prob_action[a]

        Q_rew[a] += alpha_rew * (rew_sens*g - Q_rew[a])
        Q_pun[a] += alpha_pun * (pun_sens*l - Q_pun[a])

    return -np.sum(log_prob_actions)


def aic(k, L):
    """
    Get the value of Akaike information criterion (AIC).

    Arguments
    ----------
    k : int
        Number of estimated parameters in the model.

    L : float
        Logarithm of the maximum value of the likelihood function for the model.

    Returns
    ----------
    AIC value
    """

    return 2*k - 2*L


def bic(k, n, L):
    """
    Get the value of Bayesian information criterion (BIC).

    Arguments
    ----------
    k : int
        Number of estimated parameters in the model.

    n : int
        The number of observations.

    L : float
        Logarithm of the maximum value of the likelihood function for the model.

    Returns
    ----------
    BIC value
    """

    return k*np.log(n) - 2*L
For both new models we will also run the fit from 10 different sets of initial values and select the best-fitting parameters before comparing the models against each other.
Code
all_mle_params_no_compet = []

for i in range(10):
    # generate random initial values
    np.random.seed(i)
    alpha0 = round(np.random.random(), 2)
    sens0 = np.random.randint(low=0, high=80) / 10
    x0 = [alpha0, sens0]

    mle_output = mle_fit(
        x0=x0, fun=log_lik_no_compet,
        params=['alpha', 'outcome_sens'],
        bounds=((0, 1), (0, 8)))

    all_mle_params_no_compet.append(mle_output)

selected_params_no_compet = best_params(all_mle_params_no_compet)
Fit with x0: [0.55, 6.4]: 100%|██████████| 20/20 [00:12<00:00, 1.61it/s]
Fit with x0: [0.42, 1.2]: 100%|██████████| 20/20 [00:16<00:00, 1.18it/s]
Fit with x0: [0.44, 7.2]: 100%|██████████| 20/20 [00:09<00:00, 2.08it/s]
Fit with x0: [0.55, 0.3]: 100%|██████████| 20/20 [00:20<00:00, 1.01s/it]
Fit with x0: [0.97, 5.5]: 100%|██████████| 20/20 [00:09<00:00, 2.16it/s]
Fit with x0: [0.22, 6.1]: 100%|██████████| 20/20 [00:12<00:00, 1.63it/s]
Fit with x0: [0.89, 7.9]: 100%|██████████| 20/20 [00:08<00:00, 2.29it/s]
Fit with x0: [0.08, 2.5]: 100%|██████████| 20/20 [00:14<00:00, 1.39it/s]
Fit with x0: [0.87, 0.5]: 100%|██████████| 20/20 [00:19<00:00, 1.01it/s]
Fit with x0: [0.01, 5.4]: 100%|██████████| 20/20 [00:24<00:00, 1.24s/it]
all_mle_params_no_update = []

for i in range(10):
    # generate random initial values
    np.random.seed(i*10)
    alpha0 = list(np.random.random(size=(2,)).round(2))
    sens0 = list(np.random.randint(low=0, high=80, size=(2,)) / 10)
    x0 = alpha0 + sens0

    mle_output = mle_fit(
        x0=x0, fun=log_lik_no_update,
        params=params,
        bounds=((0, 1), (0, 1), (0, 8), (0, 8)))

    all_mle_params_no_update.append(mle_output)

selected_params_no_update = best_params(all_mle_params_no_update)
Fit with x0: [0.55, 0.72, 6.7, 6.7]: 100%|██████████| 20/20 [00:18<00:00, 1.10it/s]
Fit with x0: [0.77, 0.02, 6.4, 2.8]: 100%|██████████| 20/20 [00:15<00:00, 1.28it/s]
Fit with x0: [0.59, 0.9, 2.8, 0.9]: 100%|██████████| 20/20 [00:25<00:00, 1.27s/it]
Fit with x0: [0.64, 0.38, 1.2, 2.3]: 100%|██████████| 20/20 [00:35<00:00, 1.79s/it]
Fit with x0: [0.41, 0.06, 5.6, 5.0]: 100%|██████████| 20/20 [00:22<00:00, 1.10s/it]
Fit with x0: [0.49, 0.23, 3.3, 0.4]: 100%|██████████| 20/20 [00:32<00:00, 1.60s/it]
Fit with x0: [0.3, 0.19, 1.5, 7.2]: 100%|██████████| 20/20 [00:44<00:00, 2.21s/it]
Fit with x0: [0.93, 0.87, 6.0, 5.9]: 100%|██████████| 20/20 [00:17<00:00, 1.15it/s]
Fit with x0: [0.52, 0.7, 1.0, 5.0]: 100%|██████████| 20/20 [00:45<00:00, 2.26s/it]
Fit with x0: [0.15, 0.16, 6.7, 3.9]: 100%|██████████| 20/20 [00:46<00:00, 2.34s/it]
# combine the selected values for each model in one list
all_models = [selected_params, selected_params_no_update, selected_params_no_compet]
model_names = ['Full Model', 'No Update', 'No Competition']

metrics = {}
# dictionary will look like
# {<subject>:
#     {<Model Name>:
#         {'AIC': <value>,
#          'BIC': <value>}
#     }
# }

for subj in itertools.chain(*ids.values()):

    metrics[subj] = {}

    for (i, m) in enumerate(all_models):

        metrics[subj][model_names[i]] = {}

        metrics[subj][model_names[i]]['AIC'] = int(aic(
            k=len(m[subj])-1,
            L=-m[subj]['likelihood']))

        metrics[subj][model_names[i]]['BIC'] = int(bic(
            k=len(m[subj])-1,
            n=n_trials,
            L=-m[subj]['likelihood']))

# convert dictionary to data frame for better visualization
metrics_df = pd.DataFrame({})

for (i, subj) in enumerate(itertools.chain(*ids.values())):
    temp = pd.DataFrame(metrics[subj]).reset_index(drop=False)
    temp.rename(columns={'index': 'metric'}, inplace=True)
    temp['subjID'] = subj
    metrics_df = pd.concat([metrics_df, temp])

metrics_df.set_index(['subjID', 'metric']).style.highlight_min(
    axis=1,
    props='color:black; font-weight:bold; background-color:lightgreen;')
The lowest (best-fitting) value in each row is shown in bold:

| subjID | metric | Full Model | No Update | No Competition |
|---|---|---|---|---|
| 1 | AIC | 1696 | 1807 | **1692** |
| 1 | BIC | 1709 | 1821 | **1699** |
| 2 | AIC | **1257** | 1479 | 1281 |
| 2 | BIC | **1270** | 1492 | 1288 |
| 3 | AIC | **1044** | 1262 | 1054 |
| 3 | BIC | **1057** | 1275 | 1060 |
| 4 | AIC | **1802** | 1922 | 1806 |
| 4 | BIC | 1816 | 1935 | **1812** |
| 5 | AIC | **863** | 1046 | 882 |
| 5 | BIC | **876** | 1059 | 889 |
| 6 | AIC | **924** | 1058 | 930 |
| 6 | BIC | 937 | 1072 | **936** |
| 7 | AIC | **1470** | 1733 | 1516 |
| 7 | BIC | **1483** | 1747 | 1523 |
| 8 | AIC | **1097** | 1253 | 1098 |
| 8 | BIC | 1110 | 1266 | **1105** |
| 9 | AIC | **1066** | 1472 | 1089 |
| 9 | BIC | **1079** | 1485 | 1095 |
| 10 | AIC | **1710** | 1821 | 1749 |
| 10 | BIC | **1724** | 1834 | 1756 |
| 11 | AIC | **1987** | 2147 | 2064 |
| 11 | BIC | **2000** | 2160 | 2070 |
| 12 | AIC | **1805** | 1874 | 1857 |
| 12 | BIC | **1818** | 1887 | 1864 |
| 13 | AIC | **1967** | 2100 | 1988 |
| 13 | BIC | **1980** | 2113 | 1994 |
| 14 | AIC | **1455** | 1572 | 1489 |
| 14 | BIC | **1468** | 1585 | 1495 |
| 15 | AIC | **998** | 1135 | 1042 |
| 15 | BIC | **1011** | 1148 | 1049 |
| 16 | AIC | **1975** | 2050 | 2035 |
| 16 | BIC | **1988** | 2064 | 2042 |
| 17 | AIC | **1272** | 1410 | 1307 |
| 17 | BIC | **1285** | 1424 | 1314 |
| 18 | AIC | **1897** | 2039 | 1927 |
| 18 | BIC | **1911** | 2052 | 1933 |
| 19 | AIC | **1889** | 1932 | 1894 |
| 19 | BIC | 1902 | 1946 | **1901** |
| 20 | AIC | **1627** | 1724 | 1658 |
| 20 | BIC | **1640** | 1737 | 1664 |
As one can see from the results table, the AIC and BIC values don’t always agree. And even though the “No Competition” model showed lower AIC and BIC values for some subjects, we will select the “Full Model” as the winning one, since it fits best for the majority of subjects.
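To make “fits best for the majority of subjects” concrete, we can count for how many subject/metric rows each model achieves the lowest value (a sketch that assumes the `metrics_df` data frame and `model_names` list built above):

```python
# for each subjID/metric row, find the model with the lowest value and count the wins
wins = (metrics_df
        .set_index(['subjID', 'metric'])[model_names]
        .idxmin(axis=1)
        .value_counts())
print(wins)  # the Full Model wins the majority of the rows
```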
Comparison to the Random Choice Model
One can also check whether a subject performed actions at random with the help of “classical” statistical approaches, such as the chi-square test.
Our null hypothesis is that actions were selected at random. If this were true, then over 200 trials we would expect to see about 50 selections of each option. The alternative hypothesis is that the selection counts differ from 50. The significance level $\alpha$ is set to 0.05. We will perform the chi-square test for each subject separately.
Code
alpha = 0.05
chisq_results = {}

for subj in itertools.chain(*ids.values()):
    # get the list of observed counts of choices
    obs = agent_data[agent_data['subjID'] == subj]['choice']\
        .value_counts()\
        .sort_index()\
        .to_list()
    chisq_results[subj] = chisquare(f_obs=obs)[1]

if np.sum(np.array(list(chisq_results.values())) < alpha) == n_subj*2:
    print('Rejecting the null hypothesis for all subjects.')
else:
    print('Some subject may have chosen actions at random.')
Rejecting the null hypothesis for all subjects.
The results above show that we were able to reject the null hypothesis for all subjects, meaning that the choice counts are not equal across options (in other words, choices were not made at random) at the 0.05 significance level.
Model Validation
After selecting a model, we can proceed to model validation. In this step we try to replicate the original data with the estimated parameters and check how well they match. We will use the same `generate_agent_data` function that we used to generate the initial data set, this time with the parameter values estimated by MLE, and see how the artificial data differ from the original. As a measure of comparison we can check the ratio of selections of the 4th arm (the one with the highest reward probability and the lowest punishment probability). This lets us indirectly compare the action value functions: if the estimated parameters are close to the true parameters, the action value functions over trials will be approximately the same, and thus the probability of choosing each option will also be roughly the same.
Code
# data generation with the estimated (MLE) parameters
sim_data = pd.DataFrame()

for subj in itertools.chain(*ids.values()):

    actions, gain, loss, Qs = generate_agent_data(
        alpha_rew=selected_params[subj]['alpha_rew'],
        alpha_pun=selected_params[subj]['alpha_pun'],
        rew_sens=selected_params[subj]['R'],
        pun_sens=selected_params[subj]['P'],
        rew_prob=rew_probs,
        pun_prob=pun_probs,
        rew_vals=rew_vals,
        n_trials=n_trials)

    temp_df = pd.DataFrame({
        'subjID': subj, 'choice': actions+1, 'gain': gain, 'loss': loss})

    sim_data = pd.concat([sim_data, temp_df])

plt.figure(figsize=(10,20))
for (j, subj) in enumerate(itertools.chain(*ids.values())):
    cumavg_true = []
    cumavg_sim = []

    plt.subplot(10, 2, j+1)
    for i in range(n_trials):
        cumavg_true.append(np.mean(agent_data[agent_data['subjID'] == subj]['choice'][:i+1] == 4))
        cumavg_sim.append(np.mean(sim_data[sim_data['subjID'] == subj]['choice'][:i+1] == 4))

    plt.plot(cumavg_true, color='blue', label='True')
    plt.plot(cumavg_sim, color='red', label='Simulated')
    plt.legend()
    plt.title(f'Subject {subj}', fontweight='bold')
    plt.ylim([0,1])
    if (j+2) % 2 == 0:  # left column of subplots (odd subplot indices)
        plt.ylabel('$P(a_t = a_4)$')

    if j in [18, 19]:
        plt.xlabel('Trial', fontweight='bold')

plt.suptitle('Probability of choosing the 4th arm', y=1.01, fontsize=15, fontweight='bold')
plt.tight_layout()
plt.show()
The probabilities (ratios) of selecting the 4th option don’t always match perfectly between the true and the simulated data, but they are fairly close. We can conclude that the model captured the real behavior quite well.
Groups Comparison
Once we have selected the model and confirmed that it can reproduce the real behavioral data, we can return to the main research question posed at the beginning: does the treatment change reward sensitivity and the learning rate for positive outcomes?
We have estimated parameters for each subject, so it again comes down to statistical analysis. We will use a two-sample $t$-test to check whether there is a significant difference in each parameter between the groups at a significance level $\alpha$ of 0.05.
Results
Code
for param in params:
    print(f"==Results for {param_names[param]}==")
    ttest_res = pg.ttest(
        x=mle_params_df[mle_params_df['group'] == 'Control'][param],
        y=mle_params_df[mle_params_df['group'] == 'Treatment'][param])
    display(ttest_res)
    print()
Results for learning rate (reward):
| | T | dof | alternative | p-val | CI95% | cohen-d | BF10 | power |
|---|---|---|---|---|---|---|---|---|
| T-test | 4.188352 | 18 | two-sided | 0.000552 | [0.1, 0.3] | 1.873088 | 48.259 | 0.977033 |
Results for learning rate (punishment):
| | T | dof | alternative | p-val | CI95% | cohen-d | BF10 | power |
|---|---|---|---|---|---|---|---|---|
| T-test | 0.169418 | 18 | two-sided | 0.867357 | [-0.1, 0.12] | 0.075766 | 0.401 | 0.052959 |
Results for reward sensitivity:
| | T | dof | alternative | p-val | CI95% | cohen-d | BF10 | power |
|---|---|---|---|---|---|---|---|---|
| T-test | 1.981033 | 18 | two-sided | 0.063076 | [-0.05, 1.79] | 0.885945 | 1.463 | 0.466243 |
Results for punishment sensitivity:
| | T | dof | alternative | p-val | CI95% | cohen-d | BF10 | power |
|---|---|---|---|---|---|---|---|---|
| T-test | -0.751089 | 18 | two-sided | 0.462306 | [-1.77, 0.84] | 0.335897 | 0.485 | 0.109741 |
The $t$-test results show a significant difference only for the reward learning rate, with a large effect size. One could argue that the groups also differ in reward sensitivity (p ≈ 0.06, Cohen’s d ≈ 0.89), although this does not reach significance at the 0.05 level.
Another general issue with this approach is that each estimated parameter value comes with its own uncertainty (which can be estimated from the inverse of the Hessian matrix), but by running a $t$-test (or any alternative test) on the point estimates we discard this uncertainty and treat the values as if they were exact. The Bayesian approach doesn’t have this issue, as we will see in part 2 of the article.
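For completeness, here is a rough sketch of how such uncertainty could be extracted for a single subject: with the L-BFGS-B method, `minimize` returns an approximation of the inverse Hessian in `result.hess_inv`, and the square roots of its diagonal give approximate standard errors. Keep in mind that this is only a local approximation (and unreliable for estimates sitting at a bound):

```python
# approximate standard errors for one subject's MLE fit (illustrative sketch)
subj = 1
cond = agent_data['subjID'] == subj
actions = (agent_data[cond]['choice'] - 1).to_list()
gain = agent_data[cond]['gain'].to_list()
loss = agent_data[cond]['loss'].to_list()

result = minimize(
    fun=log_lik, x0=[0.5, 0.5, 4, 4],
    method='L-BFGS-B',
    args=(n_arms, actions, gain, loss),
    bounds=((0, 1), (0, 1), (0, 8), (0, 8)))

# L-BFGS-B returns the approximate inverse Hessian as a linear operator;
# convert it to a dense matrix and take the square root of its diagonal
se = np.sqrt(np.diag(result.hess_inv.todense()))
for name, est, err in zip(params, result.x, se):
    print(f'{name}: {est:.2f} +/- {err:.2f}')
```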
Summary and Limitations
We have looked at how to fit an RL model to behavioral data using the MLE approach and how to select the best fitting values. We then introduced the ideas of model comparison and model validation. Lastly, we used statistical analysis of the estimated parameters to answer the research question.
However, this overview has several limitations. First of all, the data were artificially generated and the parameters were, for simplicity, drawn from convenient ranges. Real behavioral data can be much noisier.
Secondly, it is worth mentioning that the data were generated with the research question in mind (cheating, admittedly) to make the example more exciting, which is why the group differences came out significant. This does not, however, change the modeling and analysis steps.
►Computational Models of Behavior, part 2: Bayesian Inference
References
Schultz W. (1998). Predictive reward signal of dopamine neurons. Journal of neurophysiology, 80(1), 1–27. https://doi.org/10.1152/jn.1998.80.1.1 ↩︎
Lockwood, P. L., & Klein-Flügge, M. C. (2021). Computational modelling of social cognition and behaviour-a reinforcement learning primer. Social cognitive and affective neuroscience, 16(8), 761–771. https://doi.org/10.1093/scan/nsaa040 ↩︎
Daw, N. (2011). Trial-by-trial data analysis using computational models. Affect, Learning and Decision Making, Attention and Performance XXIII. 23. https://doi.org/10.1093/acprof:oso/9780199600434.003.0001 ↩︎ ↩︎
Wilson, R. C., & Collins, A. G. (2019). Ten simple rules for the computational modeling of behavioral data. eLife, 8, e49547. https://doi.org/10.7554/eLife.49547 ↩︎
Seymour, B., Daw, N. D., Roiser, J. P., Dayan, P., & Dolan, R. (2012). Serotonin selectively modulates reward value in human decision-making. The Journal of neuroscience : the official journal of the Society for Neuroscience, 32(17), 5833–5842. https://doi.org/10.1523/JNEUROSCI.0053-12.2012 ↩︎