How to Deal with Multi-Armed Bandit Problem
by Ruslan Klymentiev and Zank Bennett
Originally created for a Bennett Data Science blog post.
What is the Multi-Armed Bandit Problem?
This is a classic case that comes up in various areas, from marketing to medical trials: how do you give a user what they want before you have a relationship with them? What about the case where you don't even know much about a user's affinity for an item, or for a treatment you're giving the user, as in a medical trial of several drugs or supplements?
We’ll show you how to approach these problems, and how a Multi-Armed Bandit with Thompson Sampling is generally the best choice. That’s a lot of words. Let’s slow down and start with something simple.
As an example, imagine you are running a marketing campaign for your website and have two ads to choose from. And say the objective is to show the ad with the highest click-through rate (CTR) to drive as much traffic as possible. But you don't have any prior information about how either of these ads will perform. What would be your approach? What about typical A/B testing? How about showing both of the ads equally, then, at some point, switching to the ad with the highest measured CTR? How long do you have to show both ads before settling on a "winner"? In these cases, it seems like guessing might be our best bet. In fact, that's not far off!
There’s a method for this very problem; it’s called the multi armed bandit (MAB).
There’s a lot of information explaining what the MAB algorithm is, what it does and how it works, so we’ll keep it brief. Essentially, the MAB algorithm is a near-optimal method for solving the explore exploit trade-off dilemma that occurs when we don’t know whether to explore possible options to find the one with the best payoff or exploit an option that we feel is the best, from the limited exploration done so far.
In this tutorial we will look at different algorithms to solve the MAB problem. They all have different approaches and different Exploration vs Exploitation ratios.
Here, we’ll define CTR as the ratio of how many times an ad was clicked vs. the number of impressions. For example, if an ad has been shown 100 times and clicked 10 times, CTR = 10/100 = 0.1
We’ll define regret as the difference between the highest possible CTR and the CTR shown. For example, if ad A has a known CTR or 0.1 and ad B has a known CTR of 0.3, each time we show ad A, we have a regret equal to 0.3 - 0.1 = 0.2. This seems like a small difference until we consider that an ad may be shown 1MM times in only hours.
In this tutorial, we want to demonstrate which algorithm implementation performs best in terms of minimizing regret. The four implementations we’ll use are:
- Random Selection
- Epsilon Greedy
- Thompson Sampling
- Upper Confidence Bound (UCB1)
Each method will be described below in the simulations.
To perform this experiment, we have to assume we know the CTRs in advance. That way, we can simulate a click (or not) of a given ad. For example, if we show ad A, with a known CTR of 28%, we can assume the ad will be clicked on 28% of the time and bake that into our simulation.
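Concretely, a single impression can be simulated as a Bernoulli draw whose success probability is the known CTR; this is how clicks are generated throughout the simulations below (the 28% figure is just the example from the previous paragraph):

```python
from scipy.stats import bernoulli

known_ctr_a = 0.28                       # assumed CTR for this example
did_click = bernoulli.rvs(known_ctr_a)   # 1 with probability 0.28, otherwise 0
print('Clicked!' if did_click else 'No click.')
```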
import numpy as np
import pandas as pd
from scipy.stats import beta, bernoulli
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, iplot
from plotly.subplots import make_subplots
import random
import math
Custom functions
def algorithm_performance(chosen_ads, total_reward, regret_list):
    """
    Function that shows the performance of each algorithm we use in this tutorial.
    """

    # calculate how many times each Ad has been chosen
    count_series = pd.Series(chosen_ads).value_counts(normalize=True)
    print('Ad A has been shown', count_series['A']*100, '% of the time.')
    print('Ad B has been shown', count_series['B']*100, '% of the time.')

    print('Total Reward (Number of Clicks):', total_reward)  # print the total reward

    x = np.arange(0, n, 1)

    # plot the calculated CTR for Ad A
    trace0 = go.Scatter(x=x,
                        y=ctr['A'],
                        name='Calculated CTR for Ad A',
                        line=dict(color='rgba(10, 108, 94, .7)',
                                  width=2))

    # plot the line with the actual CTR for Ad A
    trace1 = go.Scatter(x=[0, n],
                        y=[ACTUAL_CTR['A']] * 2,
                        name='Actual CTR for Ad A',
                        line=dict(color='rgb(205, 12, 24)',
                                  width=1,
                                  dash='dash'))

    # plot the calculated CTR for Ad B
    trace2 = go.Scatter(x=x,
                        y=ctr['B'],
                        name='Calculated CTR for Ad B',
                        line=dict(color='rgba(187, 121, 24, .7)',
                                  width=2))

    # plot the line with the actual CTR for Ad B
    trace3 = go.Scatter(x=[0, n],
                        y=[ACTUAL_CTR['B']] * 2,
                        name='Actual CTR for Ad B',
                        line=dict(color='rgb(205, 12, 24)',
                                  width=1,
                                  dash='dash'))

    # plot the Regret values as a function of trial number
    trace4 = go.Scatter(x=x,
                        y=regret_list,
                        name='Regret')

    fig = make_subplots(rows=2, cols=1, shared_xaxes=True)

    fig.add_trace(trace0, row=1, col=1)
    fig.add_trace(trace1, row=1, col=1)
    fig.add_trace(trace2, row=1, col=1)
    fig.add_trace(trace3, row=1, col=1)
    fig.add_trace(trace4, row=2, col=1)

    fig.update_layout(
        title='Simulated CTR Values and Algorithm Regret',
        xaxis2={'title': 'Trial Number'},
        yaxis={'title': 'CTR value'},
        yaxis2={'title': 'Regret Value'}
    )

    return fig
This rarely, if ever, happens in real life, but for simulation purposes we will assume we know the actual CTR values for both ads.
ads = ['A', 'B']
ACTUAL_CTR = {'A': .45, 'B': .65}
Random Selection
- 100% - Exploration
- 0% - Exploitation
Let's start with the most naïve approach: Random Selection. The Random Selection algorithm never exploits what it learns; it just picks the ad to show at random every time, so in effect it never stops exploring.
You can think of it as a coin flip: heads you show Ad A, tails you show Ad B. So if you have 2 ads, each ad will appear ~50% (= 100%/2) of the time. You can probably already guess the disadvantages of this approach, but let's look at the simulation.
# For each algorithm we will perform 1000 trials
n = 1000
regret = 0
total_reward = 0
regret_list = []  # list for collecting the regret values for each impression (trial)
ctr = {'A': [], 'B': []}  # lists for collecting the calculated CTR
chosen_ads = []  # list for collecting which Ad was chosen on each trial

# set the initial values for impressions and clicks
impressions = {'A': 0, 'B': 0}
clicks = {'A': 0, 'B': 0}

for i in range(n):

    random_ad = np.random.choice(ads, p=[1/2, 1/2])  # randomly choose the ad
    chosen_ads.append(random_ad)  # add the value to the list

    impressions[random_ad] += 1  # add 1 impression for the chosen Ad
    did_click = bernoulli.rvs(ACTUAL_CTR[random_ad])  # simulate whether the person clicked on the ad using the actual CTR value

    if did_click:
        clicks[random_ad] += did_click  # if the person clicked, add 1 click for the chosen Ad

    # calculate the CTR values and add them to the lists
    if impressions['A'] == 0:
        ctr_0 = 0
    else:
        ctr_0 = clicks['A']/impressions['A']

    if impressions['B'] == 0:
        ctr_1 = 0
    else:
        ctr_1 = clicks['B']/impressions['B']

    ctr['A'].append(ctr_0)
    ctr['B'].append(ctr_1)

    # calculate the regret and reward
    regret += max(ACTUAL_CTR.values()) - ACTUAL_CTR[random_ad]
    regret_list.append(regret)
    total_reward += did_click
fig = algorithm_performance(chosen_ads, total_reward, regret_list)
fig.show()
Both ads were shown an equal number of times, and the more trials we run, the closer the estimated CTR values get to their known values. However, the regret keeps increasing, since the algorithm doesn't learn anything and doesn't update itself based on the information it gains. This ever-increasing regret is exactly what we're hoping to minimize with "smarter" methods.
I would use this algorithm in two cases:
- I want to be confident about the estimated CTR value for each ad (the more impressions each ad gets, the more confident I am that the estimated CTR equals the real CTR).
- I have an unlimited marketing budget ;)
# save the reward and regret values for future comparison
random_dict = {
    'reward': total_reward,
    'regret_list': regret_list,
    'ads_count': pd.Series(chosen_ads).value_counts(normalize=True)
}
Epsilon Greedy
- ~15% - Exploration
- ~85% - Exploitation
The next approach is the Epsilon-Greedy algorithm. Its logic, as used in this tutorial, can be defined as follows (a sketch of the textbook per-step variant appears after the list):
- Run the experiment for some initial number of trials (Exploration).
- Pick the winning variant, i.e. the one with the highest CTR during that initial exploration period.
- Set the Epsilon value, $\epsilon$.
- Keep running the experiment, showing the winning variant a fraction $(1-\epsilon)$ of the time and the other variants a fraction $\epsilon$ of the time (Exploitation).
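For comparison, the textbook variant of Epsilon-Greedy makes the explore/exploit decision on every single trial rather than once after an initial exploration period. A minimal sketch of that per-step choice might look like this (the `clicks`/`impressions` counters below are hypothetical running totals, like the ones used in the simulation code):

```python
import random

epsilon = 0.05
ads = ['A', 'B']
clicks = {'A': 4, 'B': 12}         # hypothetical running counters
impressions = {'A': 20, 'B': 30}

if random.random() < epsilon:
    # explore: pick any ad uniformly at random
    chosen = random.choice(ads)
else:
    # exploit: pick the ad with the highest estimated CTR so far
    estimated_ctr = {ad: clicks[ad] / max(impressions[ad], 1) for ad in ads}
    chosen = max(estimated_ctr, key=estimated_ctr.get)

print('Show Ad', chosen)
```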
Let’s take a look at this algorithm in practice:
e = .05  # set the Epsilon value
n_init = 100  # number of impressions used to choose the winning Ad

# set the initial values for impressions and clicks
impressions = {'A': 0, 'B': 0}
clicks = {'A': 0, 'B': 0}

for i in range(n_init):
    random_ad = np.random.choice(ads, p=[1/2, 1/2])  # randomly choose the ad

    impressions[random_ad] += 1
    did_click = bernoulli.rvs(ACTUAL_CTR[random_ad])
    if did_click:
        clicks[random_ad] += did_click

ctr_0 = clicks['A'] / impressions['A']
ctr_1 = clicks['B'] / impressions['B']
win_index = np.argmax([ctr_0, ctr_1])  # select the index of the Ad with the highest CTR

print('After', n_init, 'initial trials Ad', ads[win_index],
      'got the highest CTR =', round(np.max([ctr_0, ctr_1]), 2),
      '(Real CTR value is', ACTUAL_CTR[ads[win_index]], ').'
      '\nIt will be shown', (1-e)*100, '% of the time.')
regret = 0
total_reward = 0
regret_list = []  # list for collecting the regret values for each impression (trial)
ctr = {'A': [], 'B': []}  # lists for collecting the calculated CTR
chosen_ads = []  # list for collecting which Ad was chosen on each trial

# set the initial values for impressions and clicks
impressions = {'A': 0, 'B': 0}
clicks = {'A': 0, 'B': 0}

# update probabilities
p = [e, e]
p[win_index] = 1 - e

for i in range(n):

    win_ad = np.random.choice(ads, p=p)  # choose the ad using the weighted probabilities
    chosen_ads.append(win_ad)  # add the value to the list

    impressions[win_ad] += 1  # add 1 impression for the chosen Ad
    did_click = bernoulli.rvs(ACTUAL_CTR[win_ad])  # simulate whether the person clicked on the ad using the actual CTR value

    if did_click:
        clicks[win_ad] += did_click  # if the person clicked, add 1 click for the chosen Ad

    # calculate the CTR values and add them to the lists
    if impressions['A'] == 0:
        ctr_0 = 0
    else:
        ctr_0 = clicks['A']/impressions['A']

    if impressions['B'] == 0:
        ctr_1 = 0
    else:
        ctr_1 = clicks['B']/impressions['B']

    ctr['A'].append(ctr_0)
    ctr['B'].append(ctr_1)

    # calculate the regret and reward
    regret += max(ACTUAL_CTR.values()) - ACTUAL_CTR[win_ad]
    regret_list.append(regret)
    total_reward += did_click
fig = algorithm_performance(chosen_ads, total_reward, regret_list)
fig.show()
That's much better; notice how the regret has decreased by an order of magnitude! The Epsilon-Greedy algorithm seems to perform much better than Random Selection. But can you see the problem here? The winning variant from the exploration period will not necessarily be the optimal variant, so you can end up exploiting a suboptimal variant. This increases regret and decreases reward. According to the Law of Large Numbers, the more initial trials you run, the more likely you are to pick the winning variant correctly. But in marketing you don't usually want to rely on chance and hope that you have reached that "large number of trials".
In probability theory, the law of large numbers (LLN) is a theorem that describes the result of performing the same experiment a large number of times. According to the law, the average of the results obtained from a large number of trials should be close to the expected value, and will tend to become closer as more trials are performed.
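A quick way to see the law at work is to estimate a known click probability from an increasing number of simulated Bernoulli trials; the estimate drifts toward the true value as the sample grows (the 0.65 used here is simply Ad B's CTR from this tutorial):

```python
from scipy.stats import bernoulli

true_ctr = 0.65                                    # Ad B's CTR in this tutorial
for n_trials in [10, 100, 1000, 10000]:
    samples = bernoulli.rvs(true_ctr, size=n_trials)
    print(n_trials, 'trials -> estimated CTR =', samples.mean())
```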
The good news is that you can control how often the winning ad is shown after the initial trials by choosing different $\epsilon$ values.
epsilon_dict = {
    'reward': total_reward,
    'regret_list': regret_list,
    'ads_count': pd.Series(chosen_ads).value_counts(normalize=True)}
Thompson Sampling
- 50% - Exploration
- 50% - Exploitation
Thompson Sampling's exploration is more sophisticated than the Epsilon-Greedy algorithm's. We use the Beta distribution* here, although Thompson Sampling can be generalized to sample from any distribution over the parameters.
*In probability theory and statistics, the beta distribution is a family of continuous probability distributions defined on the interval [0, 1] parametrized by two positive shape parameters, denoted by $\alpha$ and $\beta$, that appear as exponents of the random variable and control the shape of the distribution.
If you want to know more about Beta distribution here is an article I found extremely useful.
Logic (a minimal sketch of a single decision follows this list):
- Choose prior distributions for the parameters $\alpha$ and $\beta$.
- Calculate the $\alpha$ and $\beta$ values as $\alpha = \text{prior} + \text{hits}$ and $\beta = \text{prior} + \text{misses}$. In our case, hits = number of clicks and misses = number of impressions without a click. Priors are useful if you have some prior information about the actual CTRs of your ads; here we do not, so we'll use 1.
- Estimate the actual CTRs by drawing one sample from each variant's Beta distribution $B(\alpha_i, \beta_i)$ and choose the variant with the highest sampled value (estimated CTR).
- Repeat steps 2-3.
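Here is a minimal sketch of a single Thompson Sampling decision under these rules (the counters are hypothetical; `random.betavariate(alpha, beta)` draws one sample from a Beta distribution):

```python
import random

priors = {'A': 1, 'B': 1}           # uniform Beta(1, 1) priors
clicks = {'A': 4, 'B': 12}          # hypothetical counters so far
impressions = {'A': 20, 'B': 30}

samples = {}
for ad in ['A', 'B']:
    alpha = priors[ad] + clicks[ad]                          # prior + hits
    beta_param = priors[ad] + impressions[ad] - clicks[ad]   # prior + misses
    samples[ad] = random.betavariate(alpha, beta_param)

win_ad = max(samples, key=samples.get)   # show the ad with the highest sampled CTR
print('Show Ad', win_ad, samples)
```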
**Custom functions**
# functions for manual Thompson sampling

def calculate_beta_dist(win_ad):
    impressions[win_ad] += 1
    did_click = bernoulli.rvs(ACTUAL_CTR[win_ad])
    if did_click:
        clicks[win_ad] += did_click

    # sample CTR values from each ad's Beta distribution
    ctr_0 = random.betavariate(priors['A']+clicks['A'], priors['A'] + impressions['A'] - clicks['A'])
    ctr_1 = random.betavariate(priors['B']+clicks['B'], priors['B'] + impressions['B'] - clicks['B'])
    highest_ad = ads[np.argmax([ctr_0, ctr_1])]
    chosen_ads.append(highest_ad)

    ctr['A'].append(ctr_0)
    ctr['B'].append(ctr_1)
    return highest_ad


def plot_beta_dist():
    x = np.arange(0, 1, 0.01)
    y = beta.pdf(x, priors['A']+clicks['A'], priors['A'] + impressions['A'] - clicks['A'])
    y /= y.max()  # normalize

    trace0 = go.Scatter(x=x,
                        y=y,
                        name='Beta Distribution (Ad A)',
                        marker=dict(color='rgba(10, 108, 94, 1)'),
                        fill='tozeroy',
                        fillcolor='rgba(10, 108, 94, .7)')

    trace1 = go.Scatter(x=[ACTUAL_CTR['A']] * 2,
                        y=[0, 1],
                        name='Actual CTR A Value',
                        mode='lines',
                        line=dict(
                            color='rgb(205, 12, 24)',
                            width=2,
                            dash='dash'))

    y = beta.pdf(x, priors['B']+clicks['B'], priors['B'] + impressions['B'] - clicks['B'])
    y /= y.max()

    trace2 = go.Scatter(x=x,
                        y=y,
                        name='Beta Distribution (Ad B)',
                        marker=dict(color='rgba(187, 121, 24, 1)'),
                        fill='tozeroy',
                        fillcolor='rgba(187, 121, 24, .7)')

    trace3 = go.Scatter(x=[ACTUAL_CTR['B']] * 2,
                        y=[0, 1],
                        name='Actual CTR B Value',
                        mode='lines',
                        line=dict(
                            color='rgb(205, 12, 24)',
                            width=2,
                            dash='dash'))

    fig = go.Figure([trace0, trace1, trace2, trace3])
    fig.update_layout(
        title='Beta Distributions for both Ads',
        xaxis={'title': 'Possible CTR values'},
        yaxis={'title': 'Probability Density'})

    fig.show()
regret = 0
total_reward = 0
regret_list = []
ctr = {'A': [], 'B': []}
chosen_ads = []
impressions = {'A': 0, 'B': 0}
clicks = {'A': 0, 'B': 0}
priors = {'A': 1, 'B': 1}

win_ad = np.random.choice(ads, p=[1/2, 1/2])  # randomly choose the first shown Ad

for i in range(n):

    impressions[win_ad] += 1
    did_click = bernoulli.rvs(ACTUAL_CTR[win_ad])
    if did_click:
        clicks[win_ad] += did_click

    # sample CTR values from each ad's Beta distribution and pick the winner
    ctr_0 = random.betavariate(priors['A']+clicks['A'], priors['A'] + impressions['A'] - clicks['A'])
    ctr_1 = random.betavariate(priors['B']+clicks['B'], priors['B'] + impressions['B'] - clicks['B'])
    win_ad = ads[np.argmax([ctr_0, ctr_1])]
    chosen_ads.append(win_ad)

    ctr['A'].append(ctr_0)
    ctr['B'].append(ctr_1)

    regret += max(ACTUAL_CTR.values()) - ACTUAL_CTR[win_ad]
    regret_list.append(regret)
    total_reward += did_click
Code
# plot the Beta distributions
x = np.linspace(0, 1, 1000)
y = beta.pdf(x, priors['A']+clicks['A'], priors['A'] + impressions['A'] - clicks['A'])
y /= y.max()  # normalize

trace0 = go.Scatter(x=x,
                    y=y,
                    name='Beta Distribution (Ad A)',
                    marker=dict(color='rgba(10, 108, 94, 1)'),
                    fill='tozeroy',
                    fillcolor='rgba(10, 108, 94, .7)')

trace1 = go.Scatter(x=[ACTUAL_CTR['A']] * 2,
                    y=[0, 1],
                    name='Actual CTR A Value',
                    mode='lines',
                    line=dict(
                        color='rgb(205, 12, 24)',
                        width=2,
                        dash='dash'))

y = beta.pdf(x, priors['B']+clicks['B'], priors['B'] + impressions['B'] - clicks['B'])
y /= y.max()

trace2 = go.Scatter(x=x,
                    y=y,
                    name='Beta Distribution (Ad B)',
                    marker=dict(color='rgba(187, 121, 24, 1)'),
                    fill='tozeroy',
                    fillcolor='rgba(187, 121, 24, .7)')

trace3 = go.Scatter(x=[ACTUAL_CTR['B']] * 2,
                    y=[0, 1],
                    name='Actual CTR B Value',
                    mode='lines',
                    line=dict(
                        color='rgb(205, 12, 24)',
                        width=2,
                        dash='dash'))

fig = go.Figure([trace0, trace1, trace2, trace3])
fig.update_layout(
    title='Beta Distributions for both Ads',
    xaxis={'title': 'Possible CTR values'},
    yaxis={'title': 'Probability Density'})

fig.show()
Note the area where the distributions overlap. There can be cases where the sampled value from Ad A's Beta distribution is higher than the one from Ad B's, so the algorithm will choose Ad A (which actually performs worse).
fig = algorithm_performance(chosen_ads, total_reward, regret_list)
fig.show()
The regret is the lowest we've seen so far. Each uptick in regret happened when Ad A was chosen. In the CTR value plot, you can see that in the beginning, the green trace (the Thompson-sampled CTR for Ad A) was often above the tan trace (the Thompson-sampled CTR for Ad B), resulting in Ad A being shown.
Note the difference in variance for each ad. The algorithm explores constantly, and naturally exploits the variant whose sample from its Beta distribution is highest. This can be seen in the distribution plot above: the Beta distribution for Ad B is centered much higher, and is much narrower, than Ad A's, meaning its sampled values will be consistently higher than those of Ad A, which is exactly what we want!
thompson_dict = {
    'reward': total_reward,
    'regret_list': regret_list,
    'ads_count': pd.Series(chosen_ads).value_counts(normalize=True)}
Upper Confidence Bound (UCB1)
- 50% - Exploration
- 50% - Exploitation
Unlike the Thompson Sampling algorithm, Upper Confidence Bound cares more about the uncertainty (high variance) of each variant: the more uncertain we are about a variant, the more important it is to explore it.
The algorithm chooses the variant with the highest upper confidence bound (UCB), which represents an optimistic guess of the variant's reward. It is defined as follows:
$UCB_i = \bar{x}_i + \sqrt{\frac{2 \log{t}}{n_i}}$,
where $\bar{x}_i$ is the observed CTR of variant $i$,
$t$ is the total number of impressions across all variants,
and $n_i$ is the number of impressions of variant $i$.
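As a minimal sketch, the score for a single variant can be computed directly from this formula (the counter values below are hypothetical):

```python
import math

def ucb_score(clicks_i, impressions_i, total_impressions):
    """Observed CTR plus the exploration bonus from the UCB1 formula."""
    mean_ctr = clicks_i / impressions_i
    bonus = math.sqrt(2 * math.log(total_impressions) / impressions_i)
    return mean_ctr + bonus

print(ucb_score(clicks_i=4, impressions_i=20, total_impressions=50))
print(ucb_score(clicks_i=12, impressions_i=30, total_impressions=50))
```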
The logic is rather straightforward:
- Calculate the UCB for all variants.
- Choose the variant with the highest UCB.
- Go to 1.
regret = 0
total_reward = 0
regret_list = []
ctr = {'A': [], 'B': []}
chosen_ads = []
impressions = {'A': 0, 'B': 0}
clicks = {'A': 0, 'B': 0}

for i in range(n):

    win_ad = 'A'
    max_upper_bound = 0
    for k in ads:
        if (impressions[k] > 0):
            CTR = clicks[k] / impressions[k]
            delta = math.sqrt(2 * math.log(i+1) / impressions[k])
            upper_bound = CTR + delta
            ctr[k].append(CTR)
        else:
            upper_bound = float('inf')  # make sure each ad is shown at least once
        if upper_bound > max_upper_bound:
            max_upper_bound = upper_bound
            win_ad = k
    chosen_ads.append(win_ad)
    impressions[win_ad] += 1
    reward = bernoulli.rvs(ACTUAL_CTR[win_ad])

    clicks[win_ad] += reward
    total_reward += reward

    regret += max(ACTUAL_CTR.values()) - ACTUAL_CTR[win_ad]
    regret_list.append(regret)
fig = algorithm_performance(chosen_ads, total_reward, regret_list)
fig.show()
You can see that regret jumps up whenever the algorithm chooses Ad A in order to reduce the uncertainty of its CTR estimate.
This algorithm can be useful when you want the model to choose the best variant most of the time, but you are still interested in reducing the uncertainty of both variants.
ucb1_dict = {
    'reward': total_reward,
    'regret_list': regret_list,
    'ads_count': pd.Series(chosen_ads).value_counts(normalize=True)}
Comparison and Conclusions
Now let's compare these four methods and see which one performed best on our problem.
First of all, it's obvious that we want to show Ad B more often, since its actual CTR is 0.65. Let's look at how often each algorithm chose the right ad.
Code
trace0 = go.Bar(
    x=['Random Selection', 'Epsilon Greedy', 'Thompson Sampling', 'UCB1'],
    y=[random_dict['ads_count']['A'],
       epsilon_dict['ads_count']['A'],
       thompson_dict['ads_count']['A'],
       ucb1_dict['ads_count']['A']],
    name='Ad A',
    marker=dict(color='rgba(10, 108, 94, .7)'))

trace1 = go.Bar(
    x=['Random Selection', 'Epsilon Greedy', 'Thompson Sampling', 'UCB1'],
    y=[random_dict['ads_count']['B'],
       epsilon_dict['ads_count']['B'],
       thompson_dict['ads_count']['B'],
       ucb1_dict['ads_count']['B']],
    name='Ad B',
    marker=dict(color='rgba(187, 121, 24, .7)'))

fig = go.Figure([trace0, trace1])

fig.update_layout(
    title='Ratio of appearance of both Ads throughout the trials',
    xaxis={'title': 'Algorithm'},
    yaxis={'title': 'Ratio'},
    barmode='stack')

fig.show()
As you can see, the Epsilon Greedy, Thompson Sampling and UCB1 algorithms all showed Ad B most of the time (95%+).
Code
trace0 = go.Scatter(
    x=np.arange(0, n, 1),
    y=random_dict['regret_list'],
    name='Random Selection',
    marker=dict(color='#ffcc66')
)
trace1 = go.Scatter(
    x=np.arange(0, n, 1),
    y=epsilon_dict['regret_list'],
    name='e-Greedy',
    marker=dict(color='#0099ff')
)
trace2 = go.Scatter(
    x=np.arange(0, n, 1),
    y=thompson_dict['regret_list'],
    name='Thompson Sampling',
    marker=dict(color='#ff3300')
)
trace3 = go.Scatter(
    x=np.arange(0, n, 1),
    y=ucb1_dict['regret_list'],
    name='UCB1',
    marker=dict(color='#33cc33')
)

fig = go.Figure([trace0, trace1, trace2, trace3])

fig.update_layout(
    title='Regret by the Algorithm',
    xaxis={'title': 'Trial'},
    yaxis={'title': 'Regret'}
)

fig.show()
Taking into account that the Thompson Sampling and Epsilon-Greedy algorithms chose the ad with the higher CTR (Ad B) most of the time, it shouldn't come as a surprise that their regret values are the lowest.
Code
trace0 = go.Bar(
    x=[ucb1_dict['reward'], thompson_dict['reward'], epsilon_dict['reward'], random_dict['reward']],
    y=['UCB1', 'Thompson Sampling', 'e-Greedy', 'Random Selection'],
    orientation='h',
    marker=dict(color=['#33cc33', '#ff3300', '#0099ff', '#ffcc66']),
    opacity=0.7
)

trace1 = go.Scatter(
    x=[ucb1_dict['reward'], thompson_dict['reward'], epsilon_dict['reward'], random_dict['reward']],
    y=['UCB1', 'Thompson Sampling', 'e-Greedy', 'Random Selection'],
    mode='text',
    text=[ucb1_dict['reward'], thompson_dict['reward'], epsilon_dict['reward'], random_dict['reward']],
    textposition='middle left',
    line=dict(
        color='rgba(255,141,41,0.6)',
        width=1
    ),
    textfont=dict(
        family='sans serif',
        size=16,
        color='#000000'
    )
)

fig = go.Figure([trace0, trace1])

fig.update_layout(
    title='Total Reward by Algorithms',
    xaxis={'title': 'Total Reward (Clicks)'},
    margin={'l': 200},
    showlegend=False
)

fig.show()
It can happen that the algorithm with the lowest regret does not have the highest total reward. That is because even when the algorithm chooses the right ad, there is no guarantee that the user will click on it.
As we said at the beginning, Thompson Sampling is generally the best choice, but we also looked at other algorithms and discussed how and when they might be useful. The method you choose depends on your specific problem, the prior information you have, and what information you want to get out of the experiment afterwards.