Every day, we interact with our world and make decisions based on experience. Whether we are choosing where to eat or deciding between the stairs and the elevator, we are constantly making choices. Sometimes we stick with what we know, and sometimes we wing it with something new. This is a form of reinforcement learning. This kind of learning is at the heart of most living things: infants learn to walk by making mistakes, and elephants in a zoo learn to keep away from the electric fence.
Essentially, RL is decision-making under uncertainty through trial-and-error learning. Be it a computer playing Go or a robot learning to walk, reinforcement learning is what is behind these clever behaviours.
RL is not one monolithic chunk of high-level mathematics or a pile of esoteric algorithms. It has levels, with elementary forms that we are going to work through.
This article is part of a series in which I am going to work through RL from scratch up to concepts you may have heard of, such as GRPO.
What This Series Covers
We will start with bandit algorithms, which are the tutorial level of RL, before taking on the final bosses.
Some of the concepts to be covered include:
- Multi-armed Bandits - Simple form of learning from rewards
- Markov Decision Processes (MDPs) - The foundation of RL; the math behind decision-making over time
- Dynamic Programming and Monte Carlo Methods - Smarter ways to solve MDPs without brute force
- Temporal Difference Learning - Learning on the fly
- Policy Gradient Methods - Teaching how to act directly, not just which actions give you the sugar
- Actor-Critic - Making everything smoother and more stable
Multi Armed Bandits
For this, I am taking you to VEGAS🎉. Imagine you walk into a casino and your goal is to maximize your winnings by figuring out which machine pays out the most. What is your approach? Do you just stick with one machine that gives you some money? Or do you try out different machines, hoping you might be blessed by the gods of luck?
That’s the multi-armed bandit problem — a fundamental dilemma between exploration and exploitation. And believe it or not, understanding this simple problem sets the stage for everything else in RL.
Let’s define some notation that is fundamental to multi-armed bandit problems.
Notation
- $k$: number of actions (arms)
- $t$: discrete time step
- $q^*(a)$: true value of action $a$
- $Q_t(a)$: estimate of $q^*(a)$ at time $t$
- $N_t(a)$: number of selections of action $a$ up to time $t$
- $H_t(a)$: preference for action $a$ at time $t$
- $\pi_t(a)$: probability of selecting action $a$ at time $t$
- $\bar{R}_t$: estimated expected reward at time $t$
Don’t be afraid of this. We are still in Vegas, remember? I will guide you through everything. Now, let’s break down what each term means in the context of our casino analogy.
Starting with $k$, this is the number of machines (arms), each with an unknown reward distribution. $t$ represents the current decision round; every time you pull a lever, $t$ increases by 1. $q^*(a)$ is the expected reward for pulling arm $a$, i.e., the true average payout of that machine over an infinite number of pulls. Remember that the true action value is not a fixed amount but an expectation over many trials; some machines might have high variance (big payouts rarely, small payouts often). $Q_t(a)$ is your best estimate of the true value $q^*(a)$, because you’ve never been to Vegas before and therefore you don’t know which machine has better payouts. $\bar{R}_t$ is the sample mean reward observed up to time $t$; if we’ve chosen different machines, this is the overall average reward across all actions taken so far.
Back to the question: do you explore the $k$ machines, or do you just stick to one? If you choose to exploit one, you may miss out on a better machine, while if you decide to explore the options, you may lose some money but in the long run gain an understanding of which machines are the best.
But how do we estimate the value of an action?
Action-Value Methods
We will start by understanding how we can estimate the value of the action we have taken at the current timestep.
A simple way to define the estimate is by averaging past rewards we have gotten from choosing the action.
$$ Q_t(a) = \frac{\text{sum of rewards when } a \text{ taken prior to } t}{\text{number of times } a \text{ taken prior to } t} $$This might not be the best approach to estimating the value of an action, but it is a reasonable starting point.
Greedy Action Selection
Now that we have estimates $Q_t(a)$, we can choose actions based on these estimates. A straightforward way to make decisions is to always pick the action with the highest estimated value:
$$ A_t = \arg\max_a Q_t(a) $$This means at each step, we choose the action that has historically given the highest rewards. This is called the greedy policy, because we are always exploiting what we know.
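To make this concrete, here is a minimal sketch in Python (using NumPy; the variable names and structure are my own illustration, not from any particular library) of keeping sample-average estimates and picking the greedy action:

```python
import numpy as np

k = 10                       # number of arms
reward_sums = np.zeros(k)    # total reward received from each arm
counts = np.zeros(k)         # N_t(a): how many times each arm was pulled

def update(action, reward):
    """Record one observed reward for the chosen arm."""
    reward_sums[action] += reward
    counts[action] += 1

def greedy_action():
    """A_t = argmax_a Q_t(a), with unvisited arms defaulting to 0."""
    Q = np.divide(reward_sums, counts, out=np.zeros(k), where=counts > 0)
    return int(np.argmax(Q))
```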
If we only exploit, we might get stuck on a suboptimal action. What if there’s a better action we haven’t tried enough times? This is why we introduce exploration.
$\varepsilon$-Greedy Exploration
Instead of always choosing the greedy action, a better approach is to sometimes try other actions. One way to do this is the $\varepsilon$-greedy method, where we:
- With probability $1 - \varepsilon$, pick the best-known action (exploit).
- With probability $\varepsilon$, pick a random action (explore).
Mathematically:
$$ \pi_t(a) = \begin{cases} 1 - \varepsilon + \frac{\varepsilon}{k}, & \text{if } a = \arg\max Q_t(a) \\ \frac{\varepsilon}{k}, & \text{otherwise} \end{cases} $$This is better because:
- It prevents getting stuck on suboptimal actions.
- It gradually improves estimates of less-explored actions.
- It balances short-term gains with long-term learning.
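As a quick sketch (again in Python, assuming you already maintain an array `Q` of estimates; the default value of 0.1 for epsilon is just an illustrative choice), the $\varepsilon$-greedy choice can be as simple as:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

def epsilon_greedy_action(Q, epsilon=0.1):
    """With probability epsilon pick a random arm, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(Q)))   # explore
    return int(np.argmax(Q))               # exploit
```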
Incremental Implementation
The problem with estimating action values by averaging is that you need to keep a record of all the rewards received from past actions. This becomes inefficient as the number of timesteps grows large.
$$ \begin{aligned} Q_n(a) &= \frac{R_1 + R_2 + \cdots + R_{n-1}}{n-1} = {\color{blue} \frac{1}{n-1} \sum_{i=1}^{n-1} R_i} \\ Q_{n+1}(a) &= \frac{R_1 + R_2 + \cdots + R_n}{n} = \frac{1}{n} \sum_{i=1}^{n} R_i \\ &= \frac{1}{n} \left(R_n + \sum_{i=1}^{n-1} R_i\right) \\ &= \frac{1}{n} \left(R_n + (n-1) {\color{blue} \frac{1}{n-1} \sum_{i=1}^{n-1} R_i} \right) \\ &= \frac{1}{n} R_n + \frac{n-1}{n} Q_n \\ &= \frac{R_n + (n-1)Q_n}{n} \\ &= \frac{R_n + nQ_n - Q_n}{n} \\ &= Q_n + \frac{1}{n}\left(R_n - Q_n\right) \end{aligned} $$From this derivation, we can see that the new estimate can be computed from just the previous estimate and the latest reward; there is no need to store the full reward history.
The general form of this update rule is
$$ \text{NewEstimate} \leftarrow \text{OldEstimate} + \text{StepSize} \left[\text{Target} - \text{OldEstimate}\right] $$and it looks similar to the update rule of gradient ascent.
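In code, the incremental form needs only the current estimates and the pull counts; a minimal sketch (my own helper, assuming NumPy arrays `Q` and `N`) looks like:

```python
def incremental_update(Q, N, action, reward):
    """Q_{n+1} = Q_n + (1/n) * (R_n - Q_n): no reward history required."""
    N[action] += 1
    Q[action] += (reward - Q[action]) / N[action]
```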
Handling Non-Stationary Problems
So far, we have been treating bandits as if the reward distributions are fixed, as though the slot machines keep the same payout probabilities forever. But is that really true? Machines that used to pay may suddenly turn stingy.
This is the case in non-stationary environments, where the true value of actions changes as time progresses. If we treat all past data as equally relevant, our estimates will lag behind reality.
Our previous update equation:
$$ Q_{n+1}(a) = Q_n + \frac{1}{n}\left(R_n - Q_n\right) $$weights all past rewards equally. However, if the reward distribution shifts, this approach is too slow to adapt, since older data still heavily influences $Q_{n+1}$. A better approach is to gradually “forget” old data by giving recent rewards more importance.
Instead of using $\frac{1}{n}$ as the step size parameter, we will use a fixed step-size parameter $\alpha$ where $0 < \alpha \leq 1$.
This update rule makes learning slow and stable if $\alpha$ is small, or fast and responsive if $\alpha$ is large.
$$ \begin{aligned} Q_{n+1} &= Q_n + \alpha (R_n - Q_n) \\ &= \alpha R_n + (1-\alpha) {\color{cyan} Q_n} \\ &= \alpha R_n + (1-\alpha) {\color{cyan} [\alpha R_{n-1} + (1-\alpha) Q_{n-1}]} \\ &= \alpha R_n + \alpha (1-\alpha) R_{n-1} + (1-\alpha)^2 {\color{yellow} Q_{n-1}} \\ &= \alpha R_n + \alpha (1-\alpha) R_{n-1} + (1-\alpha)^2 {\color{yellow} [\alpha R_{n-2} + (1 - \alpha)Q_{n-2}]} \\ &\\ &\text{Continuing the expansion of } Q: \\ &\\ &= \alpha R_n + \alpha (1-\alpha) R_{n-1} + \alpha (1-\alpha)^2 R_{n-2} + \cdots + \alpha (1-\alpha)^{n-1} R_1 + (1-\alpha)^n Q_1 \\ &= (1-\alpha)^n Q_1 + \sum_{i=1}^{n} \alpha (1-\alpha)^{n-i} R_i \end{aligned} $$Woah woah woah. Let’s take in what is happening.
$Q_{n+1}$ is the estimate of the action value at the next time step. $(1-\alpha)^n Q_1$ is the influence of the initial estimate of the action value before any rewards were observed; the weight $(1-\alpha)^n$ determines how much that initial estimate still influences the current estimate. If $\alpha$ is small, this influence fades slowly; if $\alpha$ is large, it fades quickly. $\sum_{i=1}^{n} \alpha (1-\alpha)^{n-i} R_i$ sums up all previous rewards $R_i$ but weighs them differently: the weight of each past reward depends on the factor $(1-\alpha)^{n-i}$, which decays exponentially over time. This means that rewards from the distant past carry near-zero weight and have little effect; only the most recent rewards matter.
This is known as the exponential moving average or exponential recency-weighted average.
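The corresponding update is a one-liner; here is a small sketch (the value 0.1 for $\alpha$ is just an illustrative choice):

```python
def constant_alpha_update(Q, action, reward, alpha=0.1):
    """Exponential recency-weighted average: older rewards decay as (1 - alpha)^age."""
    Q[action] += alpha * (reward - Q[action])
```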
Using $\frac{1}{n}$ as our step size in the first approach converges to $q^*(a)$ by the law of large numbers but reacts slowly to changes in non-stationary environments, while using a fixed step size $\alpha$ adapts faster but never fully converges.
Can we dynamically adjust the step size to get the best of both worlds?
Some adaptive step-size strategies include:
- Harmonic Step-Size (Slower Decay) - $ \alpha_n(a) = \frac{c}{c + N_n(a)}$
- Recency-Weighted Average (Exponential Decay) - $ \alpha_n(a) = \frac{1}{N_n(a)^\beta}, \space 0 < \beta < 1 $
- Optimistic Step-Size Adjustment - $ \alpha_n(a) = \frac{\text{variance of rewards for a}}{\text{number of times a was chosen}}$
Step-Size Type | Converges? | Adapts Quickly? | Best for |
---|---|---|---|
$ \alpha_n = \frac{1}{N_n(a)} $ (sample average) | ✅ Yes | ❌ No | Stationary problems |
$ \alpha $ (constant step size) | ❌ No | ✅ Yes | Non-stationary problems |
$ \alpha_n = \frac{c}{c + N_n(a)} $ | ✅ Yes | ✅ Moderate | Partially changing environments |
$ \alpha_n = \frac{1}{N_n^\beta(a)}, \quad 0 < \beta < 1 $ | ✅ Yes | ✅ Good | General RL settings |
Variance-based step-size | ✅ Yes | ✅ Best | Noisy rewards |
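To show how such schedules slot into the same incremental update, here is a small sketch (the constants `c` and `beta` are illustrative, not prescribed values):

```python
def harmonic_step(count, c=10.0):
    """alpha_n(a) = c / (c + N_n(a)): decays, but more slowly than 1/N."""
    return c / (c + count)

def power_step(count, beta=0.6):
    """alpha_n(a) = 1 / N_n(a)^beta with 0 < beta < 1."""
    return 1.0 / (count ** beta)

def adaptive_update(Q, N, action, reward, step_fn=harmonic_step):
    """Same incremental rule, but with a step size that depends on the pull count."""
    N[action] += 1
    Q[action] += step_fn(N[action]) * (reward - Q[action])
```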
Upper Confidence Bounds
We have found that random exploration with probability $\varepsilon$ is great because it lets us try options which may be better than the current action. However, we can end up exploring a bad action which we already know is poor. It is like knowing that a certain slot machine is not a money spitter and still playing it, while some machines have not been tried out at all.
Instead of just picking the one with the highest estimated value, what if we also consider how uncertain we are?
For an action $a$, at time $t$, the UCB selection rule is:
$$ A_t = \arg\max_a \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right] $$where $c$ is the exploration parameter (higher values encourage more exploration).
This means that if an action has a high estimated reward, it gets chosen (exploitation), and if an action has rarely been tried, i.e., $N_t(a)$ is small, the confidence term is large and we are more likely to explore it. Over time, the uncertainty decreases and we settle on the best action.
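A minimal sketch of the UCB rule (assuming NumPy arrays `Q` and `N` and a time step `t`; untried arms get an infinite bonus so that each arm is pulled at least once):

```python
import numpy as np

def ucb_action(Q, N, t, c=2.0):
    """A_t = argmax_a [ Q_t(a) + c * sqrt(ln t / N_t(a)) ]."""
    with np.errstate(divide="ignore", invalid="ignore"):
        bonus = c * np.sqrt(np.log(t) / N)
    bonus[N == 0] = np.inf    # arms never tried are selected first
    return int(np.argmax(Q + bonus))
```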
Thompson Sampling
We saw that UCB selects actions by adding a confidence term to the action-value estimate, but it only considers the worst-case uncertainty (favouring actions with fewer samples). It is a rule-based approach.
Thompson Sampling takes a Bayesian approach: instead of just using a point estimate of the reward, we model a probability distribution over each action’s value and sample from it. Instead of relying on a single estimate $Q_t(a)$, we assume that each action has a probability distribution over its reward, which gets updated as we gather more data.
We start with a prior distribution over the expected reward of every action. Then, for each action, we sample a random reward from its estimated distribution, pick the action with the highest sampled reward, observe the actual reward, and update that action’s distribution.
This Bayesian approach naturally balances exploration and exploitation, and it works better than $\varepsilon$-greedy and UCB in many cases.
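As a sketch of the simplest case, here is Thompson Sampling for Bernoulli (win/lose) rewards with a Beta posterior per arm; the Beta-Bernoulli setup is an assumption made for illustration, not the only possible choice:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

def thompson_action(wins, losses):
    """Sample one plausible value per arm from its Beta posterior, pick the best."""
    samples = rng.beta(1 + wins, 1 + losses)   # Beta(1, 1) uniform prior
    return int(np.argmax(samples))

def thompson_update(wins, losses, action, reward):
    """Update the chosen arm's posterior with a 0/1 reward."""
    if reward:
        wins[action] += 1
    else:
        losses[action] += 1
```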
Gradient Bandits
Back in Vegas, most gamblers would estimate how good each machine is and use that. But I will teach you the real high-roller playstyle. Instead of jumping from machine to machine and tracking raw winnings for each one (which is tiresome), we can assign a preference score to each machine. A machine that seems better gets a higher score. Have a taste in machines! This is a better approach since it is self-adjusting.
What we have discussed so far is estimating action values $Q_t(a)$; now we would like to shift to learning based on preferences.
Gradient-based bandits learn a policy directly, rather than estimating action values. We will use the softmax function to compute action probabilities:
$$ \pi_t(a) = \frac{e^{H_t(a)}}{\sum_b{e^{H_t(b)}}} $$where:
- $H_t(a)$ is the preference for action $a$ at time $t$.
- $\pi_t(a)$ is the probability of selecting action $a$.
We update the action preferences based on the observed reward as follows:
$$ \begin{aligned} H_{t+1}(a) &= H_t(a) + \alpha (R_t - \bar{R}_t) (1 - \pi_t(a)) \\ H_{t+1}(b) &= H_t(b) - \alpha (R_t - \bar{R}_t) \pi_t(b) \quad \forall b \neq a \end{aligned} $$where:
- $a$ is the action selected at time $t$.
- $\bar{R}_t$ is the average reward up to time $t$, which acts as a baseline.
Generally, gradient bandits work better for non-stationary problems, and they don’t require value estimation.
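Here is a minimal sketch of one gradient-bandit interaction (the `pull` function standing in for the slot machine is hypothetical, and $\alpha = 0.1$ is just an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(seed=42)

def softmax(H):
    """pi_t(a) = exp(H_t(a)) / sum_b exp(H_t(b)), shifted by max(H) for stability."""
    z = np.exp(H - H.max())
    return z / z.sum()

def gradient_bandit_step(H, baseline, t, pull, alpha=0.1):
    """Sample an action from pi_t, observe a reward, and update all preferences."""
    pi = softmax(H)
    a = rng.choice(len(H), p=pi)
    R = pull(a)                          # observed reward from the environment
    baseline += (R - baseline) / t       # running average reward, R_bar_t
    H -= alpha * (R - baseline) * pi     # push every preference down...
    H[a] += alpha * (R - baseline)       # ...then correct the chosen action upward
    return H, baseline
```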