11 Reinforcement Learning: Optimal Decision-Making Under Uncertainty
The Core Question: How Do Machines ‘Learn’ Optimal Decisions?
How can a machine, operating in a complex and uncertain world, automatically learn an optimal sequence of decisions through interaction and trial-and-error to achieve its long-term goals?
This is not just a technical question; it touches the core of economics, including rational decision-making, dynamic optimization, and game theory.
The Concept of an ‘Agent’ in Economics
In economics, we constantly study ‘agents’—consumers, firms, governments—and how they make choices to maximize their utility or profit.
Reinforcement learning provides us with a novel, computational framework to simulate and optimize the decision-making processes of these agents, especially under conditions of incomplete information and dynamic environments.
A Business Case: Dynamic Pricing
Imagine you are a pricing algorithm (an agent) for an e-commerce platform.
State: Current inventory, competitor prices, user traffic, time of day…
Goal: Find a pricing Policy that maximizes total revenue over the next 30 days.
You don’t have a dataset of ‘correct answers’. You can only learn by trying different prices and observing the outcomes. This is the home ground of reinforcement learning.
This Chapter’s Goal: Build Economic Intuition for RL
After completing this chapter, I want you to be able to:
Understand: The fundamental differences between RL and supervised/unsupervised learning.
Master: The core framework for describing decision problems—the Markov Decision Process (MDP).
Apply: Explain the roles of Policy and Value in decision-making.
Envision: Appreciate the potential of algorithms like Q-learning in dynamic pricing or algorithmic trading.
The Three Families of Machine Learning
Machine learning is typically categorized into three main paradigms based on the learning style and data format.
Supervised Learning
Learns a mapping function from inputs to outputs, like a teacher providing correct answers.
Core Task: Prediction or Classification.
Data Format: Labeled data (X, y), where y is the correct answer.
Economics Analogy:
Regression: Predicting next year’s GDP based on historical data (GDP, interest rates, inflation).
Classification: Determining if a customer will default on a loan based on their information.
Limitation: Relies on large amounts of high-quality labeled data; cannot make sequential decisions.
Unsupervised Learning
Discovers hidden structures or patterns in data without any correct answers.
Core Task: Discovering data structure.
Data Format: Unlabeled data X.
Economics Analogy:
Clustering: Segmenting customers into different market segments.
Dimensionality Reduction: Extracting a few key factors from hundreds of macroeconomic indicators.
Limitation: Can only describe data, not guide decisions.
Reinforcement Learning
An agent learns how to make a sequence of decisions through ‘trial and error’ interaction with an environment to maximize cumulative reward.
Core Task: Learning an optimal sequence of decisions.
Data Format: Interaction data (state, action, reward).
Economics Analogy:
A firm continuously adjusts its advertising, R&D, and pricing strategies over multiple quarters to maximize its long-term market share and profit.
The Essence of RL: No ‘Correct Answer’
The most fundamental difference from supervised learning is that there is no predefined correct action in any given situation.
The Essence of RL: Delayed Feedback
Today’s action might have consequences that only become apparent far into the future. The feedback signal (reward) is often delayed.
The Essence of RL: The Explore vs. Exploit Trade-off
An agent must decide: should it exploit what it already knows to get a good reward, or should it explore new actions to potentially discover even better rewards?
Core Framework: Markov Decision Process (MDP)
Reinforcement learning problems are typically modeled mathematically as a Markov Decision Process (MDP).
This is a mathematical framework for modeling sequential interactions between a decision-maker (the agent) and an environment. Understanding the MDP is the foundation for understanding all of RL.
It consists of five core components: (S, A, P, R, γ).
The Protagonists of an MDP: Agent and Environment
MDP Component 1: State (S)
A State (\(s \in S\)) is a snapshot description of the environment at a particular moment in time.
Definition: A set of information describing the current situation of the world, sufficient for making future decisions.
Key Property (Markov Property): The future depends only on the present state, not on the sequence of events that preceded it. \[
\large{ P(S_{t+1}|S_t, A_t) = P(S_{t+1}|S_1, A_1, ..., S_t, A_t) }
\]
Economics Examples:
A company’s state could be (cash flow, inventory level, market share).
An economy’s state could be (GDP growth, inflation rate, unemployment rate).
MDP Component 2: Action (A)
An Action (\(a \in A\)) is an operation that the agent can perform.
Definition: The set of behaviors the agent can choose from in each state.
Types: Can be discrete (e.g., Buy, Sell, Hold) or continuous (e.g., set price to $10.53).
Economics Examples:
A company’s actions could be (set product price, determine advertising budget).
A central bank’s actions could be (raise rates by 25 bps, lower rates, maintain).
MDP Component 3: Reward (R)
A Reward (\(R_t\)) is the immediate feedback signal the environment gives to the agent after it takes action \(A_t\) in state \(S_t\).
Definition: A scalar value that measures the ‘goodness’ of an action. The Reward Hypothesis states that all goals can be described as the maximization of expected cumulative reward.
RL’s Goal: To maximize cumulative reward, not instantaneous reward.
Economics Examples:
For a trading algorithm, the reward is the daily portfolio return.
For a company, the reward is the quarterly profit.
MDP Component 4: Transition Probability (P)
State Transition Probability describes the dynamics of the environment.
Definition: The probability of transitioning to the next state \(s'\) after taking action \(a\) in state \(s\).
Mathematical Notation: \[
\large{ p(s' | s, a) = P(S_{t+1} = s' | S_t = s, A_t = a) }
\]
Economics Example:
Given a current market share of 20% (s), if the company invests $1M in advertising (a), what is the probability that the market share will grow to 22% (s') next quarter? This is often stochastic.
MDP Component 5: Discount Factor (\(\gamma\))
The Discount Factor (\(\gamma\)) reflects the agent’s preference for present rewards over future rewards. We will discuss this in more detail shortly. It is a value between 0 and 1.
Summary: The MDP Quintuple
A complete Markov Decision Process is defined by a quintuple: \((\mathcal{S}, \mathcal{A}, P, R, \gamma)\)
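To make the quintuple concrete, here is a minimal sketch (in Python, with invented states, actions, and numbers) of how such a tuple could be written down for a tiny, hypothetical pricing MDP:

```python
import numpy as np

# A tiny, hypothetical MDP with 2 states and 2 actions, purely for illustration
states = ["low_demand", "high_demand"]        # S
actions = ["keep_price", "cut_price"]         # A

# P[s, a, s']: probability of landing in s' after taking action a in state s
P = np.array([
    [[0.8, 0.2],    # low_demand,  keep_price
     [0.5, 0.5]],   # low_demand,  cut_price
    [[0.3, 0.7],    # high_demand, keep_price
     [0.1, 0.9]],   # high_demand, cut_price
])

# R[s, a]: expected immediate reward (e.g., profit) for taking a in s
R = np.array([
    [1.0, 0.5],     # low_demand
    [3.0, 2.0],     # high_demand
])

gamma = 0.95        # discount factor

# Sanity check: every (s, a) row of P is a probability distribution over s'
assert np.allclose(P.sum(axis=2), 1.0)
```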
The Interaction Loop: The Agent-Environment ‘Dance’
The core process of reinforcement learning is a continuous loop of interaction.
The Interaction Loop: A Step-by-Step Breakdown
Observe: At time t, the agent observes the environment’s state \(S_t\).
(e.g., A trading algorithm observes the current stock price is $100 and volume is 500k shares.)
Decide: The agent selects an action \(A_t\) based on its policy \(\pi\).
(e.g., The algorithm decides to ‘Buy 100 shares’.)
Execute: The environment receives the action \(A_t\).
Evolve: The environment transitions to a new state \(S_{t+1}\) and provides a reward \(R_{t+1}\) based on its internal dynamics.
(e.g., The market executes the trade, the price moves to $101, and the day’s unrealized profit is +$100.)
This loop repeats.
The Agent’s Brain: The Policy Function (\(\pi\))
The Policy (\(\pi\)) is the agent’s code of conduct, its ‘brain’. It defines a mapping from states to actions.
Definition: A mapping from states to actions. It tells the agent what to do in each state.
Types:
Deterministic Policy: \(a = \pi(s)\), a unique action for each state.
Stochastic Policy: \(\pi(a|s) = P(A_t = a | S_t = s)\), the probability of taking action \(a\) in state \(s\).
RL’s Goal: To find an optimal policy \(\pi^*\) that maximizes long-term return.
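As a quick illustration (the state names and probabilities are invented), the two types of policy can be represented very simply in code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Deterministic policy: a = pi(s), one fixed action per state
pi_deterministic = {"low_demand": "keep_price", "high_demand": "cut_price"}

# Stochastic policy: pi(a|s), a probability distribution over actions per state
pi_stochastic = {
    "low_demand":  {"keep_price": 0.9, "cut_price": 0.1},
    "high_demand": {"keep_price": 0.3, "cut_price": 0.7},
}

def sample_action(pi, state):
    """Draw one action from a stochastic policy pi(a|s)."""
    actions, probs = zip(*pi[state].items())
    return rng.choice(list(actions), p=list(probs))

print(pi_deterministic["low_demand"])               # always 'keep_price'
print(sample_action(pi_stochastic, "high_demand"))  # random, mostly 'cut_price'
```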
The Economic Intuition of a Stochastic Policy
Why might we need a policy that is random?
To Explore the Unknown
A deterministic policy might get stuck in a rut, missing out on better options forever. Randomness allows the agent to explore new, untried actions. This is like a company occasionally trying a completely new marketing channel.
To Counter Adversaries
In some situations, the optimal behavior is inherently random. This is common in game theory; the optimal strategy for ‘Rock, Paper, Scissors’ is to choose randomly, making you unpredictable to your opponent.
The Objective Function: Maximize Cumulative Return
The agent’s goal is not to maximize the immediate reward \(R_{t+1}\), but to maximize the future cumulative return starting from the current moment.
The return \(G_t\) is defined as: \[
\large{ G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} }
\]
Core Concept: The Discount Factor (\(\gamma\))
The \(\gamma\) (gamma) in the formula is the discount factor, where \(0 \le \gamma \le 1\).
Economic Intuition: This is directly analogous to discounting in Net Present Value (NPV) calculations in finance: \(\gamma\) plays the role of \(1/(1+r)\), where \(r\) is the discount rate.
Why do we need to discount?
Time Value of Money: A dollar today is worth more than a dollar tomorrow.
Model Uncertainty: Our predictions about the distant future are inherently uncertain; the discount factor reflects this.
Mathematical Convergence: It mathematically ensures that the sum of an infinite series of rewards is a finite value.
The Effect of the Discount Factor: An Example
Suppose the reward sequence is [+10, +2, +5, +8, ...]
\(\gamma = 0\) (Extremely Myopic)
\(G_t = 10 + 0 \cdot 2 + 0 \cdot 5 + \dots = 10\). The agent only cares about the immediate reward.
The choice of \(\gamma\) is a key modeling decision that defines the agent’s degree of ‘foresight’.
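As a sanity check on this arithmetic, a few lines of Python (a minimal sketch using only the four rewards listed above) show how the same sequence is valued under different discount factors:

```python
# Discounted return over the first four rewards of the example sequence
rewards = [10, 2, 5, 8]

def discounted_return(rewards, gamma):
    """Truncated G_t = sum_k gamma^k * R_{t+k+1} over the rewards we have."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

for gamma in [0.0, 0.5, 0.9, 1.0]:
    print(f"gamma = {gamma:.1f}  ->  partial G_t = {discounted_return(rewards, gamma):.2f}")
```

With \(\gamma = 0\) only the first reward counts (10.00); as \(\gamma\) rises toward 1, later rewards contribute more and more (up to 25.00 for the undiscounted sum of these four terms).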
The Yardstick for a Good Policy: The Value Function
How do we determine if a state is ‘good’ or ‘bad’? Or if an action is ‘good’ or ‘bad’?
This is the role of the Value Function. It is the Expectation of future returns.
There are two core types of value functions:
State-Value Function (\(V_\pi(s)\))
Action-Value Function (\(Q_\pi(s, a)\))
State-Value Function V(s): How Good is My Current Situation?
The State-Value Function \(V_\pi(s)\) answers the question: ‘If I start from state \(s\) and always follow policy \(\pi\), how much total future return can I expect to get?’
Economics Example: Given the current macroeconomic conditions (state \(s\)), if the Federal Reserve follows its current interest rate policy (policy \(\pi\)), what is the expected value of future total economic output (return)?
Action-Value Function Q(s, a): How Good is This Choice?
The Action-Value Function \(Q_\pi(s, a)\) (also known as the Q-function) answers: ‘In state \(s\), if I take action \(a\) and then follow policy \(\pi\) thereafter, how much total future return can I expect?’
Intuition: \(Q_\pi(s, a)\) is an estimate of the Quality of taking a specific action in a specific state.
Economics Example: Given the current inventory level (state \(s\)), if the company chooses to cut prices by 10% (action \(a\)), what is the expected value of the company’s future total profit (return)?
The Relationship Between V and Q Functions
The value of a state is the expectation of the values of all actions that could be taken from that state, weighted by the policy’s probability of choosing them: \[
\large{ V_\pi(s) = \sum_{a \in A} \pi(a|s) \, Q_\pi(s, a) }
\]
\(\pi(a|s)\) is the probability of choosing action \(a\) in state \(s\).
\(Q_\pi(s, a)\) is the value after choosing action \(a\).
This relationship is very intuitive and crucial for later algorithms.
The Core of Dynamic Programming: The Bellman Expectation Equation
The Bellman Equation is the most fundamental equation in reinforcement learning. It establishes a recursive relationship between the value of a state and the values of its successor states.
For the state-value function \(V_\pi(s)\): \[
\large{ V_\pi(s) = \sum_{a \in A} \pi(a|s) \left( E[R_{t+1}|s,a] + \gamma \sum_{s' \in S} p(s'|s,a) V_\pi(s') \right) }
\]
In words: The value of the current state = the expectation over all possible actions of the (immediate reward + the discounted expected value of the next state).
Decomposing the Bellman Expectation Equation (V-function)
Let’s break down the equation: \[ V_\pi(s) = \sum_{a \in A} \pi(a|s) \Big( \underbrace{R(s,a)}_{\text{What I get now}} + \gamma \underbrace{\sum_{s' \in S} p(s'|s,a) V_\pi(s')}_{\text{What I can expect later}} \Big) \]
Outer Expectation \(\sum_{a \in A} \pi(a|s) \dots\): Because our policy might be stochastic, we must average over all possible actions.
Immediate Reward \(R(s,a)\): The reward received immediately after taking action \(a\).
Inner Expectation \(\sum_{s' \in S} p(s'|s,a) \dots\): Because the environment’s response (state transition) might be stochastic, we must average over all possible next states.
\(V_\pi(s')\): This is the magic of recursion. The value of the next state can be expanded in the same way.
Bellman Expectation Equation (Q-function)
For the action-value function \(Q_\pi(s, a)\), the Bellman equation has a slightly different form: \[
\large{ Q_\pi(s, a) = R(s,a) + \gamma \sum_{s' \in S} p(s'|s,a) \sum_{a' \in A} \pi(a'|s') Q_\pi(s', a') }
\]
In words: The value of taking action \(a\) in state \(s\) = immediate reward + discounted expected value of the next state (which, under \(\pi\), is an average over the next actions \(a'\)).
Solving the RL Problem: Finding the Optimal Policy \(\pi^*\)
The ultimate goal of reinforcement learning is to find an optimal policy \(\pi^*\) that achieves a higher or equal expected return than any other policy from any initial state.
\[
\large{ \pi^* = \arg\max_{\pi} V_\pi(s) \quad \text{for all } s \in S }
\]
The corresponding optimal value functions are denoted \(V^*(s)\) and \(Q^*(s, a)\).
The Bellman Optimality Equation
For an optimal policy, the Bellman equation takes a special form: the Bellman Optimality Equation. It no longer averages over actions according to a policy but instead takes the maximum: \[
\large{ V^*(s) = \max_{a \in A} \left( R(s,a) + \gamma \sum_{s' \in S} p(s'|s,a) V^*(s') \right) }
\]\[
\large{ Q^*(s, a) = R(s,a) + \gamma \sum_{s' \in S} p(s'|s,a) \max_{a' \in A} Q^*(s', a') }
\]
The max operator embodies optimality: the agent always chooses the action that leads to the best possible future.
How to Solve? Generalized Policy Iteration (GPI)
We know the target (the Bellman Optimality Equation), but how do we find the solution? Most RL algorithms follow a common pattern called Generalized Policy Iteration (GPI).
This is an iterative process involving two intertwined steps: Policy Evaluation and Policy Improvement, described below.
Policy Evaluation: How Good is My Current Policy?
Suppose we have a fixed policy \(\pi\) (e.g., an existing set of trading rules). We want to know its value.
Task: Compute \(V_\pi(s)\).
Method: Use the Bellman Expectation Equation and solve it iteratively. \[
\large{ V_{k+1}(s) = \sum_{a \in A} \pi(a|s) \left( R(s,a) + \gamma \sum_{s' \in S} p(s'|s,a) V_k(s') \right) }
\] We start with a random \(V_0\) and repeatedly use the old value function to compute the new one until \(V_k\) converges. This process is called iterative policy evaluation.
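Here is a minimal sketch of this iteration in Python, run on the tiny, invented two-state MDP from earlier in the chapter (all numbers are illustrative):

```python
import numpy as np

# Illustrative 2-state, 2-action MDP (same shapes as the earlier sketch)
P = np.array([[[0.8, 0.2], [0.5, 0.5]],
              [[0.3, 0.7], [0.1, 0.9]]])    # P[s, a, s']
R = np.array([[1.0, 0.5],
              [3.0, 2.0]])                   # R[s, a]
gamma = 0.95
pi = np.array([[0.9, 0.1],
               [0.3, 0.7]])                  # pi[s, a]: a stochastic policy

V = np.zeros(2)                              # arbitrary initial guess V_0
for k in range(10_000):
    # Bellman expectation backup:
    # V_{k+1}(s) = sum_a pi(a|s) * (R(s,a) + gamma * sum_s' p(s'|s,a) V_k(s'))
    V_new = (pi * (R + gamma * P @ V)).sum(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:     # stop once the values stabilize
        break
    V = V_new

print("Converged V_pi:", V.round(3))
```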
Policy Improvement: Can I Do Better?
Once we know the value function \(V_\pi\) for our current policy \(\pi\), we can try to improve it.
Method: For each state \(s\), instead of following \(\pi\), we act ‘greedily’ by choosing the action that leads to the highest Q-value: \[
\large{ \pi'(s) = \arg\max_{a \in A} Q_\pi(s,a) }
\]\[
\large{ = \arg\max_{a \in A} \left( R(s,a) + \gamma \sum_{s' \in S} p(s'|s,a) V_\pi(s') \right) }
\]
It can be proven that the new policy \(\pi'\) is guaranteed to be better than or equal to the old policy \(\pi\). This is the Policy Improvement Theorem.
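Continuing the policy-evaluation sketch above (and reusing its `P`, `R`, `gamma`, and converged `V`), the improvement step is a single one-step lookahead:

```python
# Greedy policy improvement: act greedily with respect to the evaluated V_pi
Q = R + gamma * P @ V                   # Q[s, a] = R(s,a) + gamma * sum_s' p(s'|s,a) V(s')
pi_improved = np.argmax(Q, axis=1)      # best action index in every state
print("Improved (deterministic) policy:", pi_improved)
```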
Algorithm Classification: Do We Need a Model of the Environment?
Algorithms that solve this GPI loop can be divided into two main categories:
Model-Based
The algorithm needs to know the complete model of the environment, i.e., the state transition probabilities \(p(s'|s,a)\) and the reward function \(R(s,a)\).
Typical Algorithm: Dynamic Programming
Advantage: High data efficiency.
Disadvantage: We almost never know the exact model of the environment in the real world (e.g., the transition probabilities of the stock market).
Model-Free
The algorithm does not need a model of the environment. It learns directly from samples (experience) generated by interacting with the environment.
Typical Algorithms: Q-Learning, Sarsa
Advantage: Much more applicable to complex, real-world problems.
Model-Free Methods: A Metaphor for Exploration
Model-Free Algorithm Classification: How Do We Learn?
Within model-free algorithms, we can further classify them based on their learning style:
On-policy
The agent learns from and improves the same policy it uses to make decisions.
Analogy: A novice driver learns to improve their own driving skills while they are actively driving.
Typical Algorithm: Sarsa.
Off-policy
The agent can learn from a policy that is different from the one it is currently using to explore. It can learn a target (greedy) policy while behaving according to an exploratory policy.
Analogy: An expert learns how to be a perfect driver by watching videos of a novice driver’s mistakes.
Typical Algorithm: Q-Learning.
Model-Free Method 1: Monte Carlo (MC)
MC methods are the most intuitive model-free learning approach.
Core Idea: Estimate the value function by simulating a large number of complete episodes. An episode is a full trajectory from a starting state to a terminal state.
How to estimate \(Q_\pi(s,a)\)?
Follow policy \(\pi\) and play many episodes.
For every episode that visited the pair \((s,a)\), record the actual return \(G_t\) that followed.
Average these returns to estimate \(Q_\pi(s,a)\). \[
\large{ Q(s,a) \leftarrow \text{Average}(G_t \text{ for all visits to } (s,a)) }
\]
Drawback: Only applicable to episodic tasks (with a clear end), and learning only happens after an episode is complete.
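A minimal sketch of the averaging step, using a handful of invented episodes, looks like this:

```python
from collections import defaultdict

# Each episode is a list of (state, action, reward) triples; values are invented
episodes = [
    [("s0", "a1", 0.0), ("s1", "a0", 0.0), ("s2", "a1", 1.0)],
    [("s0", "a1", 0.0), ("s2", "a0", 0.0), ("s2", "a1", 1.0)],
    [("s0", "a0", 0.0), ("s1", "a1", 0.0), ("s2", "a0", 0.0)],
]
gamma = 0.9
returns = defaultdict(list)   # (s, a) -> list of sampled returns G_t

for episode in episodes:
    # Sweep backwards so each G_t accumulates the discounted future rewards
    G, G_t = 0.0, [0.0] * len(episode)
    for t in reversed(range(len(episode))):
        G = episode[t][2] + gamma * G
        G_t[t] = G
    # First-visit MC: record G_t only at the first occurrence of each (s, a)
    seen = set()
    for t, (s, a, _) in enumerate(episode):
        if (s, a) not in seen:
            returns[(s, a)].append(G_t[t])
            seen.add((s, a))

Q = {sa: sum(g) / len(g) for sa, g in returns.items()}   # average of returns
print(round(Q[("s0", "a1")], 3))   # estimate of Q_pi(s0, a1)
```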
Model-Free Method 2: Temporal-Difference (TD)
TD learning is one of the most central and innovative ideas in reinforcement learning. It combines the advantages of MC and dynamic programming.
Core Idea: Learn from every single step, without waiting for the episode to end. It updates the value of the current state using an estimate of the next state’s value.
Comparison with MC:
MC’s update target is the actual final return \(G_t\).
TD’s update target is an estimated future return, \(R_{t+1} + \gamma V(S_{t+1})\).
This process of updating an estimate with another estimate is called bootstrapping.
The TD(0) Algorithm’s Update Rule
The simplest TD algorithm, TD(0), has the following update rule: \[
\large{ V(S_t) \leftarrow V(S_t) + \alpha \underbrace{\left[ \overbrace{R_{t+1} + \gamma V(S_{t+1})}^{\text{TD Target}} - V(S_t) \right]}_{\text{TD Error}} }
\]
\(\alpha\): The learning rate, which determines how much we update our estimate based on the new information.
TD Target (\(R_{t+1} + \gamma V(S_{t+1})\)): A more accurate estimate of the future return than our current \(V(S_t)\).
TD Error (\(\delta_t\)): The discrepancy between our current estimate and the more accurate target. The algorithm aims to minimize this error over time.
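A single TD(0) backup is just a few lines of arithmetic; in this minimal sketch every number (values, reward, \(\alpha\), \(\gamma\)) is invented for illustration:

```python
# One TD(0) backup with made-up numbers
alpha, gamma = 0.1, 0.9
V = {"s": 4.0, "s_next": 6.0}    # current value estimates
reward = 1.0                      # R_{t+1} observed on leaving state "s"

td_target = reward + gamma * V["s_next"]   # 1.0 + 0.9 * 6.0 = 6.4
td_error = td_target - V["s"]              # 6.4 - 4.0 = 2.4
V["s"] += alpha * td_error                 # 4.0 + 0.1 * 2.4 = 4.24

print(V["s"])
```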
A Vivid Example: Walking from Dorm to Classroom
Imagine you want to predict how long it takes to walk from your dorm to the classroom. Your initial guess is 18 minutes.
MC Method: You must walk the entire way and find out it took 25 minutes. Only then can you go back and update your value estimate for the ‘Leave Dorm’ state.
TD Method: You get downstairs and see it’s raining. You immediately update your prediction: now you think it might take 23 minutes. You update your prediction at every step based on new information. This is more efficient and closer to how humans learn.
On-Policy TD Control Algorithm: Sarsa
Sarsa is an on-policy TD control algorithm that aims to learn the Q-function.
Name Origin: Its update rule requires a quintuple \((S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})\), which spells out S, A, R, S’, A’.
Key Point: When updating \(Q(S_t, A_t)\), it uses the Q-value of the action \(A_{t+1}\) that is actually taken in state \(S_{t+1}\): \[
\large{ Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right] }
\]
The Exploration-Exploitation Dilemma: \(\epsilon\)-greedy Policy
To ensure the algorithm explores all possibilities, we typically don’t use a purely greedy policy (i.e., always choosing the action with the highest Q-value). Instead, we use an \(\epsilon\)-greedy policy.
With probability \(1-\epsilon\), choose the best-estimated action: \(a^* = \arg\max_a Q(s,a)\). (Exploit)
With probability \(\epsilon\), choose an action randomly from all possible actions. (Explore)
The value of \(\epsilon\) is often decayed over time, shifting the focus from exploration to exploitation as learning progresses.
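For reference, an \(\epsilon\)-greedy selection function might look like the following sketch (here `q_values` is assumed to be a 1-D array of Q-values for the current state, and the decay schedule is just one possible choice):

```python
import numpy as np

rng = np.random.default_rng(42)

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # Explore
    return int(np.argmax(q_values))               # Exploit

q_values = np.array([0.1, 0.5, 0.3, 0.0])         # example Q-values for one state
for episode in range(3):
    epsilon = max(0.01, 0.9 * 0.99 ** episode)    # decay epsilon over episodes
    print(episode, epsilon_greedy(q_values, epsilon))
```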
Off-Policy TD Control Algorithm: Q-Learning
Q-Learning is one of the most famous and widely used algorithms in reinforcement learning. It is an off-policy algorithm.
Core Idea: It follows a behavior policy (e.g., \(\epsilon\)-greedy) to explore, while simultaneously learning the Q-values of a different target policy (the purely greedy policy).
Key Difference: Note the \(\max_{a'} Q(S_{t+1}, a')\) in the update rule: \[
\large{ Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a') - Q(S_t, A_t) \right] }
\] It no longer cares which action was actually taken in \(S_{t+1}\); it directly uses the maximum possible Q-value in \(S_{t+1}\) for the update.
Sarsa vs. Q-Learning: A Cliff Walking Analogy
Imagine an agent walking along the edge of a cliff, trying to get from the start (S) to the goal (G). Falling off the cliff results in a large negative reward.
## Sarsa vs. Q-Learning: Interpretation
- **Sarsa (On-policy)**: It's a 'cautious' learner. Because it knows its next action is also exploratory (and might accidentally step off the cliff), it factors this risk into its Q-value updates. It learns a **safer**, albeit slower, path that stays far from the cliff.
- **Q-Learning (Off-policy)**: It's a 'bold' learner. It always assumes the optimal action will be taken next, regardless of the fact that it might make mistakes while exploring. It learns the **fastest** path right along the cliff edge, because it believes it will eventually execute that path perfectly.
| Feature | Sarsa (On-policy) | Q-Learning (Off-policy) |
| :------------ | :---------------------------------------------- | :------------------------------------------------ |
| **Objective** | Learns the value of its current behavior policy | Learns the value of the optimal policy |
| **Update** | Uses the Q-value of the **next actual** action | Uses the Q-value of the **next possible best** action |
| **Behavior** | More conservative, avoids risky shortcuts | More aggressive, favors optimal but risky paths |
## Python in Practice: Solving the 'FrozenLake' Problem with Q-Learning
To make these concepts concrete, let's look at a classic RL problem: **FrozenLake**.
- **Environment**: A 4x4 grid, with some tiles being safe frozen ice (F) and others being holes (H).
- **Goal**: Navigate from the start (S) to the goal (G) without falling into a hole.
- **States**: 16 grid positions.
- **Actions**: Up, Down, Left, Right.
- **Reward**: +1 for reaching the goal, 0 otherwise.
- **Challenge**: The ice is slippery. An action to move 'right' might have some probability of sliding somewhere else.
This is a classic MDP.
## Q-Learning Code Framework: Initialization
We need a Q-table to store the value of each (state, action) pair. Initially, we know nothing, so we set all values to zero.
::: {#b5531c9e .cell execution_count=1}
``` {.python .cell-code}
import numpy as np
import pandas as pd

# Environment parameters
n_states = 16   # 4x4 grid
n_actions = 4   # left, down, up, right

# Initialize Q-table with all zeros
q_table = np.zeros((n_states, n_actions))

print('Initialized Q-Table (Shape: {}):'.format(q_table.shape))

# Use pandas for pretty printing
q_df = pd.DataFrame(q_table, columns=['Left', 'Down', 'Up', 'Right'])
q_df.index.name = 'State'
print(q_df.head())
```

::: {.cell-output .cell-output-stdout}
```
Initialized Q-Table (Shape: (16, 4)):
       Left  Down   Up  Right
State
0       0.0   0.0  0.0    0.0
1       0.0   0.0  0.0    0.0
2       0.0   0.0  0.0    0.0
3       0.0   0.0  0.0    0.0
4       0.0   0.0  0.0    0.0
```
:::
:::
Q-Learning Code Framework: The Main Loop
The core of the algorithm is a loop where the agent interacts with the environment and updates the Q-table. To make this code self-contained, we will simulate a simple, non-slippery version of the FrozenLake environment.
``` {.python .cell-code}
import numpy as np
import pandas as pd

# --- Mock Environment ---
MAP = ["SFFF", "FHFH", "FFFH", "HFFG"]
# Action effects: 0:Left, 1:Down, 2:Up, 3:Right
ACTION_EFFECTS = [-1, 4, -4, 1]
GOAL_STATE = 15
HOLES = [5, 7, 11, 12]

def mock_step(state, action):
    new_state = state + ACTION_EFFECTS[action]
    # Boundary checks
    if (action == 0 and state % 4 == 0) or \
       (action == 3 and state % 4 == 3) or \
       (action == 2 and state < 4) or \
       (action == 1 and state > 11):
        new_state = state  # Stay put if hitting a wall
    if new_state in HOLES:
        return new_state, 0, True   # Fell in a hole, no reward, episode ends
    elif new_state == GOAL_STATE:
        return new_state, 1, True   # Reached goal, reward 1, episode ends
    else:
        return new_state, 0, False  # Normal move

# --- Q-Learning Algorithm ---
q_table = np.zeros((16, 4))
learning_rate = 0.1
gamma = 0.99
epsilon = 0.1
n_episodes = 5000

for _ in range(n_episodes):
    state = 0          # Start at S
    done = False
    while not done:
        if np.random.uniform(0, 1) < epsilon:
            action = np.random.randint(0, 4)        # Explore
        else:
            action = np.argmax(q_table[state, :])   # Exploit

        new_state, reward, done = mock_step(state, action)

        old_value = q_table[state, action]
        next_max = np.max(q_table[new_state, :])
        # The core Q-learning update formula
        new_value = old_value + learning_rate * (reward + gamma * next_max - old_value)
        q_table[state, action] = new_value

        state = new_state

# --- Print Results ---
print("Final Q-Table:")
q_df_final = pd.DataFrame(q_table, columns=['Left', 'Down', 'Up', 'Right'])
q_df_final.index.name = 'State'
print(q_df_final.round(3))
```
From Tables to Reality: The Curse of Dimensionality
The tabular method (Q-table) we just discussed works well for problems with small state and action spaces.
But what about real-world economic problems?
The state space of Chess is approximately \(10^{47}\).
The state space of Go is approximately \(10^{170}\).
The state of a self-driving car is continuous (position, velocity, orientation), making the state space infinite.
It is impossible to create a Q-table for these problems. This is where Deep Reinforcement Learning (DRL) comes in.
Deep Reinforcement Learning: Neural Networks for RL
The core idea of DRL is to use a deep neural network to approximate the value function or policy function, instead of a table.
Value Network: Takes a state \(s\) as input and outputs its value \(V(s)\), or the Q-value for each action \(Q(s,a)\). \[
\large{ Q(s, a; \mathbf{w}) \approx Q^*(s, a) }
\] Here, \(\mathbf{w}\) represents the weights of the neural network.
Policy Network: Takes a state \(s\) as input and outputs the probability of taking each action, \(\pi(a|s)\).
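As an illustration of such a value network, the sketch below uses PyTorch (chosen only for illustration; it is not used elsewhere in this chapter, and the layer sizes are arbitrary): a small network that takes a state vector and outputs one Q-value per action.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one Q-value per action: Q(s, .; w)."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

# Example: a 4-dimensional state, 2 possible actions
q_net = QNetwork(state_dim=4, n_actions=2)
state = torch.randn(1, 4)                  # a batch containing one state
q_values = q_net(state)                    # shape (1, 2): one Q-value per action
greedy_action = q_values.argmax(dim=1).item()
print(q_values, greedy_action)
```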
Milestone: The Deep Q-Network (DQN)
In 2013, DeepMind’s DQN algorithm shocked the world by learning to play a wide range of Atari games at a superhuman level, using only the raw screen pixels as input.
The core of DQN is using a deep convolutional neural network (CNN) to approximate the optimal action-value function \(Q^*(s,a)\).
## DQN's Loss Function: How to Train the Network?
How do we train this Q-network? We want the network's prediction $Q(s, a; \mathbf{w})$ to be as close as possible to the 'target value' given by the Bellman Optimality Equation.
Therefore, we can define a loss function similar to supervised learning, typically the mean squared error (MSE):
$$
\large{ L(\mathbf{w}) = E \left[ \left( \underbrace{R + \gamma \max_{a'} Q(s', a'; \mathbf{w})}_{\text{TD Target, a.k.a. y}} - \underbrace{Q(s, a; \mathbf{w})}_{\text{Prediction}} \right)^2 \right] }
$$
We can then use gradient descent to optimize the network weights $\mathbf{w}$.
## DQN's Secret 1: Experience Replay
Training directly with the above method can be very unstable. DQN introduced two key techniques. The first is **Experience Replay**.
The agent stores each transition \((s, a, r, s')\) in a replay buffer and, at training time, samples mini-batches uniformly at random from this buffer rather than learning from consecutive transitions.
- **Effect**: 1. **Breaks data correlation**, satisfying the i.i.d. assumption for training. 2. **Improves data efficiency** by reusing experiences.
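In practice a replay buffer is often just a bounded queue plus uniform random sampling; the following is a sketch, not DQN's exact implementation:

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores transitions (s, a, r, s', done) and samples random mini-batches."""
    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions drop out first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int):
        batch = random.sample(self.buffer, batch_size)   # uniform, breaks correlation
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)

# Usage sketch with invented transitions
buffer = ReplayBuffer(capacity=1000)
for i in range(100):
    buffer.push(i, i % 4, 0.0, i + 1, False)
states, actions, rewards, next_states, dones = buffer.sample(batch_size=8)
```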
## DQN's Secret 2: Target Network
The second reason for instability is that we use the same network for both prediction and target calculation. The target value $y = R + \gamma \max_{a'} Q(s', a'; \mathbf{w})$ changes with every update to $\mathbf{w}$, which is like chasing a moving target.
The solution is a target network: a copy of the Q-network with weights \(\mathbf{w}^-\) that is used only to compute the TD target and is synchronized with the main network only every fixed number of steps.
Effect: The target network \(\mathbf{w}^-\) is ‘frozen’ between synchronizations, making the learning target much more stable.
The Overestimation Problem in DQN
Standard DQN tends to systematically overestimate Q-values because of the max operator used in calculating the TD target. \[
\large{ y_t = R_{t+1} + \gamma \max_{a'} Q(s', a'; \mathbf{w}) }
\] If the Q-value estimates themselves have noise, taking the maximum will amplify positive noise, leading to an optimistic bias. This bias can propagate and accumulate through the bootstrapping process.
Solution: Double DQN (DDQN)
DDQN addresses the overestimation problem by decoupling ‘action selection’ from ‘value evaluation’.
DDQN Target: It uses the main network to select the best action, but uses the target network to evaluate the value of that action. \[
\large{ Y_t^{DDQN} = R_{t+1} + \gamma Q(S_{t+1}, \arg\max_{a'} Q(S_{t+1}, a'; \mathbf{w}_t); \mathbf{w}_t^-) }
\]
This decoupling leads to more conservative and accurate value estimates.
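In tabular form the difference from the standard DQN target is a single line; here is a sketch with invented Q-tables standing in for the main and target networks:

```python
import numpy as np

rng = np.random.default_rng(0)
q_main = rng.normal(size=(16, 4))     # invented Q-values from the main network
q_target = rng.normal(size=(16, 4))   # invented Q-values from the target network
gamma, reward, s_next = 0.99, 1.0, 7

# Standard DQN target: the target network both selects and evaluates the action
y_dqn = reward + gamma * np.max(q_target[s_next])

# Double DQN target: the main network selects, the target network evaluates
best_action = np.argmax(q_main[s_next])
y_ddqn = reward + gamma * q_target[s_next, best_action]

print(round(y_dqn, 3), round(y_ddqn, 3))
```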
Further Decomposing the Value Function: Dueling DQN
Dueling DQN introduces a new network architecture that splits the Q-value estimate into two streams:
State Value (\(V(s)\)): How good is it to be in this state, regardless of the action taken.
Advantage Function (\(A(s,a)\)): How much better is it to take action \(a\) compared to the average action in this state.
This architecture allows the network to learn the intrinsic value of states more efficiently, especially in situations where many actions have similar values.
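For reference, the two streams are usually recombined by subtracting the mean advantage (the identifiable form proposed in the Dueling DQN paper): \[
\large{ Q(s, a; \mathbf{w}) = V(s; \mathbf{w}) + \left( A(s, a; \mathbf{w}) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a'; \mathbf{w}) \right) }
\]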
Conclusion: RL Provides a New Paradigm for Economic Decision-Making
Core Framework: MDP provides a unified mathematical language for sequential decision problems.
Core Idea: Through an iterative loop of policy evaluation and improvement, the agent converges to an optimal decision-making process.
Core Algorithms: From Q-Learning to DQN and its variants, algorithmic advancements are making it possible to solve increasingly complex problems.
Economic Significance: RL is more than just a tool for game AI; it offers a powerful, data-driven toolkit for dynamic pricing, resource management, portfolio optimization, auction design, and many other fields in economics.
Future Directions: RL’s Application in Economics
Market Modeling: Simulating markets with multiple RL agents to study the emergence of market equilibria and complex dynamics.
Algorithmic Game Theory: Designing agents that perform optimally in competitive or cooperative environments, such as auctions or supply chain negotiations.
Personalized Policies: Developing dynamic, individualized strategies for each user (e.g., in e-commerce recommendations) or each asset (e.g., in a portfolio).
Reinforcement learning is evolving from a subfield of computer science into a general framework for understanding and optimizing intelligent decision-making in complex systems.