Reinforcement Learning is a subfield of machine learning that focuses on training agents to make decisions in environments by maximizing a reward signal. This guide introduces the key concepts of Reinforcement Learning, covering value/policy iteration, Q-learning and its variants, policy gradient and actor-critic methods, and a survey of more advanced topics. We refer the reader to OpenAI Spinning Up and CMU 16-831 for more details.
Recap of Markov Decision Processes
A Markov Decision Process (MDP) provides the mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of a decision-maker. An MDP is defined by the following components:
State Space ($\mathcal{S}$): The set of all possible states in the environment
- Each state contains all relevant information needed to make decisions
- The Markov property states that the future only depends on the current state, not the history
Action Space ($\mathcal{A}$): The set of all possible actions available to the agent
- Actions can be discrete (finite set) or continuous
- The available actions may depend on the current state
Transition Function $P(s' \mid s, a)$: The probability of transitioning to state $s'$ when taking action $a$ in state $s$
- Must satisfy $\sum_{s'} P(s' \mid s, a) = 1$ for all $s \in \mathcal{S}$, $a \in \mathcal{A}$
- Captures the dynamics of the environment
Reward Function $R(s, a, s')$: The immediate reward received when transitioning from $s$ to $s'$ via action $a$
- Defines the goal/objective for the agent
- Can also be written as $R(s)$ or $R(s, a)$ depending on what it is allowed to depend on
Discount Factor ($\gamma$): A value in $[0, 1]$ that determines the importance of future rewards
- $\gamma = 0$: Myopic evaluation (only immediate rewards matter)
- $\gamma \to 1$: Far-sighted evaluation (future rewards are important)
- Ensures convergence of infinite horizon problems
The goal in an MDP is to find an optimal policy $\pi^*$ that maximizes the expected cumulative discounted reward:

$$\pi^* = \arg\max_{\pi} \; \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t, s_{t+1})\right]$$
Key Functions:
- Value Function $V^{\pi}(s)$: Expected return starting from state $s$ and following policy $\pi$
- Action-Value Function $Q^{\pi}(s, a)$: Expected return starting from state $s$, taking action $a$, and following $\pi$ thereafter
- Policy $\pi(a \mid s)$: Probability distribution over actions given the current state
The optimal policy and value function are related through the Bellman optimality equations:

$$V^*(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\left[R(s, a, s') + \gamma V^*(s')\right]$$

$$Q^*(s, a) = \sum_{s'} P(s' \mid s, a)\left[R(s, a, s') + \gamma \max_{a'} Q^*(s', a')\right]$$
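To make the notation concrete, a small finite MDP can be represented directly as arrays. The NumPy sketch below (shapes and names are illustrative, not from any particular library) stores $P$ and $R$ as $|\mathcal{S}| \times |\mathcal{A}| \times |\mathcal{S}|$ tensors and applies one Bellman optimality backup to a value vector; the value/policy iteration sketches later reuse this representation.

```python
import numpy as np

def bellman_optimality_backup(P, R, V, gamma=0.99):
    """One Bellman optimality backup for a tabular MDP.

    P: transition tensor, shape (S, A, S), P[s, a, s'] = P(s' | s, a)
    R: reward tensor,     shape (S, A, S), R[s, a, s'] = R(s, a, s')
    V: current value estimates, shape (S,)
    Returns Q (S, A) and the backed-up values max_a Q (S,).
    """
    # Q[s, a] = sum_{s'} P(s'|s, a) * (R(s, a, s') + gamma * V(s'))
    Q = np.einsum("sap,sap->sa", P, R + gamma * V[None, None, :])
    return Q, Q.max(axis=1)

# Tiny random-but-valid example: 3 states, 2 actions
rng = np.random.default_rng(0)
P = rng.random((3, 2, 3)); P /= P.sum(axis=-1, keepdims=True)  # rows sum to 1
R = rng.random((3, 2, 3))
Q, V_new = bellman_optimality_backup(P, R, np.zeros(3))
```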
Value/Policy Iteration
Value iteration and policy iteration are fundamental algorithms in reinforcement learning for solving Markov Decision Processes (MDPs). These algorithms aim to find optimal policies in environments with known transition dynamics and reward functions.
Value Iteration
Value iteration works by iteratively updating state values using the Bellman optimality equation:

$$V_{k+1}(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\left[R(s, a, s') + \gamma V_k(s')\right]$$
where:
- $V_k(s)$ is the value of state $s$ at iteration $k$
- $P(s' \mid s, a)$ is the transition probability from state $s$ to $s'$ under action $a$
- $R(s, a, s')$ is the immediate reward
- $\gamma$ is the discount factor
- $\max_a$ selects the action that maximizes expected future value
The algorithm continues until the value function converges (i.e., the maximum change in value between iterations is below a threshold $\epsilon$). Once converged, the optimal policy can be extracted by:

$$\pi^*(s) = \arg\max_{a} \sum_{s'} P(s' \mid s, a)\left[R(s, a, s') + \gamma V^*(s')\right]$$
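A minimal tabular implementation, assuming the array-based MDP representation introduced above (all names here are illustrative):

```python
import numpy as np

def value_iteration(P, R, gamma=0.99, eps=1e-6):
    """Tabular value iteration; P and R have shape (S, A, S)."""
    S, A, _ = P.shape
    V = np.zeros(S)
    while True:
        # Bellman optimality backup for every state at once
        Q = np.einsum("sap,sap->sa", P, R + gamma * V[None, None, :])
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < eps:   # stop when the largest change is below epsilon
            break
        V = V_new
    pi = Q.argmax(axis=1)                     # greedy policy extraction
    return V, pi
```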
Policy Iteration
Policy iteration alternates between two steps:
Policy Evaluation: For a fixed policy $\pi$, compute the value function by solving:

$$V^{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a)\left[R(s, a, s') + \gamma V^{\pi}(s')\right]$$
This is typically done iteratively until convergence.
Policy Improvement: Update the policy to be greedy with respect to the current value function:

$$\pi'(s) = \arg\max_{a} \sum_{s'} P(s' \mid s, a)\left[R(s, a, s') + \gamma V^{\pi}(s')\right]$$
The algorithm continues alternating between these steps until the policy no longer changes. Policy iteration is guaranteed to converge to the optimal policy in finite MDPs.
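A corresponding sketch for a deterministic tabular policy, again assuming the (S, A, S)-shaped arrays from before; the evaluation step solves the linear system exactly rather than iterating:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.99):
    """Tabular policy iteration for a deterministic policy pi: S -> A."""
    S, A, _ = P.shape
    pi = np.zeros(S, dtype=int)                        # arbitrary initial policy
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) V = r_pi exactly
        P_pi = P[np.arange(S), pi]                     # (S, S): rows P(. | s, pi(s))
        r_pi = (P_pi * R[np.arange(S), pi]).sum(axis=1)  # (S,): expected one-step reward
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
        # Policy improvement: act greedily with respect to the evaluated V
        Q = np.einsum("sap,sap->sa", P, R + gamma * V[None, None, :])
        pi_new = Q.argmax(axis=1)
        if np.array_equal(pi_new, pi):                 # policy stable -> optimal
            return V, pi
        pi = pi_new
```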
Comparison and Practical Considerations
- Value iteration performs cheap backups per iteration but typically needs more iterations to converge
- Policy iteration usually converges in fewer iterations, but each policy evaluation step requires solving a system of linear equations (or sweeping to convergence)
- Modified Policy Iteration (MPI) combines aspects of both by performing partial policy evaluation steps
- In practice, both algorithms may be impractical for large state spaces, leading to approximate versions using function approximation
Q-Learning and Variants
Q-Learning is a model-free reinforcement learning algorithm that learns an action-value function $Q(s, a)$ representing the expected return of taking action $a$ in state $s$. The core update equation is:

$$Q(s, a) \leftarrow Q(s, a) + \alpha\left[r + \gamma \max_{a'} Q(s', a') - Q(s, a)\right]$$
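A tabular sketch of this update with $\epsilon$-greedy exploration; the Gymnasium-style environment interface with discrete state and action spaces is an assumption for illustration:

```python
import numpy as np

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning with epsilon-greedy exploration (Gymnasium-style env)."""
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    for _ in range(episodes):
        s, _ = env.reset()
        done = False
        while not done:
            # epsilon-greedy behaviour policy
            a = env.action_space.sample() if np.random.rand() < epsilon else int(Q[s].argmax())
            s_next, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            # off-policy target bootstraps from the greedy action in s'
            target = r + gamma * (0.0 if terminated else Q[s_next].max())
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```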
Deep Q-Network (DQN) extends this by using neural networks to approximate the Q-function, adding experience replay and target networks for stability.
Policy Gradient Methods
Policy gradient methods directly optimize a parameterized policy $\pi_\theta(a \mid s)$ to maximize the expected return $J(\theta)$. The key insight is the policy gradient theorem:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)\right]$$
Common algorithms include:
- REINFORCE: The simplest policy gradient algorithm, which uses Monte Carlo returns in place of $Q^{\pi_\theta}$ (a minimal update sketch follows this list)
- PPO (Proximal Policy Optimization): Uses clipped surrogate objective for stable updates
- TRPO (Trust Region Policy Optimization): Ensures policy updates stay within trust regions
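A minimal REINFORCE update in PyTorch, assuming log-probabilities and rewards have been collected from one completed episode (function and variable names are illustrative):

```python
import torch

def reinforce_update(optimizer, log_probs, rewards, gamma=0.99):
    """One REINFORCE update from a single finished episode.

    log_probs: list of log pi_theta(a_t | s_t) tensors from the rollout
    rewards:   list of scalar rewards r_t
    """
    # Discounted returns G_t, computed back to front
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)
    # Normalizing returns is a common variance-reduction trick
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)
    # Surrogate loss whose gradient is -sum_t grad log pi(a_t|s_t) * G_t
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```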
Actor-Critic Methods
Actor-critic methods combine policy gradient approaches with value function approximation. The actor (policy) learns to take actions while the critic (value function) evaluates those actions. This architecture addresses key limitations of pure policy gradient and value-based methods:
- Reduced Variance: The critic provides a lower-variance baseline for policy updates compared to Monte Carlo returns
- Online Updates: Unlike REINFORCE, actor-critic methods can learn online without waiting for episode completion
- Structured Exploration: The actor can learn sophisticated exploration strategies while the critic evaluates their effectiveness
The general update equations for a simple one-step actor-critic method are (a code sketch follows):

Critic Update: $w \leftarrow w + \alpha_w\, \delta\, \nabla_w V_w(s)$, where $\delta = r + \gamma V_w(s') - V_w(s)$ is the TD error

Actor Update: $\theta \leftarrow \theta + \alpha_\theta\, \delta\, \nabla_\theta \log \pi_\theta(a \mid s)$
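A per-transition sketch of these two updates in PyTorch, assuming separate actor and critic networks and their optimizers (all names are illustrative):

```python
import torch

def actor_critic_step(actor_opt, critic_opt, log_prob, value, next_value, reward, gamma=0.99):
    """One online actor-critic update for a single transition (s, a, r, s').

    Assumes separate actor and critic networks: `value` = V_w(s) and
    `next_value` = V_w(s') come from the critic, `log_prob` = log pi_theta(a|s)
    from the actor.
    """
    # TD error delta = r + gamma * V(s') - V(s) acts as a one-sample advantage estimate
    td_target = reward + gamma * next_value.detach()
    td_error = td_target - value

    critic_loss = td_error.pow(2).mean()                  # critic: regression toward the TD target
    actor_loss = (-log_prob * td_error.detach()).mean()   # actor: policy gradient weighted by delta

    actor_opt.zero_grad()
    critic_opt.zero_grad()
    (actor_loss + critic_loss).backward()
    actor_opt.step()
    critic_opt.step()
```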
Key algorithms include:
A2C/A3C (Advantage Actor-Critic / Asynchronous Advantage Actor-Critic)
- Uses the advantage function $A(s, a) = Q(s, a) - V(s)$ to reduce variance
- A3C runs multiple parallel actors for better exploration and stability
- Synchronous (A2C) vs asynchronous (A3C) variants trade off between computational efficiency and stability
- Typical architecture uses shared lower layers between actor and critic networks
DDPG (Deep Deterministic Policy Gradient)
- Designed specifically for continuous action spaces
- Uses the deterministic policy gradient theorem: $\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \rho^{\mu}}\left[\nabla_\theta \mu_\theta(s)\, \nabla_a Q(s, a)\big|_{a = \mu_\theta(s)}\right]$
- Employs target networks and experience replay for stability
- Adds exploration noise (typically Ornstein-Uhlenbeck process) to the deterministic policy
- Architecture includes:
- Actor network $\mu_\theta(s)$: a deterministic mapping from states to actions
- Critic network $Q_\phi(s, a)$: estimates the value of state-action pairs
- Target networks $\mu_{\theta'}$ and $Q_{\phi'}$ with soft (Polyak) updates $\theta' \leftarrow \tau \theta + (1 - \tau)\theta'$ (sketched after this list)
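A sketch of the two pieces mentioned above in PyTorch: the Polyak (soft) target update and the bootstrapped critic target. The assumption that the critic takes (state, action) pairs and the actor takes states is illustrative:

```python
import torch

def soft_update(target_net, online_net, tau=0.005):
    """Polyak averaging: theta' <- tau * theta + (1 - tau) * theta'."""
    with torch.no_grad():
        for p_targ, p in zip(target_net.parameters(), online_net.parameters()):
            p_targ.mul_(1.0 - tau).add_(tau * p)

def ddpg_targets(target_actor, target_critic, reward, next_state, done, gamma=0.99):
    """Bootstrapped critic targets y = r + gamma * Q'(s', mu'(s')) for a batch."""
    with torch.no_grad():
        next_action = target_actor(next_state)
        y = reward + gamma * (1.0 - done) * target_critic(next_state, next_action)
    return y
```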
SAC (Soft Actor-Critic)
- Maximizes both expected return and policy entropy
- Objective: $J(\pi) = \mathbb{E}_{\pi}\left[\sum_t r(s_t, a_t) + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big)\right]$
- Uses two Q-functions to mitigate overestimation bias
- Automatically tunes the temperature parameter $\alpha$ to match a target entropy
- Key components:
- Soft value function: $V(s) = \mathbb{E}_{a \sim \pi}\left[Q(s, a) - \alpha \log \pi(a \mid s)\right]$
- Soft Q-function: $Q(s, a) = r(s, a) + \gamma\, \mathbb{E}_{s'}\left[V(s')\right]$ (target computation sketched after this list)
- Policy: a stochastic policy $\pi(a \mid s)$ trained to minimize $\mathbb{E}_{a \sim \pi}\left[\alpha \log \pi(a \mid s) - Q(s, a)\right]$
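A sketch of the soft Q-learning target in PyTorch, combining the entropy bonus with the minimum over two target critics; the interfaces (`policy` returning a sampled action and its log-probability, critics taking state-action pairs) are assumptions:

```python
import torch

def sac_q_targets(policy, q1_targ, q2_targ, reward, next_state, done, alpha=0.2, gamma=0.99):
    """Soft Q targets with clipped double-Q and an entropy bonus."""
    with torch.no_grad():
        next_action, next_logp = policy(next_state)
        # Min over two target critics mitigates overestimation; -alpha * log pi adds the entropy bonus
        q_next = torch.min(q1_targ(next_state, next_action), q2_targ(next_state, next_action))
        y = reward + gamma * (1.0 - done) * (q_next - alpha * next_logp)
    return y
```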
TD3 (Twin Delayed DDPG)
- Addresses overestimation bias in DDPG using double Q-learning
- Delays policy (actor) updates relative to critic updates to limit error accumulation
- Uses clipped double Q-learning and target policy smoothing (see the target sketch after this list)
- Particularly effective for complex continuous control tasks
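A sketch of TD3's critic target, combining target policy smoothing with the clipped double-Q minimum; the network interfaces and hyperparameter values below are illustrative assumptions:

```python
import torch

def td3_targets(target_actor, q1_targ, q2_targ, reward, next_state, done,
                gamma=0.99, noise_std=0.2, noise_clip=0.5, act_limit=1.0):
    """TD3 critic targets with target policy smoothing and clipped double-Q."""
    with torch.no_grad():
        # Target policy smoothing: add clipped Gaussian noise to the target action
        a_next = target_actor(next_state)
        noise = (torch.randn_like(a_next) * noise_std).clamp(-noise_clip, noise_clip)
        next_action = (a_next + noise).clamp(-act_limit, act_limit)
        # Clipped double-Q: bootstrap from the minimum of the two target critics
        q_next = torch.min(q1_targ(next_state, next_action), q2_targ(next_state, next_action))
        y = reward + gamma * (1.0 - done) * q_next
    return y
```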
Implementation considerations:
- Network architectures typically use separate networks for actor and critic
- Learning rates often differ between actor and critic (the critic is commonly trained with a larger learning rate than the actor; see the snippet after this list)
- Gradient clipping and batch normalization can improve stability
- Experience replay buffer size and batch size significantly impact performance
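A small PyTorch snippet illustrating two of these points, separate optimizers with different learning rates and gradient clipping before the optimizer step; the network sizes, learning rates, and the placeholder loss are illustrative only:

```python
import torch
import torch.nn as nn

obs_dim, act_dim = 8, 2                        # illustrative dimensions
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))

# Separate optimizers with different learning rates (values are placeholders)
actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

# Dummy batch and placeholder target, only to show the update pattern
obs = torch.randn(32, obs_dim)
act = actor(obs)
target = torch.zeros(32, 1)
critic_loss = (critic(torch.cat([obs, act.detach()], dim=-1)) - target).pow(2).mean()

critic_opt.zero_grad()
critic_loss.backward()
# Gradient clipping before the step can improve stability
torch.nn.utils.clip_grad_norm_(critic.parameters(), max_norm=1.0)
critic_opt.step()
```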
Advanced RL Algorithms
Modern RL research has produced several sophisticated algorithms that address different challenges in reinforcement learning:
Model-Based RL
- Learns an explicit model of environment dynamics $\hat{p}_\theta(s_{t+1} \mid s_t, a_t)$ (a minimal sketch follows this list)
- Uses learned model for planning and policy optimization
- Key advantages:
- Improved sample efficiency through model-based planning
- Better generalization to new scenarios
- Notable algorithms:
- PETS (Probabilistic Ensembles with Trajectory Sampling)
- MBPO (Model-Based Policy Optimization)
- PlaNet (Deep Planning Network)
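A minimal one-step dynamics model trained by regression on observed transitions, as a PyTorch sketch; note that algorithms such as PETS and MBPO use probabilistic ensembles rather than the single deterministic network shown here:

```python
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    """A one-step dynamics model s_hat' = f(s, a), trained by regression."""
    def __init__(self, obs_dim, act_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, obs_dim),
        )

    def forward(self, state, action):
        # Predict the change in state (a common parameterization) and add it back
        return state + self.net(torch.cat([state, action], dim=-1))

def model_loss(model, state, action, next_state):
    """Mean-squared error between predicted and observed next states."""
    return ((model(state, action) - next_state) ** 2).mean()
```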
Hierarchical RL
- Decomposes complex tasks into hierarchical structure of subtasks
- Uses multiple levels of temporal abstraction
- Components typically include:
- High-level policy (meta-controller): Selects subtasks/goals
- Low-level policy (controller): Executes primitive actions
- Popular frameworks:
- Options Framework
- HAM (Hierarchical Abstract Machines)
- MAXQ Decomposition
Multi-Agent RL
- Handles systems with multiple interacting agents
- Key challenges:
- Non-stationarity due to changing agent behaviors
- Credit assignment across agents
- Coordination and communication
- Major approaches:
- Independent Learning: Each agent learns independently
- Centralized Training with Decentralized Execution (CTDE)
- Fully Centralized: Joint action-value function
- Notable algorithms:
- QMIX: Monotonic value function factorization
- MADDPG: Multi-agent extension of DDPG
- MAPPO: Multi-agent PPO variant
Meta-RL
- Learns to quickly adapt to new tasks using previous experience
- Key components:
- Meta-learner: Learns general learning strategy
- Task-specific learner: Adapts to specific tasks
- Common approaches:
- Recurrent architectures (RL²)
- Gradient-based meta-learning (MAML)
- Context-based adaptation (PEARL)
- Applications:
- Few-shot task adaptation
- Continuous learning and adaptation
- Transfer learning across domains
Offline RL
- Learns optimal policies from fixed datasets without environment interaction
- Addresses challenges like:
- Distribution shift
- Out-of-distribution actions
- Conservative policy updates
- Key algorithms:
- Conservative Q-Learning (CQL)
- Behavior Regularized Actor Critic (BRAC)
- Implicit Q-Learning (IQL)
- Applications:
- Healthcare
- Robotics from demonstrations
- Industrial process control
Each approach offers distinct advantages and faces unique challenges:
- Sample Efficiency: Model-based methods typically require fewer interactions
- Stability: Offline RL provides stable learning from fixed datasets
- Scalability: Multi-agent methods face exponential complexity with agent count
- Generalization: Meta-RL enables quick adaptation but requires diverse training tasks
- Computational Cost: Hierarchical methods reduce search space but increase architectural complexity