
Reinforcement Learning Basics

Mingtong Zhang · 9 min read

Reinforcement Learning is a subfield of machine learning that focuses on training agents to make decisions in an environment by maximizing a reward signal. This guide introduces the key concepts of Reinforcement Learning, covering Value/Policy Iteration, Q-Learning and Variants, Policy Gradient Methods, Actor-Critic Methods, and a brief survey of advanced algorithms. We refer readers to OpenAI Spinning Up and CMU 16-831 for more details.

Recap of Markov Decision Processes

A Markov Decision Process (MDP) provides the mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of a decision-maker. An MDP is defined by the following components:

  1. State Space ($\mathcal{S}$): The set of all possible states in the environment

    • Each state $s \in \mathcal{S}$ contains all relevant information needed to make decisions
    • The Markov property states that the future depends only on the current state, not on the history
  2. Action Space ($\mathcal{A}$): The set of all possible actions available to the agent

    • Actions can be discrete (finite set) or continuous
    • The available actions may depend on the current state
  3. Transition Function $P(s'|s,a)$: The probability of transitioning to state $s'$ when taking action $a$ in state $s$

    • $P: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow [0,1]$
    • Must satisfy $\sum_{s' \in \mathcal{S}} P(s'|s,a) = 1$ for all $s, a$
    • Captures the dynamics of the environment
  4. Reward Function $R(s,a,s')$: The immediate reward received when transitioning from $s$ to $s'$ via action $a$

    • $R: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow \mathbb{R}$
    • Defines the goal/objective for the agent
    • Can also be written as $R(s,a)$ or $R(s)$ depending on dependencies
  5. Discount Factor ($\gamma$): A value in $[0,1]$ that determines the importance of future rewards

    • $\gamma = 0$: Myopic evaluation (only immediate rewards matter)
    • $\gamma \rightarrow 1$: Far-sighted evaluation (future rewards are important)
    • A discount factor below 1 ensures the infinite-horizon return converges

The goal in an MDP is to find an optimal policy $\pi^*(s)$ that maximizes the expected cumulative discounted reward:

$$V^*(s) = \max_\pi \mathbb{E}_\pi \left[ \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t, s_{t+1}) \,\middle|\, s_0 = s \right]$$

Key Functions:

  • Value Function $V(s)$: Expected return starting from state $s$
  • Action-Value Function $Q(s,a)$: Expected return starting from state $s$ and taking action $a$
  • Policy $\pi(a|s)$: Probability distribution over actions given the current state

The optimal policy and value function are related through the Bellman optimality equations:

$$V^*(s) = \max_a \sum_{s'} P(s'|s,a)\left[R(s,a,s') + \gamma V^*(s')\right]$$

$$Q^*(s,a) = \sum_{s'} P(s'|s,a)\left[R(s,a,s') + \gamma \max_{a'} Q^*(s',a')\right]$$

Value/Policy Iteration

Value iteration and policy iteration are fundamental dynamic-programming algorithms for solving MDPs. Both find optimal policies in environments whose transition dynamics and reward functions are known.

Value Iteration

Value iteration works by iteratively updating state values using the Bellman optimality equation:

$$V_{k+1}(s) = \max_a \sum_{s'} P(s'|s,a)\left[R(s,a,s') + \gamma V_k(s')\right]$$

where:

  • $V_k(s)$ is the value of state $s$ at iteration $k$
  • $P(s'|s,a)$ is the transition probability from state $s$ to $s'$ under action $a$
  • $R(s,a,s')$ is the immediate reward
  • $\gamma$ is the discount factor
  • $\max_a$ finds the action that maximizes expected future value

The algorithm continues until the value function converges (i.e., the maximum change in value between iterations is below a threshold $\epsilon$). Once converged, the optimal policy can be extracted by:

$$\pi^*(s) = \arg\max_a \sum_{s'} P(s'|s,a)\left[R(s,a,s') + \gamma V^*(s')\right]$$
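
A minimal NumPy sketch of tabular value iteration under these equations; the dense arrays `P[s, a, s']` and `R[s, a, s']` and the function name are illustrative assumptions, not something prescribed by this post:

```python
import numpy as np

def value_iteration(P, R, gamma=0.99, tol=1e-6):
    """Tabular value iteration on a known MDP.

    P: transition probabilities, shape (S, A, S'), with P[s, a, s2] = P(s2|s, a)
    R: rewards, shape (S, A, S'), with R[s, a, s2] = R(s, a, s2)
    Returns the converged value function and a greedy policy extracted from it.
    """
    n_states, _, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Bellman optimality backup: Q[s, a] = sum_s2 P(s2|s, a) * (R(s, a, s2) + gamma * V(s2))
        Q = np.einsum("sat,sat->sa", P, R + gamma * V[None, None, :])
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)  # greedy policy extraction
        V = V_new
```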

Policy Iteration

Policy iteration alternates between two steps:

  1. Policy Evaluation: For a fixed policy $\pi$, compute the value function by solving:

    $$V^\pi(s) = \sum_{s'} P(s'|s,\pi(s))\left[R(s,\pi(s),s') + \gamma V^\pi(s')\right]$$

    This is typically done iteratively until convergence.

  2. Policy Improvement: Update the policy to be greedy with respect to the current value function:

    $$\pi_{\text{new}}(s) = \arg\max_a \sum_{s'} P(s'|s,a)\left[R(s,a,s') + \gamma V^\pi(s')\right]$$

The algorithm continues alternating between these steps until the policy no longer changes. Policy iteration is guaranteed to converge to the optimal policy in finite MDPs.
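
A matching NumPy sketch of policy iteration, reusing the same assumed `P[s, a, s']` and `R[s, a, s']` arrays; here the evaluation step solves the linear system exactly rather than iterating:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.99):
    """Tabular policy iteration with exact policy evaluation."""
    n_states, n_actions, _ = P.shape
    policy = np.zeros(n_states, dtype=int)
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) V = r_pi for the current policy
        P_pi = P[np.arange(n_states), policy]                           # (S, S')
        r_pi = np.einsum("st,st->s", P_pi, R[np.arange(n_states), policy])
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
        # Policy improvement: act greedily with respect to V^pi
        Q = np.einsum("sat,sat->sa", P, R + gamma * V[None, None, :])
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return V, policy                                            # policy is stable
        policy = new_policy
```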

Comparison and Practical Considerations

  • Value iteration performs a cheap Bellman backup per iteration but typically needs more iterations to converge
  • Policy iteration usually converges in fewer iterations, but each policy evaluation step requires solving a system of linear equations (or running iterative evaluation)
  • Modified Policy Iteration (MPI) combines aspects of both by performing partial policy evaluation steps
  • In practice, both algorithms may be impractical for large state spaces, leading to approximate versions using function approximation

Q-Learning and Variants

Q-Learning is a model-free reinforcement learning algorithm that learns an action-value function $Q(s,a)$ representing the expected return of taking action $a$ in state $s$. The core update equation is:

$$Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha\left[r_t + \gamma \max_a Q(s_{t+1},a) - Q(s_t,a_t)\right]$$

Deep Q-Network (DQN) extends this by using neural networks to approximate the Q-function, adding experience replay and target networks for stability.
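
For concreteness, a sketch of the tabular update above written against a Gymnasium-style environment API (an assumption; the post does not prescribe any library), with discrete integer states and actions so that $Q$ is an `(S, A)` array:

```python
import numpy as np

def q_learning_episode(env, Q, alpha=0.1, gamma=0.99, epsilon=0.1):
    """One episode of tabular Q-learning with epsilon-greedy exploration."""
    state, _ = env.reset()
    done = False
    while not done:
        # Epsilon-greedy action selection
        if np.random.rand() < epsilon:
            action = np.random.randint(Q.shape[1])
        else:
            action = int(np.argmax(Q[state]))
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        # TD target uses the max over next actions; zero bootstrap at terminal states
        target = reward + (0.0 if terminated else gamma * np.max(Q[next_state]))
        Q[state, action] += alpha * (target - Q[state, action])
        state = next_state
    return Q
```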

Policy Gradient Methods

Policy gradient methods directly optimize a parameterized policy $\pi_\theta(a|s)$ to maximize expected return. The key insight is the policy gradient theorem:

$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_t \nabla_\theta \log \pi_\theta(a_t|s_t)\, R(\tau)\right]$$
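
A short PyTorch sketch of the Monte Carlo estimator this theorem suggests; it substitutes the reward-to-go for the full trajectory return $R(\tau)$ (a standard, lower-variance form), and `policy_net`, a module mapping states to action logits, is a hypothetical stand-in:

```python
import torch

def reinforce_loss(policy_net, states, actions, rewards, gamma=0.99):
    """REINFORCE surrogate loss for one trajectory.

    states: (T, obs_dim) tensor, actions: (T,) long tensor, rewards: list of floats.
    Minimizing this loss ascends the estimate sum_t grad log pi(a_t|s_t) * G_t,
    where G_t is the discounted return-to-go from step t.
    """
    # Discounted returns-to-go, computed backwards through the trajectory
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns = torch.tensor(list(reversed(returns)))

    logits = policy_net(states)                                        # (T, n_actions)
    log_probs = torch.distributions.Categorical(logits=logits).log_prob(actions)
    return -(log_probs * returns).mean()                               # negate for gradient ascent
```

Minimizing this loss with any standard optimizer performs stochastic gradient ascent on $J(\pi_\theta)$.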

Common algorithms include:

  • REINFORCE: The simplest policy gradient algorithm
  • PPO (Proximal Policy Optimization): Uses a clipped surrogate objective for stable updates (see the sketch after this list)
  • TRPO (Trust Region Policy Optimization): Ensures policy updates stay within trust regions
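
As a sketch of PPO's clipped surrogate objective (assuming PyTorch tensors; the variable names are illustrative, not from the original post):

```python
import torch

def ppo_clipped_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """PPO clipped surrogate loss for a batch of (state, action) pairs.

    log_probs_new: log pi_theta(a|s) under the current policy (requires grad)
    log_probs_old: log pi_theta_old(a|s) from the policy that collected the data
    advantages:    advantage estimates A(s, a), treated as constants
    """
    ratio = torch.exp(log_probs_new - log_probs_old)        # pi_theta / pi_theta_old
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # Take the pessimistic (minimum) of the clipped and unclipped objectives
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```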

Actor-Critic Methods

Actor-critic methods combine policy gradient approaches with value function approximation. The actor (policy) learns to take actions while the critic (value function) evaluates those actions. This architecture addresses key limitations of pure policy gradient and value-based methods:

  • Reduced Variance: The critic provides a lower-variance baseline for policy updates compared to Monte Carlo returns
  • Online Updates: Unlike REINFORCE, actor-critic methods can learn online without waiting for episode completion
  • Structured Exploration: The actor can learn sophisticated exploration strategies while the critic evaluates their effectiveness

The general update equations for actor-critic methods are:

Critic Update:

$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t) \quad \text{(TD error)}$$

$$\theta_v \leftarrow \theta_v + \alpha_v\, \delta_t \nabla_{\theta_v} V(s_t)$$

Actor Update:

$$\theta_\pi \leftarrow \theta_\pi + \alpha_\pi\, \delta_t \nabla_{\theta_\pi} \log \pi_{\theta_\pi}(a_t|s_t)$$
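
A minimal PyTorch sketch of one online actor-critic step implementing these two updates; `actor` (state → action logits), `critic` (state → scalar value), and their optimizers are assumptions, not specified by the post:

```python
import torch

def actor_critic_step(actor, critic, actor_opt, critic_opt,
                      s, a, r, s_next, done, gamma=0.99):
    """One online actor-critic update from a single transition (s, a, r, s_next, done).

    `a` is the taken action as a long tensor; `done` marks a terminal transition.
    """
    v = critic(s)
    with torch.no_grad():
        v_next = torch.zeros_like(v) if done else critic(s_next)
        td_target = r + gamma * v_next
    td_error = td_target - v                                 # delta_t

    # Critic update: move V(s) toward the one-step TD target
    critic_loss = td_error.pow(2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update: scale grad log pi(a|s) by the TD error (treated as a constant)
    log_prob = torch.distributions.Categorical(logits=actor(s)).log_prob(a)
    actor_loss = -(log_prob * td_error.detach()).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```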

Key algorithms include:

A2C/A3C (Advantage Actor-Critic)

  • Uses the advantage function $A(s,a) = Q(s,a) - V(s)$ to reduce variance
  • A3C runs multiple parallel actors for better exploration and stability
  • Synchronous (A2C) vs asynchronous (A3C) variants trade off between computational efficiency and stability
  • Typical architecture uses shared lower layers between actor and critic networks

DDPG (Deep Deterministic Policy Gradient)

  • Designed specifically for continuous action spaces
  • Uses the deterministic policy gradient theorem: $\nabla_{\theta_\pi} J(\pi_\theta) = \mathbb{E}_{s\sim\rho^\pi}\left[\nabla_a Q(s,a)\big|_{a=\pi(s)}\, \nabla_{\theta_\pi} \pi_{\theta_\pi}(s)\right]$
  • Employs target networks and experience replay for stability
  • Adds exploration noise (typically Ornstein-Uhlenbeck process) to the deterministic policy
  • Architecture includes:
    • Actor network: $\mu(s|\theta^\mu)$
    • Critic network: $Q(s,a|\theta^Q)$
    • Target networks: $\mu'$ and $Q'$ with soft (Polyak) updates, as sketched below
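
A small sketch of the soft (Polyak) target update used by DDPG-style methods, $\theta' \leftarrow \tau\theta + (1-\tau)\theta'$, assuming `net` and `target_net` are PyTorch modules with identical architectures:

```python
import torch

@torch.no_grad()
def soft_update(net, target_net, tau=0.005):
    """Polyak-average the online parameters into the target parameters in place."""
    for p, p_targ in zip(net.parameters(), target_net.parameters()):
        p_targ.mul_(1.0 - tau).add_(tau * p)
```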

SAC (Soft Actor-Critic)

  • Maximizes both expected return and policy entropy
  • Objective: $J(\pi) = \sum_t \mathbb{E}_{(s_t,a_t)\sim\rho_\pi}\left[r(s_t,a_t) + \alpha H(\pi(\cdot|s_t))\right]$
  • Uses two Q-functions to mitigate overestimation bias
  • Automatically tunes the temperature parameter $\alpha$ to match a target entropy
  • Key components:
    • Soft value function: $V(s_t) = \mathbb{E}_{a_t\sim\pi}\left[Q(s_t,a_t) - \alpha\log\pi(a_t|s_t)\right]$
    • Soft Q-function: $Q(s_t,a_t) = r_t + \gamma\,\mathbb{E}_{s_{t+1}}\left[V(s_{t+1})\right]$ (see the critic-target sketch after this list)
    • Policy: $\pi(a_t|s_t) \propto \exp\left(\frac{1}{\alpha}Q(s_t,a_t)\right)$
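
To show how these pieces fit together, a sketch of the entropy-regularized target used when training SAC's Q-functions; `policy.sample` (returning an action and its log-probability) and the twin target critics `q1_targ`, `q2_targ` are assumed interfaces, not prescribed by the post:

```python
import torch

@torch.no_grad()
def sac_critic_target(policy, q1_targ, q2_targ, r, s_next, done, gamma=0.99, alpha=0.2):
    """Compute r + gamma * (1 - done) * (min_i Q_i'(s', a') - alpha * log pi(a'|s'))."""
    a_next, log_prob = policy.sample(s_next)                      # a' ~ pi(.|s')
    q_next = torch.min(q1_targ(s_next, a_next), q2_targ(s_next, a_next))
    soft_value = q_next - alpha * log_prob                        # soft state value of s'
    return r + gamma * (1.0 - done) * soft_value
```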

TD3 (Twin Delayed DDPG)

  • Addresses overestimation bias in DDPG using double Q-learning
  • Delays policy (actor) updates relative to critic updates so the actor trains against more stable value estimates
  • Uses clipped double Q-learning and target policy smoothing
  • Particularly effective for complex continuous control tasks

Implementation considerations:

  • Network architectures typically use separate networks for actor and critic
  • Learning rates often differ between actor and critic (commonly $\alpha_\pi < \alpha_v$)
  • Gradient clipping and batch normalization can improve stability
  • Experience replay buffer size and batch size significantly impact performance

Advanced RL Algorithms

Modern RL research has produced several sophisticated algorithms that address different challenges in reinforcement learning:

Model-Based RL

  • Learns an explicit model of environment dynamics, $s_{t+1} = f(s_t, a_t)$ (a toy fitting sketch follows this list)
  • Uses learned model for planning and policy optimization
  • Key advantages:
    • Improved sample efficiency through model-based planning
    • Better generalization to new scenarios
  • Notable algorithms:
    • PETS (Probabilistic Ensembles with Trajectory Sampling)
    • MBPO (Model-Based Policy Optimization)
    • PlaNet (Deep Planning Network)
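
As a toy illustration of learning a dynamics model from logged transitions (deliberately much simpler than the neural ensembles used by PETS, MBPO, or PlaNet), a least-squares fit of a linear model $s_{t+1} \approx W\,[s_t;\, a_t;\, 1]$ in NumPy:

```python
import numpy as np

def fit_linear_dynamics(states, actions, next_states):
    """Least-squares fit of next_state ~ W @ [state; action; 1] from logged transitions.

    states: (N, ds), actions: (N, da), next_states: (N, ds) arrays.
    Returns W with shape (ds, ds + da + 1).
    """
    X = np.hstack([states, actions, np.ones((len(states), 1))])   # (N, ds + da + 1)
    W, *_ = np.linalg.lstsq(X, next_states, rcond=None)           # (ds + da + 1, ds)
    return W.T

def predict_next_state(W, s, a):
    """One-step rollout under the learned linear model."""
    return W @ np.concatenate([s, a, [1.0]])
```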

Hierarchical RL

  • Decomposes complex tasks into hierarchical structure of subtasks
  • Uses multiple levels of temporal abstraction
  • Components typically include:
    • High-level policy (meta-controller): Selects subtasks/goals
    • Low-level policy (controller): Executes primitive actions
  • Popular frameworks:
    • Options Framework
    • HAM (Hierarchical Abstract Machines)
    • MAXQ Decomposition

Multi-Agent RL

  • Handles systems with multiple interacting agents
  • Key challenges:
    • Non-stationarity due to changing agent behaviors
    • Credit assignment across agents
    • Coordination and communication
  • Major approaches:
    • Independent Learning: Each agent learns independently
    • Centralized Training with Decentralized Execution (CTDE)
    • Fully Centralized: Joint action-value function
  • Notable algorithms:
    • QMIX: Monotonic value function factorization
    • MADDPG: Multi-agent extension of DDPG
    • MAPPO: Multi-agent PPO variant

Meta-RL

  • Learns to quickly adapt to new tasks using previous experience
  • Key components:
    • Meta-learner: Learns general learning strategy
    • Task-specific learner: Adapts to specific tasks
  • Common approaches:
    • Recurrent architectures (RL²)
    • Gradient-based meta-learning (MAML)
    • Context-based adaptation (PEARL)
  • Applications:
    • Few-shot task adaptation
    • Continuous learning and adaptation
    • Transfer learning across domains

Offline RL

  • Learns optimal policies from fixed datasets without environment interaction
  • Addresses challenges like:
    • Distribution shift
    • Out-of-distribution actions
    • Conservative policy updates
  • Key algorithms:
    • Conservative Q-Learning (CQL)
    • Behavior Regularized Actor Critic (BRAC)
    • Implicit Q-Learning (IQL)
  • Applications:
    • Healthcare
    • Robotics from demonstrations
    • Industrial process control

Each approach offers distinct advantages and faces unique challenges:

  • Sample Efficiency: Model-based methods typically require fewer interactions
  • Stability: Offline RL provides stable learning from fixed datasets
  • Scalability: Multi-agent methods face exponential complexity with agent count
  • Generalization: Meta-RL enables quick adaptation but requires diverse training tasks
  • Computational Cost: Hierarchical methods reduce search space but increase architectural complexity