Reinforcement Learning is a subfield of machine learning that focuses on training agents to make decisions in environments by maximizing a reward signal. This guide introduces the key concepts of Reinforcement Learning, covering value/policy iteration, Q-learning and its variants, policy gradient and actor-critic methods, and a survey of more advanced topics. We refer the reader to OpenAI Spinning Up and CMU 16-831 for more details.
Recap of Markov Decision Processes
A Markov Decision Process (MDP) provides the mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of a decision-maker. An MDP is defined by the following components:
State Space ($\mathcal{S}$): The set of all possible states in the environment
- Each state contains all relevant information needed to make decisions
- The Markov property states that the future only depends on the current state, not the history
Action Space ($\mathcal{A}$): The set of all possible actions available to the agent
- Actions can be discrete (finite set) or continuous
- The available actions may depend on the current state
Transition Function $P(s' \mid s, a)$: The probability of transitioning to state $s'$ when taking action $a$ in state $s$
- Must satisfy $\sum_{s'} P(s' \mid s, a) = 1$ for all $s \in \mathcal{S}$, $a \in \mathcal{A}$
- Captures the dynamics of the environment
Reward Function $R(s, a, s')$: The immediate reward received when transitioning from $s$ to $s'$ via action $a$
- Defines the goal/objective for the agent
- Can also be written as $R(s)$ or $R(s, a)$ depending on what it is allowed to depend on
Discount Factor ($\gamma$): A value in $[0, 1]$ that determines the importance of future rewards
- $\gamma = 0$: Myopic evaluation (only immediate rewards matter)
- $\gamma \to 1$: Far-sighted evaluation (future rewards are important)
- Ensures convergence of infinite horizon problems
The goal in an MDP is to find an optimal policy $\pi^*$ that maximizes the expected cumulative discounted reward:

$$\pi^* = \arg\max_{\pi} \; \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t, s_{t+1})\right]$$
Key Functions:
- Value Function $V^{\pi}(s)$: Expected return starting from state $s$ and following policy $\pi$
- Action-Value Function $Q^{\pi}(s, a)$: Expected return starting from state $s$, taking action $a$, and following $\pi$ thereafter
- Policy $\pi(a \mid s)$: Probability distribution over actions given the current state
The optimal policy and value function are related through the Bellman optimality equations:

$$V^*(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\left[R(s, a, s') + \gamma V^*(s')\right]$$

$$Q^*(s, a) = \sum_{s'} P(s' \mid s, a)\left[R(s, a, s') + \gamma \max_{a'} Q^*(s', a')\right]$$
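To make the notation concrete, a small finite MDP can be represented directly as arrays. The NumPy sketch below (shapes and names are illustrative, not from any particular library) stores $P$ and $R$ as $|\mathcal{S}| \times |\mathcal{A}| \times |\mathcal{S}|$ tensors and applies one Bellman optimality backup to a value vector; the value/policy iteration sketches later reuse this representation.

```python
import numpy as np

def bellman_optimality_backup(P, R, V, gamma=0.99):
    """One Bellman optimality backup for a tabular MDP.

    P: transition tensor, shape (S, A, S), P[s, a, s'] = P(s' | s, a)
    R: reward tensor,     shape (S, A, S), R[s, a, s'] = R(s, a, s')
    V: current value estimates, shape (S,)
    Returns Q (S, A) and the backed-up values max_a Q (S,).
    """
    # Q[s, a] = sum_{s'} P(s'|s, a) * (R(s, a, s') + gamma * V(s'))
    Q = np.einsum("sap,sap->sa", P, R + gamma * V[None, None, :])
    return Q, Q.max(axis=1)

# Tiny random-but-valid example: 3 states, 2 actions
rng = np.random.default_rng(0)
P = rng.random((3, 2, 3)); P /= P.sum(axis=-1, keepdims=True)  # rows sum to 1
R = rng.random((3, 2, 3))
Q, V_new = bellman_optimality_backup(P, R, np.zeros(3))
```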
Value/Policy Iteration
Value iteration and policy iteration are fundamental algorithms in reinforcement learning for solving Markov Decision Processes (MDPs). These algorithms aim to find optimal policies in environments with known transition dynamics and reward functions.
Value Iteration
Value iteration works by iteratively updating state values using the Bellman optimality equation:

$$V_{k+1}(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\left[R(s, a, s') + \gamma V_k(s')\right]$$
where:
- $V_k(s)$ is the value of state $s$ at iteration $k$
- $P(s' \mid s, a)$ is the transition probability from state $s$ to $s'$ under action $a$
- $R(s, a, s')$ is the immediate reward
- $\gamma$ is the discount factor
- $\max_a$ selects the action that maximizes expected future value
The algorithm continues until the value function converges (i.e., the maximum change in value between iterations is below a threshold $\epsilon$). Once converged, the optimal policy can be extracted by:

$$\pi^*(s) = \arg\max_{a} \sum_{s'} P(s' \mid s, a)\left[R(s, a, s') + \gamma V^*(s')\right]$$
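A minimal tabular implementation, assuming the array-based MDP representation introduced above (all names here are illustrative):

```python
import numpy as np

def value_iteration(P, R, gamma=0.99, eps=1e-6):
    """Tabular value iteration; P and R have shape (S, A, S)."""
    S, A, _ = P.shape
    V = np.zeros(S)
    while True:
        # Bellman optimality backup for every state at once
        Q = np.einsum("sap,sap->sa", P, R + gamma * V[None, None, :])
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < eps:   # stop when the largest change is below epsilon
            break
        V = V_new
    pi = Q.argmax(axis=1)                     # greedy policy extraction
    return V, pi
```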
Policy Iteration
Policy iteration alternates between two steps:
Policy Evaluation: For a fixed policy $\pi$, compute the value function by solving:

$$V^{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a)\left[R(s, a, s') + \gamma V^{\pi}(s')\right]$$
This is typically done iteratively until convergence.
Policy Improvement: Update the policy to be greedy with respect to the current value function:

$$\pi'(s) = \arg\max_{a} \sum_{s'} P(s' \mid s, a)\left[R(s, a, s') + \gamma V^{\pi}(s')\right]$$
The algorithm continues alternating between these steps until the policy no longer changes. Policy iteration is guaranteed to converge to the optimal policy in finite MDPs.
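A corresponding sketch for a deterministic tabular policy, again assuming the (S, A, S)-shaped arrays from before; the evaluation step solves the linear system exactly rather than iterating:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.99):
    """Tabular policy iteration for a deterministic policy pi: S -> A."""
    S, A, _ = P.shape
    pi = np.zeros(S, dtype=int)                        # arbitrary initial policy
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) V = r_pi exactly
        P_pi = P[np.arange(S), pi]                     # (S, S): rows P(. | s, pi(s))
        r_pi = (P_pi * R[np.arange(S), pi]).sum(axis=1)  # (S,): expected one-step reward
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
        # Policy improvement: act greedily with respect to the evaluated V
        Q = np.einsum("sap,sap->sa", P, R + gamma * V[None, None, :])
        pi_new = Q.argmax(axis=1)
        if np.array_equal(pi_new, pi):                 # policy stable -> optimal
            return V, pi
        pi = pi_new
```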
Comparison and Practical Considerations
- Value iteration performs cheap backups per iteration but typically needs more iterations to converge
- Policy iteration usually converges in fewer iterations, but each policy evaluation step requires solving a system of linear equations (or sweeping to convergence)
- Modified Policy Iteration (MPI) combines aspects of both by performing partial policy evaluation steps
- In practice, both algorithms may be impractical for large state spaces, leading to approximate versions using function approximation
Q-Learning and Variants
Q-Learning is a model-free reinforcement learning algorithm that learns an action-value function $Q(s, a)$ representing the expected return of taking action $a$ in state $s$. The core update equation is:

$$Q(s, a) \leftarrow Q(s, a) + \alpha\left[r + \gamma \max_{a'} Q(s', a') - Q(s, a)\right]$$
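A tabular sketch of this update with $\epsilon$-greedy exploration; the Gymnasium-style environment interface with discrete state and action spaces is an assumption for illustration:

```python
import numpy as np

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning with epsilon-greedy exploration (Gymnasium-style env)."""
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    for _ in range(episodes):
        s, _ = env.reset()
        done = False
        while not done:
            # epsilon-greedy behaviour policy
            a = env.action_space.sample() if np.random.rand() < epsilon else int(Q[s].argmax())
            s_next, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            # off-policy target bootstraps from the greedy action in s'
            target = r + gamma * (0.0 if terminated else Q[s_next].max())
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```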
Deep Q-Network (DQN) extends this by using neural networks to approximate the Q-function, adding experience replay and target networks for stability.
Policy Gradient Methods
Policy gradient methods directly optimize a parameterized policy $\pi_\theta(a \mid s)$ to maximize the expected return $J(\theta)$. The key insight is the policy gradient theorem:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)\right]$$
Common algorithms include:
- REINFORCE: The simplest policy gradient algorithm, which uses Monte Carlo returns in place of $Q^{\pi_\theta}$ (a minimal update sketch follows this list)
- PPO (Proximal Policy Optimization): Uses clipped surrogate objective for stable updates
- TRPO (Trust Region Policy Optimization): Ensures policy updates stay within trust regions
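A minimal REINFORCE update in PyTorch, assuming log-probabilities and rewards have been collected from one completed episode (function and variable names are illustrative):

```python
import torch

def reinforce_update(optimizer, log_probs, rewards, gamma=0.99):
    """One REINFORCE update from a single finished episode.

    log_probs: list of log pi_theta(a_t | s_t) tensors from the rollout
    rewards:   list of scalar rewards r_t
    """
    # Discounted returns G_t, computed back to front
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)
    # Normalizing returns is a common variance-reduction trick
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)
    # Surrogate loss whose gradient is -sum_t grad log pi(a_t|s_t) * G_t
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```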
Actor-Critic Methods
Actor-critic methods combine policy gradient approaches with value function approximation. The actor (policy) learns to take actions while the critic (value function) evaluates those actions. This architecture addresses key limitations of pure policy gradient and value-based methods:
- Reduced Variance: The critic provides a lower-variance baseline for policy updates compared to Monte Carlo returns
- Online Updates: Unlike REINFORCE, actor-critic methods can learn online without waiting for episode completion
- Structured Exploration: The actor can learn sophisticated exploration strategies while the critic evaluates their effectiveness
The general update equations for a simple one-step actor-critic method are (a code sketch follows):

Critic Update: $w \leftarrow w + \alpha_w\, \delta\, \nabla_w V_w(s)$, where $\delta = r + \gamma V_w(s') - V_w(s)$ is the TD error

Actor Update: $\theta \leftarrow \theta + \alpha_\theta\, \delta\, \nabla_\theta \log \pi_\theta(a \mid s)$
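A per-transition sketch of these two updates in PyTorch, assuming separate actor and critic networks and their optimizers (all names are illustrative):

```python
import torch

def actor_critic_step(actor_opt, critic_opt, log_prob, value, next_value, reward, gamma=0.99):
    """One online actor-critic update for a single transition (s, a, r, s').

    Assumes separate actor and critic networks: `value` = V_w(s) and
    `next_value` = V_w(s') come from the critic, `log_prob` = log pi_theta(a|s)
    from the actor.
    """
    # TD error delta = r + gamma * V(s') - V(s) acts as a one-sample advantage estimate
    td_target = reward + gamma * next_value.detach()
    td_error = td_target - value

    critic_loss = td_error.pow(2).mean()                  # critic: regression toward the TD target
    actor_loss = (-log_prob * td_error.detach()).mean()   # actor: policy gradient weighted by delta

    actor_opt.zero_grad()
    critic_opt.zero_grad()
    (actor_loss + critic_loss).backward()
    actor_opt.step()
    critic_opt.step()
```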
Key algorithms include:
A2C/A3C (Advantage Actor-Critic / Asynchronous Advantage Actor-Critic)
- Uses the advantage function $A(s, a) = Q(s, a) - V(s)$ to reduce variance
- A3C runs multiple parallel actors for better exploration and stability
- Synchronous (A2C) vs asynchronous (A3C) variants trade off between computational efficiency and stability
- Typical architecture uses shared lower layers between actor and critic networks
DDPG (Deep Deterministic Policy Gradient)
- Designed specifically for continuous action spaces
- Uses the deterministic policy gradient theorem: $\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \rho^{\mu}}\left[\nabla_\theta \mu_\theta(s)\, \nabla_a Q(s, a)\big|_{a = \mu_\theta(s)}\right]$
- Employs target networks and experience replay for stability
- Adds exploration noise (typically Ornstein-Uhlenbeck process) to the deterministic policy
- Architecture includes:
- Actor network $\mu_\theta(s)$: a deterministic mapping from states to actions
- Critic network $Q_\phi(s, a)$: estimates the value of state-action pairs
- Target networks $\mu_{\theta'}$ and $Q_{\phi'}$ with soft (Polyak) updates $\theta' \leftarrow \tau \theta + (1 - \tau)\theta'$ (sketched after this list)
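A sketch of the two pieces mentioned above in PyTorch: the Polyak (soft) target update and the bootstrapped critic target. The assumption that the critic takes (state, action) pairs and the actor takes states is illustrative:

```python
import torch

def soft_update(target_net, online_net, tau=0.005):
    """Polyak averaging: theta' <- tau * theta + (1 - tau) * theta'."""
    with torch.no_grad():
        for p_targ, p in zip(target_net.parameters(), online_net.parameters()):
            p_targ.mul_(1.0 - tau).add_(tau * p)

def ddpg_targets(target_actor, target_critic, reward, next_state, done, gamma=0.99):
    """Bootstrapped critic targets y = r + gamma * Q'(s', mu'(s')) for a batch."""
    with torch.no_grad():
        next_action = target_actor(next_state)
        y = reward + gamma * (1.0 - done) * target_critic(next_state, next_action)
    return y
```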
SAC (Soft Actor-Critic)
- Maximizes both expected return and policy entropy
- Objective: $J(\pi) = \mathbb{E}_{\pi}\left[\sum_t r(s_t, a_t) + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big)\right]$
- Uses two Q-functions to mitigate overestimation bias
- Automatically tunes the temperature parameter $\alpha$ to match a target entropy
- Key components:
- Soft value function: $V(s) = \mathbb{E}_{a \sim \pi}\left[Q(s, a) - \alpha \log \pi(a \mid s)\right]$
- Soft Q-function: $Q(s, a) = r(s, a) + \gamma\, \mathbb{E}_{s'}\left[V(s')\right]$ (target computation sketched after this list)
- Policy: a stochastic policy $\pi(a \mid s)$ trained to minimize $\mathbb{E}_{a \sim \pi}\left[\alpha \log \pi(a \mid s) - Q(s, a)\right]$
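A sketch of the soft Q-learning target in PyTorch, combining the entropy bonus with the minimum over two target critics; the interfaces (`policy` returning a sampled action and its log-probability, critics taking state-action pairs) are assumptions:

```python
import torch

def sac_q_targets(policy, q1_targ, q2_targ, reward, next_state, done, alpha=0.2, gamma=0.99):
    """Soft Q targets with clipped double-Q and an entropy bonus."""
    with torch.no_grad():
        next_action, next_logp = policy(next_state)
        # Min over two target critics mitigates overestimation; -alpha * log pi adds the entropy bonus
        q_next = torch.min(q1_targ(next_state, next_action), q2_targ(next_state, next_action))
        y = reward + gamma * (1.0 - done) * (q_next - alpha * next_logp)
    return y
```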
TD3 (Twin Delayed DDPG)
- Addresses overestimation bias in DDPG using double Q-learning
- Delays policy (actor) updates relative to critic updates to limit error accumulation
- Uses clipped double Q-learning and target policy smoothing (see the target sketch after this list)
- Particularly effective for complex continuous control tasks
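A sketch of TD3's critic target, combining target policy smoothing with the clipped double-Q minimum; the network interfaces and hyperparameter values below are illustrative assumptions:

```python
import torch

def td3_targets(target_actor, q1_targ, q2_targ, reward, next_state, done,
                gamma=0.99, noise_std=0.2, noise_clip=0.5, act_limit=1.0):
    """TD3 critic targets with target policy smoothing and clipped double-Q."""
    with torch.no_grad():
        # Target policy smoothing: add clipped Gaussian noise to the target action
        a_next = target_actor(next_state)
        noise = (torch.randn_like(a_next) * noise_std).clamp(-noise_clip, noise_clip)
        next_action = (a_next + noise).clamp(-act_limit, act_limit)
        # Clipped double-Q: bootstrap from the minimum of the two target critics
        q_next = torch.min(q1_targ(next_state, next_action), q2_targ(next_state, next_action))
        y = reward + gamma * (1.0 - done) * q_next
    return y
```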
Implementation considerations:
- Network architectures typically use separate networks for actor and critic
- Learning rates often differ between actor and critic (the critic is commonly trained with a larger learning rate than the actor; see the snippet after this list)
- Gradient clipping and batch normalization can improve stability
- Experience replay buffer size and batch size significantly impact performance
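A small PyTorch snippet illustrating two of these points, separate optimizers with different learning rates and gradient clipping before the optimizer step; the network sizes, learning rates, and the placeholder loss are illustrative only:

```python
import torch
import torch.nn as nn

obs_dim, act_dim = 8, 2                        # illustrative dimensions
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))

# Separate optimizers with different learning rates (values are placeholders)
actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

# Dummy batch and placeholder target, only to show the update pattern
obs = torch.randn(32, obs_dim)
act = actor(obs)
target = torch.zeros(32, 1)
critic_loss = (critic(torch.cat([obs, act.detach()], dim=-1)) - target).pow(2).mean()

critic_opt.zero_grad()
critic_loss.backward()
# Gradient clipping before the step can improve stability
torch.nn.utils.clip_grad_norm_(critic.parameters(), max_norm=1.0)
critic_opt.step()
```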
Advanced RL Algorithms
Modern RL research has produced several sophisticated algorithms that address different challenges in reinforcement learning:
Model-Based RL
- Learns an explicit model of environment dynamics $\hat{p}_\theta(s_{t+1} \mid s_t, a_t)$ (a minimal sketch follows this list)
- Uses learned model for planning and policy optimization
- Key advantages:
- Improved sample efficiency through model-based planning
- Better generalization to new scenarios
- Notable algorithms:
- PETS (Probabilistic Ensembles with Trajectory Sampling)
- MBPO (Model-Based Policy Optimization)
- PlaNet (Deep Planning Network)
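A minimal one-step dynamics model trained by regression on observed transitions, as a PyTorch sketch; note that algorithms such as PETS and MBPO use probabilistic ensembles rather than the single deterministic network shown here:

```python
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    """A one-step dynamics model s_hat' = f(s, a), trained by regression."""
    def __init__(self, obs_dim, act_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, obs_dim),
        )

    def forward(self, state, action):
        # Predict the change in state (a common parameterization) and add it back
        return state + self.net(torch.cat([state, action], dim=-1))

def model_loss(model, state, action, next_state):
    """Mean-squared error between predicted and observed next states."""
    return ((model(state, action) - next_state) ** 2).mean()
```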
Hierarchical RL
- Decomposes complex tasks into hierarchical structure of subtasks
- Uses multiple levels of temporal abstraction
- Components typically include:
- High-level policy (meta-controller): Selects subtasks/goals
- Low-level policy (controller): Executes primitive actions
- Popular frameworks:
- Options Framework
- HAM (Hierarchical Abstract Machines)
- MAXQ Decomposition
Multi-Agent RL
- Handles systems with multiple interacting agents
- Key challenges:
- Non-stationarity due to changing agent behaviors
- Credit assignment across agents
- Coordination and communication
- Major approaches:
- Independent Learning: Each agent learns independently
- Centralized Training with Decentralized Execution (CTDE)
- Fully Centralized: Joint action-value function
- Notable algorithms:
- QMIX: Monotonic value function factorization
- MADDPG: Multi-agent extension of DDPG
- MAPPO: Multi-agent PPO variant
Meta-RL
- Learns to quickly adapt to new tasks using previous experience
- Key components:
- Meta-learner: Learns general learning strategy
- Task-specific learner: Adapts to specific tasks
- Common approaches:
- Recurrent architectures (RL²)
- Gradient-based meta-learning (MAML)
- Context-based adaptation (PEARL)
- Applications:
- Few-shot task adaptation
- Continuous learning and adaptation
- Transfer learning across domains
Offline RL
- Learns optimal policies from fixed datasets without environment interaction
- Addresses challenges like:
- Distribution shift
- Out-of-distribution actions
- Conservative policy updates
- Key algorithms:
- Conservative Q-Learning (CQL)
- Behavior Regularized Actor Critic (BRAC)
- Implicit Q-Learning (IQL)
- Applications:
- Healthcare
- Robotics from demonstrations
- Industrial process control
Each approach offers distinct advantages and faces unique challenges:
- Sample Efficiency: Model-based methods typically require fewer interactions
- Stability: Offline RL provides stable learning from fixed datasets
- Scalability: Multi-agent methods face exponential complexity with agent count
- Generalization: Meta-RL enables quick adaptation but requires diverse training tasks
- Computational Cost: Hierarchical methods reduce search space but increase architectural complexity