In this post, we look at how differentiable simulators work and why they can sometimes be a better choice than reinforcement learning (RL) methods.
What is a Differentiable Simulator?
Imagine a robot in an environment where, in each state $s_t$, the agent can execute an action $a_t$ leading to a subsequent state $s_{t+1}$. We can describe this transition with the function $s_{t+1} = f(s_t, a_t)$. In conventional non-differentiable simulators, this function is often treated as a black box, with observed rewards serving as the primary signal for RL-based learning.
In contrast, a differentiable simulator treats $f$ as an end-to-end differentiable operator. If we define a loss $L$ on the output state $s_{t+1}$, we can compute its gradient with respect to the input state and action, namely $\partial L / \partial s_t$ and $\partial L / \partial a_t$. Applying the chain rule across time steps then lets us optimize an entire sequence of actions.
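To make the chain-rule idea concrete, here is a minimal sketch in plain Python. It is not taken from any particular simulator: the 1D point mass, the constants `DT` and `TARGET`, and all function names are assumptions chosen for illustration. The state is a (position, velocity) pair, the action is a force, and the backward pass is written out by hand so each chain-rule step is visible.

```python
# Toy differentiable simulator: a 1D point mass pushed by a force.
# All names and constants are illustrative assumptions.

DT = 0.1        # time step
TARGET = 1.0    # desired final position


def step(state, action):
    """Transition function: next state from current state (x, v) and action."""
    x, v = state
    return (x + DT * v, v + DT * action)


def rollout_loss(actions, state=(0.0, 0.0)):
    """Roll the simulator forward and return L = (x_T - TARGET)^2."""
    for a in actions:
        state = step(state, a)
    return (state[0] - TARGET) ** 2


def loss_and_grad(actions, state=(0.0, 0.0)):
    """Return L and dL/da_t for every t using a hand-written backward pass."""
    for a in actions:
        state = step(state, a)
    loss = (state[0] - TARGET) ** 2
    # Gradient of L with respect to the final state (x_T, v_T).
    gx, gv = 2.0 * (state[0] - TARGET), 0.0
    grads = [0.0] * len(actions)
    for t in reversed(range(len(actions))):
        grads[t] = gv * DT       # action t only affects the next velocity
        gv = gv + gx * DT        # velocity t affects next position and velocity
        # gx is unchanged: position t only affects the next position
    return loss, grads


def optimize(num_steps=10, iters=100, lr=10.0):
    """Plain gradient descent on the whole action sequence."""
    actions = [0.0] * num_steps
    for _ in range(iters):
        _, grads = loss_and_grad(actions)
        actions = [a - lr * g for a, g in zip(actions, grads)]
    return actions
```

Calling `rollout_loss(optimize())` should give a loss close to zero. In practice one would usually let an automatic differentiation framework produce the backward pass rather than deriving it by hand, and the step size used here only works because this toy loss is a well-conditioned quadratic.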
Why are Differentiable Simulators Effective for Policy Learning?
As Yann LeCun noted at NeurIPS 2016, "If intelligence is a cake, the bulk of the cake is unsupervised learning, the icing on the cake is supervised learning, and the cherry on the cake is reinforcement learning" (source). The relevant point here is that a differentiable simulator supplies a direct, supervised-style gradient with respect to the actions we aim to optimize, whereas policy-gradient RL methods such as REINFORCE must estimate that gradient indirectly from sampled rewards.
For instance, consider a task where the goal is to push an object to a target position with a pusher. An RL agent typically needs extensive exploration, potentially thousands of trials, before it makes significant progress. With a differentiable simulator, each gradient descent step instead moves the actions in a direction that locally reduces the loss, provided the gradient is informative.
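The contrast between a direct gradient and a reward-based estimate can be illustrated with a deliberately tiny, hypothetical version of the pusher task: a single push of distance `a` moves the object from `X0` to `X0 + a`. The sketch below compares the exact derivative with a score-function (REINFORCE-style) estimate under a Gaussian policy. The constants, function names, and the absence of a variance-reducing baseline are all simplifying assumptions.

```python
import random

X0 = 0.0       # initial object position (assumed)
GOAL = 1.0     # target object position (assumed)


def push_loss(a):
    """One-step pusher: the object ends at X0 + a; loss is squared distance to GOAL."""
    return (X0 + a - GOAL) ** 2


def analytic_grad(a):
    """Exact dL/da, the quantity a differentiable simulator exposes directly."""
    return 2.0 * (X0 + a - GOAL)


def score_function_samples(a, sigma=0.1, n=10000, seed=0):
    """Per-sample gradient estimates for a Gaussian policy with mean a.

    Each estimate uses only an observed loss value, never the derivative
    of the simulator itself.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)
        samples.append(push_loss(a + sigma * eps) * eps / sigma)
    return samples


def mean(xs):
    return sum(xs) / len(xs)


def std(xs):
    m = mean(xs)
    return (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5
```

Averaging many samples from `score_function_samples(0.0)` approaches `analytic_grad(0.0)`, but the spread of the individual samples is several times larger than the gradient itself, which is one way to see why reward-only learning can need many trials. Practical RL algorithms reduce this variance with baselines and other techniques, so this is a worst-case illustration rather than a fair benchmark.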
Differentiable simulators have advantages over RL, but they also come with limitations and challenges, such as:
Invalid Gradients in Certain Scenarios: Consider a scenario involving a rigid rolling pin used to flatten dough. If the initial sequence of actions fails to make contact with the dough, the resulting gradient will be zero throughout. To address this, some studies, like PlasticineLab, suggest incorporating a contact loss based on proximity to the target object. Others propose 'softening' rigid tools by increasing their influence radius, allowing objects to be affected without direct contact.
Limited Efficiency in Long-Horizon Tasks: As discussed in this paper, differentiable physics relies on local gradients, which makes long-horizon tasks difficult. The loss landscape in these scenarios is often complex and riddled with misleading local optima, so gradient-based optimization can be unreliable for such tasks.
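The zero-gradient problem from the rolling pin example can be reproduced in a few lines. The sketch below is a one-dimensional caricature, not PlasticineLab's actual formulation: the pin is described only by its height `h`, the dough is flattened to `min(DOUGH_TOP, h)`, and the contact term is a simple proximity penalty `max(0, h - DOUGH_TOP)`. All names and constants are assumptions.

```python
DOUGH_TOP = 1.0      # initial dough height (assumed)
TARGET_HEIGHT = 0.5  # desired dough height (assumed)


def task_grad(h):
    """d/dh of the task loss (min(DOUGH_TOP, h) - TARGET_HEIGHT)^2."""
    if h >= DOUGH_TOP:
        # Pin is above the dough: no contact, so the loss does not depend on h.
        return 0.0
    return 2.0 * (h - TARGET_HEIGHT)


def contact_grad(h):
    """d/dh of the proximity penalty max(0, h - DOUGH_TOP)."""
    return 1.0 if h >= DOUGH_TOP else 0.0


def optimize_height(h=2.0, contact_weight=0.0, lr=0.1, iters=200):
    """Gradient descent on pin height, optionally with the contact penalty."""
    for _ in range(iters):
        h -= lr * (task_grad(h) + contact_weight * contact_grad(h))
    return h
```

With `contact_weight=0.0` the pin never moves from its starting height, because the task gradient is zero wherever there is no contact. With `contact_weight=1.0` the penalty pulls the pin down to the dough, after which the task gradient takes over and the height settles near `TARGET_HEIGHT`.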