Deep Reinforcement Learning through Policy Optimization (Pieter Abbeel and John Schulman)
Reinforcement learning studies an agent interacting with an environment. The agent performs an action on the environment, and the environment returns a new state and a reward.
Now replace the agent by a policy: a probability distribution pi over actions u, conditional on the state s and parameterized by theta. In practice this is a neural network with weights theta that takes the state s as input (sometimes written x) and outputs a distribution over actions u.
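To make the idea of a parameterized policy concrete, here is a minimal sketch. It assumes a discrete action set and a single linear layer followed by a softmax instead of a full neural network; the names `policy_probs` and `sample_action` and the example numbers are made up for illustration.

```python
import math
import random

def softmax(logits):
    """Turn a list of real-valued scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def policy_probs(theta, s):
    """pi_theta(u | s): theta holds one weight row per action (assumed linear policy)."""
    logits = [sum(w * x for w, x in zip(row, s)) for row in theta]
    return softmax(logits)

def sample_action(theta, s, rng):
    """Draw an action index u from pi_theta(. | s)."""
    probs = policy_probs(theta, s)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Hypothetical example: 2-dimensional state, 3 discrete actions.
theta = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
s = [0.5, -0.2]
probs = policy_probs(theta, s)
u = sample_action(theta, s, random.Random(0))
```

Changing theta changes the action probabilities for every state, which is what policy optimization exploits.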
Why policy optimization rather than V- or Q-learning? In Q-learning, choosing an action means solving for the u that maximizes Q(s, u). If the action space is continuous or high-dimensional, this maximization can be challenging (e.g. a robotic grasp).
The RL landscape splits into policy optimization and dynamic programming. The former includes policy gradients; the latter includes policy iteration and value iteration.
Within policy optimization there are several ways to go about the optimization: derivative-free methods, likelihood ratio policy gradients, natural gradients / trust regions, variance reduction of the policy gradient estimate using value functions, pathwise derivatives, stochastic computation graphs, guided policy search and inverse reinforcement learning.
Cross-Entropy Method (CEM)
Denote by U the expected cumulative reward under policy pi with parameters theta. CEM works as follows: for every population member, sample parameters theta from a distribution parameterized by mu. Execute roll-outs with that theta (that is, run the policy in the environment for full episodes, not just a single action) and store the resulting U together with the theta. After running through all population members, update mu to the maximum likelihood estimate over the top p% of the population members, ranked by U.
Benefits of this method: it is simple and easy to implement. It works best when the number of parameters is small.
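The CEM loop can be sketched in a few lines. This is an illustration under assumptions, not the method exactly as presented in the talk: the sampling distribution is taken to be a diagonal Gaussian whose mean and standard deviation are both refit each iteration, and `rollout_return` is a made-up stand-in for running the policy and measuring U.

```python
import random
import statistics

def rollout_return(theta, rng):
    """Hypothetical stand-in for U: a noisy score that peaks at theta = (3, -2)."""
    target = (3.0, -2.0)
    return -sum((t - g) ** 2 for t, g in zip(theta, target)) + rng.gauss(0.0, 0.01)

def cem(n_iters=30, pop_size=50, elite_frac=0.2, seed=0):
    rng = random.Random(seed)
    dim = 2
    mu = [0.0] * dim      # mean of the sampling distribution
    sigma = [5.0] * dim   # per-dimension standard deviation (assumed Gaussian)
    n_elite = max(2, int(pop_size * elite_frac))
    for _ in range(n_iters):
        # Sample one theta per population member and record its return.
        population = []
        for _ in range(pop_size):
            theta = [rng.gauss(m, s) for m, s in zip(mu, sigma)]
            population.append((rollout_return(theta, rng), theta))
        # Keep the top p% of members, ranked by return.
        population.sort(key=lambda pair: pair[0], reverse=True)
        elites = [theta for _, theta in population[:n_elite]]
        # Maximum-likelihood Gaussian fit to the elites; the small constant
        # is an added floor (not part of the MLE) to avoid a zero spread.
        mu = [statistics.fmean(col) for col in zip(*elites)]
        sigma = [statistics.pstdev(col) + 1e-3 for col in zip(*elites)]
    return mu

mu = cem()
```

With these toy settings mu should end up close to the peak at (3, -2). Note that the method only ever looks at the returns, never at gradients.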
Black Box Gradient Computation
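Use finite differences to estimate the gradient of U: perturb each parameter of theta by a small amount, measure the change in U, and divide by the size of the perturbation. Problem: roll-outs are stochastic, so the estimate can be noisy. Solution 1: average over several roll-outs. Solution 2: use the same random seed across trials, so that the perturbed roll-outs all see the same randomness. Example: the influence of wind on a helicopter is stochastic, but with a fixed seed we assume the same wind pattern across trials. A real-world example is “Learning to Hover” with a remote-controlled helicopter.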
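A minimal sketch of the fixed-seed trick, assuming central differences and a toy return function. `noisy_return` is a hypothetical stand-in for a roll-out; its "wind" term is the only source of randomness and is fully determined by the seed.

```python
import random

def noisy_return(theta, seed):
    """Hypothetical roll-out: true U(theta) = -sum(theta_i^2) plus seed-dependent wind."""
    rng = random.Random(seed)
    wind = rng.gauss(0.0, 1.0)  # the same seed reproduces the same disturbance
    return -sum(t * t for t in theta) + wind

def finite_difference_gradient(theta, eps=1e-3, seed=0):
    """Central differences; both perturbed roll-outs share one seed."""
    grad = []
    for i in range(len(theta)):
        plus = list(theta)
        plus[i] += eps
        minus = list(theta)
        minus[i] -= eps
        diff = noisy_return(plus, seed) - noisy_return(minus, seed)
        grad.append(diff / (2 * eps))
    return grad

theta = [1.0, -2.0]
grad_same_seed = finite_difference_gradient(theta, seed=42)
```

Because both roll-outs use the same seed, the wind term cancels in the difference and the estimate matches the true gradient (-2 * theta) up to rounding. With different seeds per roll-out, the wind difference would be divided by 2 * eps and would swamp the signal.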
Likelihood Ratio Policy Gradient
We rewrite the objective in terms of paths tau, where a path (trajectory) is the sequence of state-action pairs visited during a roll-out.
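For orientation, the standard form of the likelihood ratio gradient is sketched below. This is the textbook identity, not something these notes have covered yet; R(tau) for the total reward along a path, P(tau; theta) for the probability of a path under the policy, and m for the number of sampled paths are notation assumed here.

```latex
U(\theta) = \sum_{\tau} P(\tau;\theta)\, R(\tau)

\nabla_\theta U(\theta)
  = \sum_{\tau} \nabla_\theta P(\tau;\theta)\, R(\tau)
  = \sum_{\tau} P(\tau;\theta)\, \nabla_\theta \log P(\tau;\theta)\, R(\tau)
  \approx \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta \log P(\tau^{(i)};\theta)\, R(\tau^{(i)})
```

The middle step uses the identity grad P = P * grad log P, which turns the sum into an expectation that can be estimated from sampled roll-outs.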
To be continued…