The topic we chose

N-step bootstrapping in actor-critic methods.

Motivation and research question

In this project, we study n-step bootstrapping in actor-critic methods; more specifically, we study Advantage Actor-Critic (A2C).

N-step bootstrapping

N-step bootstrapping, or n-step TD, is an important technique in Reinforcement Learning that performs updates based on an intermediate number of rewards. In this view, n-step bootstrapping unifies and generalizes Monte Carlo (MC) methods and Temporal Difference (TD) methods. At one extreme, when N=1, it is equivalent to one-step TD; at the other extreme, when N=\infty, i.e., taking as many steps as needed to reach the end of the episode, it becomes MC. As a result, n-step bootstrapping combines the advantages of Monte Carlo and one-step TD. Compared to one-step TD, n-step bootstrapping often converges faster because it updates with more real reward information and is freed from the "tyranny of the time step". Compared to MC, the updates do not have to wait until the end of the episode, and the return estimates have lower variance. In general, across different problems and situations, a suitable N often yields faster and more stable learning.
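As a concrete illustration, a minimal sketch in plain Python (the `rewards`/`values` lists here are hypothetical, not tied to any particular library): the n-step return reduces to the one-step TD target when n = 1, and to the Monte Carlo return when n covers the rest of the episode.

```python
def n_step_return(rewards, values, t, n, gamma=0.99):
    """n-step return G_{t:t+n} = r_t + gamma*r_{t+1} + ... + gamma^{n-1}*r_{t+n-1}
    + gamma^n * V(s_{t+n}), truncated at the end of the episode.

    rewards: all rewards of one episode, r_0 ... r_{T-1}
    values:  value estimates V(s_0) ... V(s_{T-1}); the terminal value is 0
    """
    end = min(t + n, len(rewards))
    # Sum the n (or fewer, near episode end) real rewards.
    g = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    # Bootstrap from the value estimate only if the episode has not ended.
    if end < len(rewards):
        g += gamma ** (end - t) * values[end]
    return g
```

With `gamma=1` and `n=1` this gives the one-step TD target `r_t + V(s_{t+1})`; a large `n` simply sums all remaining rewards, i.e., the Monte Carlo return.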

Figure1: Diagrams for n-step bootstrapping.

Advantage Actor Critic (A2C)

Actor-Critic algorithms are a powerful family of learning algorithms within the policy-based framework of Reinforcement Learning. They are composed of an actor, the policy that makes decisions, and a critic, the value function that evaluates whether a decision is good. With the critic's assistance, the actor usually achieves better performance, for example by reducing the gradient variance of vanilla policy gradients. In the GAE paper, Schulman et al. unified the framework for advantage estimation; among the GAE variants, we picked A2C, considering the impressive performance of A3C and the fact that A2C is a simplified, synchronous version of A3C with equivalent performance.

N-step bootstrapping for A2C

A2C is an online algorithm that uses roll-outs of size n + 1 of the current policy to perform a policy improvement step. In order to train the policy head, an approximation of the policy gradient is computed for each state of the roll-out \left(x_{t+i}, a_{t+i} \sim \pi\left(\cdot | x_{t+i} ; \theta_{\pi}\right), r_{t+i}\right)_{i=0}^{n}:
\nabla_{\theta_{\pi}} \log \left(\pi\left(a_{t+i} | x_{t+i} ; \theta_{\pi}\right)\right)\left[\hat{Q}_{i}-V\left(x_{t+i} ; \theta_{V}\right)\right]
where \hat{Q}_{i} is an estimation of the return, \hat{Q}_{i}=\sum_{j=i}^{n-1} \gamma^{j-i} r_{t+j}+\gamma^{n-i} V\left(x_{t+n} ; \theta_{V}\right). The gradients are then summed to obtain the cumulative gradient of the roll-out:
\sum_{i=0}^{n} \nabla_{\theta_{\pi}} \log \left(\pi\left(a_{t+i} | x_{t+i} ; \theta_{\pi}\right)\right)\left[\hat{Q}_{i}-V\left(x_{t+i} ; \theta_{V}\right)\right]
A2C trains the value head by minimising the squared error between the estimated return and the value, \sum_{i=0}^{n}\left(\hat{Q}_{i}-V\left(x_{t+i} ; \theta_{V}\right)\right)^{2}. Therefore, the network parameters \left(\theta_{\pi}, \theta_{V}\right) are updated after each roll-out as follows:
\begin{array}{l}{\theta_{\pi} \leftarrow \theta_{\pi}+\alpha_{\pi} \sum_{i=0}^{n} \nabla_{\theta_{\pi}} \log \left(\pi\left(a_{t+i} | x_{t+i} ; \theta_{\pi}\right)\right)\left[\hat{Q}_{i}-V\left(x_{t+i} ; \theta_{V}\right)\right]} \\ {\theta_{V} \leftarrow \theta_{V}-\alpha_{V} \sum_{i=0}^{n} \nabla_{\theta_{V}}\left[\hat{Q}_{i}-V\left(x_{t+i} ; \theta_{V}\right)\right]^{2}}\end{array}
where \left(\alpha_{\pi}, \alpha_{V}\right) are the learning rates of the policy head and the value head, respectively.
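The targets \hat{Q}_{i} satisfy the recursion \hat{Q}_{i} = r_{t+i} + \gamma \hat{Q}_{i+1} with \hat{Q}_{n} = V(x_{t+n}; \theta_{V}), so they can all be computed in one backward pass over the roll-out. A minimal sketch in plain Python (the function and list names are our own, for illustration):

```python
def nstep_targets(rewards, bootstrap_value, gamma=0.99):
    """Return [Q^_0, ..., Q^_n] for a roll-out with rewards r_t ... r_{t+n-1}
    and bootstrap value V(x_{t+n}), via the recursion Q^_i = r_{t+i} + gamma * Q^_{i+1}."""
    q = bootstrap_value
    targets = [q]                  # Q^_n = V(x_{t+n})
    for r in reversed(rewards):    # walk backwards through the roll-out
        q = r + gamma * q
        targets.append(q)
    targets.reverse()
    return targets

def advantages(targets, values):
    """Advantages Q^_i - V(x_{t+i}), which weight the log-policy gradients."""
    return [q - v for q, v in zip(targets, values)]
```

The policy head is then pushed in the direction of each advantage times the corresponding log-policy gradient, while the value head is regressed toward `targets`.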

Experiments

For this project, our main goal is to compare the performance of the n-step bootstrapping variation of A2C with its Monte Carlo and 1-step variations. Therefore, we do not seek to deliver an ultimate agent that can solve complicated games.

The experiments are designed with classical control problems, e.g. InvertedPendulum, CartPole, Acrobot, and MountainCar. For the sake of implementation convenience, we use the off-the-shelf environments provided by \href{https://gym.openai.com/}{OpenAI's Gym library}. Under the category \textit{Classical Control}, we picked two environments with discrete action spaces: \textbf{CartPole-v0} and \textbf{Acrobot-v1}.

The \textbf{CartPole-v0} environment contains a pole attached by an un-actuated joint to a cart in a 2D plane. The cart moves left/right along a frictionless track. The goal is to keep the pole upright by indirectly influencing the velocity of the cart. The environment has 4 continuous observations: \textit{Cart Position}, \textit{Cart Velocity}, \textit{Pole Angle} and \textit{Pole Velocity At Tip}, each within its corresponding range. The action space consists of two discrete actions, \textit{Push Left} and \textit{Push Right}, which change the \textit{Cart Velocity} by an unknown amount in the corresponding direction. In the starting state, all observations are initialized with small uniform random values. An episode terminates when the pole angle or the cart position exceeds a certain threshold, or when the episode length exceeds 200 steps. The reward is 1 for every step taken, reflecting how long the agent keeps the pole stabilized.

The \textbf{Acrobot-v1} environment contains two joints and two links in a 2D plane. Only the joint between the two links is actuated by applying torque. The goal is to swing the end of the lower link above a certain height. The environment has 6 continuous observations: the \cos and \sin of the two joint angles, and the two joint angular velocities. The action space consists of three discrete actions, \textit{apply +1 torque}, \textit{apply 0 torque} and \textit{apply -1 torque}, on the actuated joint. In the starting state, both links simply hang downwards. An episode terminates when the end of the lower link reaches the height threshold. The reward is -1 for every step taken, encouraging the agent to terminate as soon as possible.

Experiment Design

In this project, we conduct experiments to compare the performance of the n-step bootstrapping variation of A2C with its Monte Carlo and 1-step variations.

To reduce the effect of stochasticity, we compare the three variants on both environments and for three different random seeds. Furthermore, to compare the nuances between different numbers of steps (roll-out lengths) of n-step bootstrapping, we define 6 different roll-out lengths for each environment, given their different termination conditions, covering the range from 1-step to max-step (or equivalently, Monte Carlo). We also fix the total number of training steps for each method for a fair comparison, and evaluate the agent every 100 training steps. For evaluation, we run the agent for 50 episodes and record the total reward per episode, from which we later compute the mean and standard deviation per evaluation step. The hyperparameter settings for the experiments are shown in the table below.

Table1: Hyperparameter settings of the experiments.
Environment  Train Steps  Eval Interval  Eval Episodes  Random Seeds  Roll-out Lengths
CartPole-v0  10000        100            50             [42, 36, 15]  [1, 10, 40, 80, 150, 200]
Acrobot-v1   10000        100            50             [42, 36, 15]  [1, 30, 60, 100, 300, 500]
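The evaluation protocol described above could be sketched as follows, where `run_episode` stands for a hypothetical function that rolls out the current agent once and returns the episode's total reward:

```python
import statistics

def evaluate(run_episode, n_episodes=50):
    """Run the agent for n_episodes and summarise the total rewards,
    as done every 100 training steps in our experiments."""
    returns = [run_episode() for _ in range(n_episodes)]
    return statistics.mean(returns), statistics.stdev(returns)
```

The mean/standard-deviation pair collected at each evaluation step is the per-step statistic reported in the results.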

Results

The results are shown in Figure 2. First, let us focus on 1-step TD. For all three seeds, 1-step TD fails to learn, plateauing at average rewards of around 10 during training. As N increases, starting from N=10 (orange line), the agent begins to learn, but its behavior across the three seeds is rather unstable, and it fails to converge during or after training. However, as N continues to increase, in the cases of N=40, N=80 and N=150, the agent learns quickly and shows stable, convergent behavior. Finally, although MC is also able to converge, it shows much more variance than N=40, 80 or 150, as seen in the greater fluctuation of the brown line.

Figure2: Results on CartPole-v0 with seeds 15, 36 and 42.

Conclusion

In this project, we investigated how different values of N for n-step bootstrapping affect the learning behavior of an A2C agent. We showed that with N=1, i.e., one-step TD, the A2C agents are not capable of learning, while with MC the agents do learn but exhibit more volatile and unstable behavior. Overall, our experiments show that n-step bootstrapping achieves superior performance compared to one-step TD and MC, with more stability and flexibility, and that choosing an appropriate N can be vital across different applications and problems in Reinforcement Learning.

References

Schulman, J., Moritz, P., Levine, S., et al. High-Dimensional Continuous Control Using Generalized Advantage Estimation. arXiv preprint arXiv:1506.02438, 2015.
