Hung-yi Lee Deep Reinforcement Learning: Sparse Reward

Hung-yi Lee's Deep Reinforcement Learning course: https://www.bilibili.com/video/av24724071

Hung-yi Lee Deep Reinforcement Learning Notes (1): Outline
Hung-yi Lee Deep Reinforcement Learning Notes (2): Proximal Policy Optimization (PPO)
Hung-yi Lee Deep Reinforcement Learning Notes (3): Q-Learning
Hung-yi Lee Deep Reinforcement Learning Notes (4): Actor-Critic
Hung-yi Lee Deep Reinforcement Learning Notes (6): Imitation Learning
Hung-yi Lee Deep Reinforcement Learning course slides

Reward Shaping

When the reward is too sparse, it is quite difficult for the machine to learn how to act.
Thus, reward shaping is used: extra, hand-designed rewards are added to guide the actor toward the behavior you want.
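A minimal sketch of the idea (the function name, the distance-based bonus, and the weight are illustrative choices, not from the lecture): the sparse environment reward is augmented with a dense term that rewards progress toward the goal.

```python
# Sketch of reward shaping: augment a sparse environment reward with a
# hand-designed dense bonus (here, progress toward a goal position).

def shaped_reward(env_reward, state, next_state, goal, weight=0.1):
    """Return the environment reward plus a distance-based shaping term."""
    def dist(s):
        return sum((a - b) ** 2 for a, b in zip(s, goal)) ** 0.5
    # Positive bonus when the agent moves closer to the goal.
    progress = dist(state) - dist(next_state)
    return env_reward + weight * progress

# The sparse reward is 0, but moving from (2, 2) toward goal (0, 0) earns a bonus.
r = shaped_reward(0.0, state=(2.0, 2.0), next_state=(1.0, 1.0), goal=(0.0, 0.0))
```

Even on steps where the environment returns nothing, the actor now gets a learning signal.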
Curiosity-based method:
Add an intrinsic curiosity module (ICM) that takes $s_t, a_t, s_{t+1}$ as input and produces a new kind of reward $r^i$; the actor's objective then becomes maximizing the total of the extrinsic reward $r$ and the intrinsic reward $r^i$.
The ICM (intrinsic curiosity module) generates the actor's curiosity; its network structure is shown below:
(figure: ICM network architecture)
In this model, network 1 and network 2 are two networks trained separately.
Network 1: takes $a_t$ and the extracted feature of $s_t$ as input, and outputs an estimate of $s_{t+1}$'s extracted feature. The true feature extracted from $s_{t+1}$ is then compared with the estimate to obtain the difference.
The bigger the difference, the larger the reward $r^i$, which means the model encourages the actor to take risks, i.e. to visit states it cannot yet predict.
Network 2: trained to extract features that are relevant to actions. It takes the $\phi$ values of $s_t$ and $s_{t+1}$ as input and outputs an estimated action $\hat{a}_t$; if the estimated action is close to the true $a_t$, then $\phi$ is extracting the useful (action-relevant) features.
Tip: without network 2, a large reward from network 1 only means that $s_{t+1}$ is hard to predict, so the model encourages risk-taking; but some hard-to-predict states may simply be unimportant (e.g., random noise in the environment). That is why network 2 is needed.
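The data flow above can be sketched as follows. This is only a sketch of the reward computation, not a trainable implementation: fixed random matrices stand in for the feature extractor $\phi$, the forward model (network 1), and the inverse model (network 2), and all dimensions are illustrative.

```python
import numpy as np

# Sketch of the ICM's data flow: phi extracts features, network 1 (forward
# model) predicts the next feature, network 2 (inverse model) predicts the
# action. In the real module these are trained neural networks; here they
# are fixed random linear maps, just to show the shapes and the reward.
rng = np.random.default_rng(0)
STATE_DIM, FEAT_DIM, N_ACTIONS = 8, 4, 3

W_phi = rng.normal(size=(STATE_DIM, FEAT_DIM))              # feature extractor phi
W_fwd = rng.normal(size=(FEAT_DIM + N_ACTIONS, FEAT_DIM))   # network 1: forward model
W_inv = rng.normal(size=(2 * FEAT_DIM, N_ACTIONS))          # network 2: inverse model

def phi(s):
    return s @ W_phi

def intrinsic_reward(s_t, a_t, s_next):
    """r^i = prediction error of the forward model in feature space."""
    a_onehot = np.eye(N_ACTIONS)[a_t]
    pred_feat = np.concatenate([phi(s_t), a_onehot]) @ W_fwd  # network 1 output
    true_feat = phi(s_next)
    return np.sum((pred_feat - true_feat) ** 2)  # bigger error -> bigger r^i

def inverse_logits(s_t, s_next):
    """Network 2: predict a_t from phi(s_t) and phi(s_{t+1}). Training it
    forces phi to keep only action-relevant features."""
    return np.concatenate([phi(s_t), phi(s_next)]) @ W_inv

s_t, s_next = rng.normal(size=STATE_DIM), rng.normal(size=STATE_DIM)
r_i = intrinsic_reward(s_t, a_t=1, s_next=s_next)  # added to the extrinsic reward r
```

In the trained module, the forward-model error gives $r^i$, while the inverse model's action-prediction loss shapes $\phi$ so that unpredictable-but-irrelevant noise does not inflate the reward.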

Curriculum Learning

Means learning tasks in order from easy to difficult (a curriculum schedule for the machine).
Reverse Curriculum Generation:
Given a goal state $s_g$ --> sample some states $s_1$ "close" to $s_g$ --> start a trajectory from each $s_1$; each trajectory yields a reward $R(s_1)$ --> delete the $s_1$ whose reward is too large (already learned) or too small (too difficult at this moment) --> sample $s_2$ near the remaining $s_1$, start from $s_2$, and so on.
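One iteration of that pipeline can be sketched as below, on a 1-D toy task. The noise scale, reward thresholds, and the stand-in episode function are illustrative assumptions, not values from the lecture.

```python
import random

# Sketch of one reverse-curriculum iteration on a 1-D task: expand the set of
# start states outward from the goal, keeping only states of medium difficulty.

def reverse_curriculum_step(start_states, run_episode,
                            r_min=0.1, r_max=0.9, n_new=10, noise=0.5):
    """Sample new start states near the current ones and filter by reward."""
    # 1. Sample candidate start states "close" to the current ones.
    candidates = [s + random.gauss(0, noise)
                  for s in start_states for _ in range(n_new)]
    # 2. Run a trajectory from each candidate and record its return R(s).
    scored = [(s, run_episode(s)) for s in candidates]
    # 3. Delete states whose reward is too large (already learned) or too
    #    small (too difficult at this moment).
    return [s for s, R in scored if r_min <= R <= r_max]

goal = 0.0
# Stand-in episode: starting closer to the goal yields a higher return.
run_episode = lambda s: max(0.0, 1.0 - abs(s - goal))
frontier = reverse_curriculum_step([goal], run_episode)
```

Repeating the step with `frontier` as the new start set pushes the curriculum gradually further from the goal.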

Hierarchical Reinforcement Learning

If the lower-level agent cannot achieve the goal, the upper-level agent is penalized. (The upper agent sends its wish, i.e. a goal, to the lower agent.)
If an agent reaches the wrong goal, assume the reached goal was the intended one all along. (What has been achieved is not wasted.)
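The second trick above can be sketched as a goal-relabeling step (the function and state names are illustrative; the lecture describes the idea, not this code): when the lower agent ends up somewhere other than the requested goal, the trajectory is relabeled as if that final state had been the goal, so the experience still provides a useful training signal.

```python
# Sketch of goal relabeling in hierarchical RL: pretend the achieved state
# was the goal the upper agent had asked for, so the trajectory is not wasted.

def relabel(trajectory, intended_goal):
    """Pair each state with the goal actually achieved, if it differs."""
    achieved_goal = trajectory[-1]  # final state the lower agent reached
    if achieved_goal != intended_goal:
        # Wrong goal reached: relabel as if it had been the intended one.
        return [(s, achieved_goal) for s in trajectory]
    return [(s, intended_goal) for s in trajectory]

traj = ["s0", "s1", "s2"]
relabeled = relabel(traj, intended_goal="g")  # the goal becomes "s2"
```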
