Reference books:
Reinforcement Learning: State-of-the-Art
Reinforcement Learning: An Introduction (Sutton & Barto)
Previous post:
https://blog.csdn.net/Jinyindao243052/article/details/107126041

On-policy Approximation of Action Values

Why use a neural network to estimate the Q-value function
We have so far assumed that our estimates of value functions are represented as a table with one entry for each state or for each state-action pair. But this approach is limited to tasks with small numbers of states and actions: the problem is not just the memory needed for large tables, but the time and data needed to fill them accurately.


Gradient-Descent Methods

Two methods for gradient-based function approximation have been used widely in reinforcement learning. One is multilayer artificial neural networks using the error backpropagation algorithm.
This maps immediately onto the equations and algorithms just given, where the backpropagation process is the way of computing the gradients. The second popular form is the linear form, which we discuss extensively in the next section.
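As a concrete sketch of the first approach (not from the original post; the architecture, input size, and step size are arbitrary assumptions), here is a one-hidden-layer network approximating a scalar value from a feature vector, with the gradients computed by backpropagation:

```python
import numpy as np

# Hypothetical example: a tanh hidden layer and a linear output, trained by
# stochastic gradient descent on the squared prediction error.
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.5, size=(8, 4))    # hidden weights, 4 input features
b1 = np.zeros(8)
W2 = rng.normal(scale=0.5, size=8)         # output weights (scalar value)
b2 = 0.0

def forward(s):
    """Return the value estimate and the hidden activations."""
    h = np.tanh(W1 @ s + b1)
    return W2 @ h + b2, h

def sgd_update(s, target, alpha=0.05):
    """One gradient-descent step on the squared error (target - v_hat)^2."""
    global W1, b1, W2, b2
    v_hat, h = forward(s)
    delta = target - v_hat                 # prediction error
    grad_h = delta * W2 * (1 - h ** 2)     # backprop through the tanh layer
    W2 += alpha * delta * h                # gradient ascent on -error
    b2 += alpha * delta
    W1 += alpha * np.outer(grad_h, s)
    b1 += alpha * grad_h

s = np.array([1.0, 0.0, 0.5, -0.5])
for _ in range(200):
    sgd_update(s, target=2.0)              # fit one example toward target 2.0
print(forward(s)[0])                       # close to 2.0 after training
```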

multilayer artificial neural networks

try to minimize error on the observed examples:

$$\theta_{t+1} = \theta_{t} + \alpha\left[v_\pi(S_{t}) - \hat{v}(S_{t},\theta_{t})\right]\nabla_{\theta_{t}}\hat{v}(S_{t},\theta_{t})$$
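A tiny numerical illustration of this update (the two-parameter value function and all numbers are made-up assumptions, not from the original post):

```python
import numpy as np

# Hypothetical parameterization v_hat(s, theta) = theta[0] + theta[1] * s,
# whose gradient with respect to theta is (1, s).
theta = np.zeros(2)
alpha = 0.1
s, v_true = 2.0, 3.0               # one observed state and its true value

grad = np.array([1.0, s])          # gradient of v_hat at this state
v_hat = theta @ grad               # current estimate: theta[0] + theta[1]*s
theta = theta + alpha * (v_true - v_hat) * grad
print(theta)                       # [0.3 0.6]: estimate moved toward 3.0
```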
We turn now to the case in which the target output, $V_t$, of the $t$th training example, $S_t \mapsto V_t$, is not the true value, $v_\pi(S_t)$, but some (possibly random) approximation of it. With the Monte Carlo target $V_t = G_t$, the update becomes

$$\theta_{t+1} = \theta_{t} + \alpha\left[G_{t} - \hat{v}(S_{t},\theta_{t})\right]\nabla_{\theta_{t}}\hat{v}(S_{t},\theta_{t})$$
The return, $G_t$, is defined as some specific function of the reward sequence. In the simplest case the return is the sum of the rewards:

$$G_t = R_{t+1} + R_{t+2} + \cdots + R_T$$

where $T$ is the final time step of the episode. More generally, future rewards are discounted by a factor $\gamma$:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty}\gamma^k R_{t+k+1}$$
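Computed backwards from the end of the episode, each return satisfies the recursion $G_t = R_{t+1} + \gamma G_{t+1}$. A small sketch with a made-up reward sequence:

```python
# Computing G_t for every time step of a finished episode, using the backward
# recursion G_t = R_{t+1} + gamma * G_{t+1}. The reward sequence is made up.
def returns(rewards, gamma):
    G = 0.0
    out = []
    for r in reversed(rewards):    # walk the episode backwards
        G = r + gamma * G
        out.append(G)
    return out[::-1]

print(returns([1, 0, 0, 2], gamma=1.0))   # [3.0, 2.0, 2.0, 2.0]
print(returns([1, 0, 0, 2], gamma=0.9))   # discounted: later rewards count less
```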
As $\gamma$ approaches 1, the objective takes future rewards into account more strongly.

The values $R_{t+1}, R_{t+2}, \dots$ are only available once an episode has finished, which is why Monte Carlo methods require complete episodes.
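Putting the update rule and the returns together gives gradient Monte Carlo prediction: run a complete episode, compute each $G_t$, and step the parameters toward it. A minimal sketch on a made-up one-step environment (the environment, one-hot features, and step size are illustrative assumptions, not from the original post):

```python
import random
import numpy as np

# Toy environment: from state 0 or 1 the episode ends immediately with
# reward state + 1, so the true values are v(0) = 1 and v(1) = 2.
def run_episode():
    s = random.randint(0, 1)
    return [(s, s + 1.0)]          # list of (state, reward) pairs

random.seed(0)
theta = np.zeros(2)                # one parameter per state (one-hot features)
alpha, gamma = 0.1, 1.0
for _ in range(500):
    episode = run_episode()
    G = 0.0
    for s, r in reversed(episode): # compute each G_t backwards
        G = r + gamma * G
        x = np.eye(2)[s]           # one-hot feature vector = gradient of v_hat
        theta += alpha * (G - theta @ x) * x
print(theta.round(2))              # approaches the true values [1.0, 2.0]
```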

linear form

In the linear form, the approximate value function is a linear function of the parameter vector $\theta$ and a feature vector $x(s)$ describing state $s$:

$$\hat{v}(s,\theta) = \theta^\top x(s) = \sum_{i=1}^{n}\theta_i x_i(s)$$

so the gradient of the approximate value function with respect to $\theta$ is simply $\nabla_\theta \hat{v}(s,\theta) = x(s)$.
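A quick numerical sketch of the linear case, where the gradient with respect to $\theta$ is just the feature vector (the features and all numbers here are made-up examples):

```python
import numpy as np

# Hypothetical illustration: linear value estimate and one gradient step.
theta = np.array([0.5, -0.2, 1.0])
x_s = np.array([1.0, 0.0, 2.0])    # feature vector x(s) for some state s

v_hat = theta @ x_s                # 0.5*1 - 0.2*0 + 1.0*2 = 2.5
grad = x_s                         # gradient of v_hat w.r.t. theta is x(s)
theta = theta + 0.1 * (3.0 - v_hat) * grad   # one step toward target 3.0
print(v_hat, theta)
```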
