Reference books:
Reinforcement Learning: State-of-the-Art
Reinforcement Learning: An Introduction (Sutton & Barto)
Previous post:
https://blog.csdn.net/Jinyindao243052/article/details/107126041

On-policy Approximation of Action Values

Why use a neural network to estimate the Q-value function
We have so far assumed that our estimates of value functions are represented as a table with one entry for each state or for each state-action pair. But this approach is limited to tasks with small numbers of states and actions: the problem is not just the memory needed for large tables, but the time and data needed to fill them accurately.


Gradient-Descent Methods

Two methods for gradient-based function approximation have been used widely in reinforcement learning. One is multilayer artificial neural networks using the error backpropagation algorithm.
This maps immediately onto the equations and algorithms just given, where the backpropagation process is the way of computing the gradients. The second popular form is the linear form, which we discuss extensively in the next section.
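As a concrete sketch of the first approach (not from the original post; the architecture, input size, and step size are arbitrary assumptions), here is a one-hidden-layer network approximating a scalar value from a feature vector, with the gradients computed by backpropagation:

```python
import numpy as np

# Hypothetical example: a tanh hidden layer and a linear output, trained by
# stochastic gradient descent on the squared prediction error.
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.5, size=(8, 4))    # hidden weights, 4 input features
b1 = np.zeros(8)
W2 = rng.normal(scale=0.5, size=8)         # output weights (scalar value)
b2 = 0.0

def forward(s):
    """Return the value estimate and the hidden activations."""
    h = np.tanh(W1 @ s + b1)
    return W2 @ h + b2, h

def sgd_update(s, target, alpha=0.05):
    """One gradient-descent step on the squared error (target - v_hat)^2."""
    global W1, b1, W2, b2
    v_hat, h = forward(s)
    delta = target - v_hat                 # prediction error
    grad_h = delta * W2 * (1 - h ** 2)     # backprop through the tanh layer
    W2 += alpha * delta * h                # gradient ascent on -error
    b2 += alpha * delta
    W1 += alpha * np.outer(grad_h, s)
    b1 += alpha * grad_h

s = np.array([1.0, 0.0, 0.5, -0.5])
for _ in range(200):
    sgd_update(s, target=2.0)              # fit one example toward target 2.0
print(forward(s)[0])                       # close to 2.0 after training
```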

multilayer artificial neural networks

try to minimize error on the observed examples:

$$\theta_{t+1} = \theta_{t} + \alpha\left[v_\pi(S_{t}) - \hat{v}(S_{t},\theta_{t})\right]\nabla_{\theta_{t}}\hat{v}(S_{t},\theta_{t})$$
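A tiny numerical illustration of this update (the two-parameter value function and all numbers are made-up assumptions, not from the original post):

```python
import numpy as np

# Hypothetical parameterization v_hat(s, theta) = theta[0] + theta[1] * s,
# whose gradient with respect to theta is (1, s).
theta = np.zeros(2)
alpha = 0.1
s, v_true = 2.0, 3.0               # one observed state and its true value

grad = np.array([1.0, s])          # gradient of v_hat at this state
v_hat = theta @ grad               # current estimate: theta[0] + theta[1]*s
theta = theta + alpha * (v_true - v_hat) * grad
print(theta)                       # [0.3 0.6]: estimate moved toward 3.0
```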
We turn now to the case in which the target output, $V_t$, of the $t$th training example, $S_t \mapsto V_t$, is not the true value, $v_\pi(S_t)$, but some (possibly random) approximation of it. With the Monte Carlo target $V_t = G_t$, the update becomes

$$\theta_{t+1} = \theta_{t} + \alpha\left[G_{t} - \hat{v}(S_{t},\theta_{t})\right]\nabla_{\theta_{t}}\hat{v}(S_{t},\theta_{t})$$
The return, $G_t$, is defined as some specific function of the reward sequence. In the simplest case the return is the sum of the rewards:

$$G_t = R_{t+1} + R_{t+2} + \cdots + R_T$$

where $T$ is the final time step of the episode. More generally, future rewards are discounted by a factor $\gamma$:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty}\gamma^k R_{t+k+1}$$
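Computed backwards from the end of the episode, each return satisfies the recursion $G_t = R_{t+1} + \gamma G_{t+1}$. A small sketch with a made-up reward sequence:

```python
# Computing G_t for every time step of a finished episode, using the backward
# recursion G_t = R_{t+1} + gamma * G_{t+1}. The reward sequence is made up.
def returns(rewards, gamma):
    G = 0.0
    out = []
    for r in reversed(rewards):    # walk the episode backwards
        G = r + gamma * G
        out.append(G)
    return out[::-1]

print(returns([1, 0, 0, 2], gamma=1.0))   # [3.0, 2.0, 2.0, 2.0]
print(returns([1, 0, 0, 2], gamma=0.9))   # discounted: later rewards count less
```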
As $\gamma$ approaches 1, the objective takes future rewards into account more strongly.

The values $R_{t+1}, R_{t+2}, \dots$ are only available once an episode has finished, which is why Monte Carlo methods require complete episodes.
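Putting the update rule and the returns together gives gradient Monte Carlo prediction: run a complete episode, compute each $G_t$, and step the parameters toward it. A minimal sketch on a made-up one-step environment (the environment, one-hot features, and step size are illustrative assumptions, not from the original post):

```python
import random
import numpy as np

# Toy environment: from state 0 or 1 the episode ends immediately with
# reward state + 1, so the true values are v(0) = 1 and v(1) = 2.
def run_episode():
    s = random.randint(0, 1)
    return [(s, s + 1.0)]          # list of (state, reward) pairs

random.seed(0)
theta = np.zeros(2)                # one parameter per state (one-hot features)
alpha, gamma = 0.1, 1.0
for _ in range(500):
    episode = run_episode()
    G = 0.0
    for s, r in reversed(episode): # compute each G_t backwards
        G = r + gamma * G
        x = np.eye(2)[s]           # one-hot feature vector = gradient of v_hat
        theta += alpha * (G - theta @ x) * x
print(theta.round(2))              # approaches the true values [1.0, 2.0]
```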

linear form

In the linear form, the approximate value function is a linear function of the parameter vector $\theta$ and a feature vector $x(s)$ describing state $s$:

$$\hat{v}(s,\theta) = \theta^\top x(s) = \sum_{i=1}^{n}\theta_i x_i(s)$$

so the gradient of the approximate value function with respect to $\theta$ is simply $\nabla_\theta \hat{v}(s,\theta) = x(s)$.
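A quick numerical sketch of the linear case, where the gradient with respect to $\theta$ is just the feature vector (the features and all numbers here are made-up examples):

```python
import numpy as np

# Hypothetical illustration: linear value estimate and one gradient step.
theta = np.array([0.5, -0.2, 1.0])
x_s = np.array([1.0, 0.0, 2.0])    # feature vector x(s) for some state s

v_hat = theta @ x_s                # 0.5*1 - 0.2*0 + 1.0*2 = 2.5
grad = x_s                         # gradient of v_hat w.r.t. theta is x(s)
theta = theta + 0.1 * (3.0 - v_hat) * grad   # one step toward target 3.0
print(v_hat, theta)
```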
