References:
Reinforcement Learning: State-of-the-Art
Reinforcement Learning: An Introduction (Sutton & Barto)
Previous post:
https://blog.csdn.net/Jinyindao243052/article/details/107126041
On-policy Approximation of Action Values
Why approximate the Q-value function with a neural network
We have so far assumed that our estimates of value functions are represented as a table with one entry for each state or for each state-action pair. But this approach is limited to tasks with small numbers of states and actions. The problem is not just the memory needed for large tables, but the time and data needed to fill them accurately.
Gradient-Descent Methods
Two methods for gradient-based function approximation have been used widely in reinforcement learning. One is multilayer artificial neural networks using the error backpropagation algorithm.
This maps immediately onto the equations and algorithms just given, where the backpropagation process is the way of computing the gradients. The second popular form is the linear form, which we discuss extensively in the next section.
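As a concrete sketch (not from the book) of how backpropagation supplies the gradients that the gradient-descent update then uses, here is a one-hidden-layer value network in NumPy; the network shape, step size, state vector, and target are all illustrative assumptions:

```python
import numpy as np

# Hypothetical one-hidden-layer value network v_hat(s, w) for a state
# vector s. Backpropagation computes the gradient of v_hat with respect
# to every weight, which the gradient-descent update then uses.

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(4, 8))   # input -> hidden weights
W2 = rng.normal(scale=0.1, size=(8, 1))   # hidden -> output weights

def v_hat(s, W1, W2):
    h = np.tanh(s @ W1)                   # hidden activations
    return (h @ W2).item(), h

def gradients(s, W1, W2):
    """Backprop: gradient of the scalar output v_hat w.r.t. W1 and W2."""
    h = np.tanh(s @ W1)
    dW2 = h[:, None]                      # d v_hat / d W2
    dh = W2[:, 0] * (1.0 - h**2)          # chain rule through tanh
    dW1 = np.outer(s, dh)                 # d v_hat / d W1
    return dW1, dW2

# One gradient-descent step toward a training target U_t for state S_t:
alpha, s, target = 0.1, np.array([1.0, 0.5, -0.3, 0.2]), 1.0
v, _ = v_hat(s, W1, W2)
dW1, dW2 = gradients(s, W1, W2)
W1 += alpha * (target - v) * dW1
W2 += alpha * (target - v) * dW2
```

Each step moves the network's prediction for this state a little toward the target, with the gradient (from backprop) deciding how every weight shares in the correction.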
Multilayer artificial neural networks
Gradient-descent methods try to minimize error on the observed examples by adjusting the weight vector a small amount after each example, in the direction that most reduces the error:

$$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha\,\big[U_t - \hat v(S_t, \mathbf{w}_t)\big]\,\nabla \hat v(S_t, \mathbf{w}_t)$$
We turn now to the case in which the target output, $U_t$, of the $t$-th training example, $S_t \mapsto U_t$, is not the true value, $v_\pi(S_t)$, but some (possibly random) approximation of it. The return, $G_t$, is defined as some specific function of the reward sequence. In the simplest case the return is the sum of the rewards:

$$G_t = R_{t+1} + R_{t+2} + \cdots + R_T$$

With discounting, future rewards are weighted by powers of a discount rate $\gamma \in [0, 1]$:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

As $\gamma$ approaches 1, the objective takes future rewards into account more strongly.
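A short sketch of how the discount rate shapes the return; the reward sequence is a made-up example:

```python
# Discounted return G_t = R_{t+1} + gamma * R_{t+2} + ... for an
# illustrative reward sequence; larger gamma weights later rewards more.

def discounted_return(rewards, gamma):
    g = 0.0
    for r in reversed(rewards):          # G_t = R_{t+1} + gamma * G_{t+1}
        g = r + gamma * g
    return g

rewards = [0.0, 0.0, 1.0]                # reward arrives only at the end
print(discounted_return(rewards, 1.0))   # 1.0: undiscounted sum
print(discounted_return(rewards, 0.5))   # 0.25: future reward down-weighted
```

The backward recursion $G_t = R_{t+1} + \gamma G_{t+1}$ computes every return in one pass, which is also how the full-episode targets below are usually obtained.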
The value of $G_t$ is available only once a complete episode has ended, so Monte Carlo methods require complete episodes.
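A minimal sketch of the resulting Monte Carlo prediction loop, assuming a recorded episode of (state, reward) pairs and a one-hot linear value function — both illustrative, not from the book's code:

```python
import numpy as np

# Monte Carlo prediction sketch: targets G_t exist only after the episode
# terminates, so updates happen in a second pass over the finished episode.

n_states, alpha, gamma = 3, 0.5, 0.9
w = np.zeros(n_states)                    # linear weights, one per state

episode = [(0, 0.0), (1, 0.0), (2, 1.0)]  # (S_t, R_{t+1}) up to terminal

# First pass, backwards: compute the return G_t for every time step.
G, returns = 0.0, []
for s, r in reversed(episode):
    G = r + gamma * G
    returns.append((s, G))
returns.reverse()

# Second pass: gradient-descent update toward each complete return.
for s, G in returns:
    w[s] += alpha * (G - w[s])            # one-hot features: gradient = e_s
```

With one-hot features the general update $\mathbf{w} \leftarrow \mathbf{w} + \alpha\,[G_t - \hat v(S_t,\mathbf{w})]\,\nabla \hat v$ reduces to the familiar tabular Monte Carlo rule, which is why the linear case is a useful bridge between tables and neural networks.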