Through the page screenshotted from Reinforcement Learning: An Introduction (Sutton, 1998) in the earlier post 2017 Fall CS294 Lecture 6: Actor-critic introduction, you should by now have a solid understanding of these two concepts:

$V^\pi(s)$: the state-value function for policy $\pi$.
$Q^\pi(s,a)$: the action-value function for policy $\pi$.

We next define $V^\pi(s)$ and $Q^\pi(s,a)$ under the optimal policy to be the optimal value function $V^*(s)$ and the optimal action-value function $Q^*(s,a)$, respectively.
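Formally, following the book's definitions, the optimal value functions are the maxima over all policies:

$$V^*(s) = \max_\pi V^\pi(s), \qquad Q^*(s,a) = \max_\pi Q^\pi(s,a)$$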

From these definitions we can readily derive:

Bellman optimality equation for $V^*$ (often simply called the Bellman optimality equation):

$$V^*(s) = \max_a \sum_{s'} P(s' \mid s, a)\left[ R(s, a, s') + \gamma V^*(s') \right]$$

Bellman optimality equation for $Q^*$:

$$Q^*(s,a) = \sum_{s'} P(s' \mid s, a)\left[ R(s, a, s') + \gamma \max_{a'} Q^*(s', a') \right]$$
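The two equations are tied together by a pair of standard identities (also from the book), which is the reasoning step the screenshots compress:

$$V^*(s) = \max_a Q^*(s,a), \qquad Q^*(s,a) = \sum_{s'} P(s' \mid s, a)\left[ R(s,a,s') + \gamma V^*(s') \right]$$

Substituting each identity into the other recovers the two optimality equations above.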

Using either of the two equations above (for $V^*$ or for $Q^*$), we can write down a system of N equations, where N is the number of states. Solving this system solves the corresponding RL problem, as the book describes (screenshots taken directly from the original):

[Screenshots from the book: for a finite MDP, the Bellman optimality equation is a system of N nonlinear equations in N unknowns, which can in principle be solved when the dynamics of the environment are known.]
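To make this concrete, here is a minimal sketch of that idea on a toy MDP. Everything in it (the two states, two actions, transition tensor `P`, reward tensor `R`, and discount `gamma`) is made up for illustration; the fixed-point iteration it runs is value iteration, one standard way to solve this nonlinear system:

```python
import numpy as np

# Toy MDP, all numbers hypothetical: 2 states, 2 actions.
# P[a, s, s2] = probability of landing in s2 after taking a in s.
# R[a, s, s2] = reward received for that transition.
P = np.array([
    [[0.9, 0.1],   # action 0 taken in state 0
     [0.2, 0.8]],  # action 0 taken in state 1
    [[0.5, 0.5],   # action 1 taken in state 0
     [0.0, 1.0]],  # action 1 taken in state 1
])
R = np.array([
    [[1.0, 0.0],
     [0.0, 2.0]],
    [[0.0, 0.0],
     [0.0, 1.0]],
])
gamma = 0.9
n_states = P.shape[1]

# The Bellman optimality equation for V* is a system of n_states
# nonlinear equations. Solve it by fixed-point iteration (value
# iteration): V <- max_a sum_s2 P[a,s,s2] * (R[a,s,s2] + gamma * V[s2]).
V = np.zeros(n_states)
for _ in range(10_000):
    Q = (P * (R + gamma * V)).sum(axis=2)  # Q[a, s]
    V_new = Q.max(axis=0)                  # maximize over actions
    if np.abs(V_new - V).max() < 1e-10:
        break
    V = V_new

print("V* ≈", V)
print("greedy (optimal) policy:", Q.argmax(axis=0))
```

The policy that acts greedily with respect to the converged $V^*$ is optimal, which is exactly why solving this equation system would, in principle, solve the RL problem.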

But! Although solving the RL problem by directly solving the Bellman optimality equation looks straightforward, it is rarely usable in practice, for the following reasons:

As the book explains, this solution relies on at least three assumptions that are rarely true in practice: (1) we accurately know the dynamics of the environment; (2) we have enough computational resources to complete the computation of the solution; and (3) the Markov property holds. The second one is usually fatal on its own: backgammon, for example, has about $10^{20}$ states, so solving its Bellman equation for $V^*$ directly would take thousands of years even on today's fastest computers.

And that is why this post is titled "the awkward Bellman optimality equation"~
