Through the page screenshotted from Reinforcement Learning: An Introduction (Sutton, 1998) in the earlier post 2017 Fall CS294 Lecture 6: Actor-critic introduction, you should by now have a solid understanding of these two concepts:

$V^\pi(s)$: the state-value function for policy $\pi$.
$Q^\pi(s,a)$: the action-value function for policy $\pi$.

We next define $V^\pi(s)$ and $Q^\pi(s,a)$ under the optimal policy to be the optimal value function $V^*(s)$ and the optimal action-value function $Q^*(s,a)$, respectively.
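Formally, following the book's definitions, the optimal value functions are the maxima over all policies:

$$V^*(s) = \max_\pi V^\pi(s), \qquad Q^*(s,a) = \max_\pi Q^\pi(s,a)$$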

From these definitions we can readily derive:

Bellman optimality equation for $V^*$ (often simply called the Bellman optimality equation):

$$V^*(s) = \max_a \sum_{s'} P(s' \mid s, a)\left[ R(s, a, s') + \gamma V^*(s') \right]$$

Bellman optimality equation for $Q^*$:

$$Q^*(s,a) = \sum_{s'} P(s' \mid s, a)\left[ R(s, a, s') + \gamma \max_{a'} Q^*(s', a') \right]$$
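The two equations are tied together by a pair of standard identities (also from the book), which is the reasoning step the screenshots compress:

$$V^*(s) = \max_a Q^*(s,a), \qquad Q^*(s,a) = \sum_{s'} P(s' \mid s, a)\left[ R(s,a,s') + \gamma V^*(s') \right]$$

Substituting each identity into the other recovers the two optimality equations above.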

Using either of the two equations above (for $V^*$ or for $Q^*$), we can write down a system of N equations, where N is the number of states. Solving this system solves the corresponding RL problem, as the book describes (screenshots taken directly from the original):

[Screenshots from the book: for a finite MDP, the Bellman optimality equation is a system of N nonlinear equations in N unknowns, which can in principle be solved when the dynamics of the environment are known.]
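To make this concrete, here is a minimal sketch of that idea on a toy MDP. Everything in it (the two states, two actions, transition tensor `P`, reward tensor `R`, and discount `gamma`) is made up for illustration; the fixed-point iteration it runs is value iteration, one standard way to solve this nonlinear system:

```python
import numpy as np

# Toy MDP, all numbers hypothetical: 2 states, 2 actions.
# P[a, s, s2] = probability of landing in s2 after taking a in s.
# R[a, s, s2] = reward received for that transition.
P = np.array([
    [[0.9, 0.1],   # action 0 taken in state 0
     [0.2, 0.8]],  # action 0 taken in state 1
    [[0.5, 0.5],   # action 1 taken in state 0
     [0.0, 1.0]],  # action 1 taken in state 1
])
R = np.array([
    [[1.0, 0.0],
     [0.0, 2.0]],
    [[0.0, 0.0],
     [0.0, 1.0]],
])
gamma = 0.9
n_states = P.shape[1]

# The Bellman optimality equation for V* is a system of n_states
# nonlinear equations. Solve it by fixed-point iteration (value
# iteration): V <- max_a sum_s2 P[a,s,s2] * (R[a,s,s2] + gamma * V[s2]).
V = np.zeros(n_states)
for _ in range(10_000):
    Q = (P * (R + gamma * V)).sum(axis=2)  # Q[a, s]
    V_new = Q.max(axis=0)                  # maximize over actions
    if np.abs(V_new - V).max() < 1e-10:
        break
    V = V_new

print("V* ≈", V)
print("greedy (optimal) policy:", Q.argmax(axis=0))
```

The policy that acts greedily with respect to the converged $V^*$ is optimal, which is exactly why solving this equation system would, in principle, solve the RL problem.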

But! Although solving the RL problem by directly solving the Bellman optimality equation looks straightforward, it is rarely usable in practice, for the following reasons:

As the book explains, this solution relies on at least three assumptions that are rarely true in practice: (1) we accurately know the dynamics of the environment; (2) we have enough computational resources to complete the computation of the solution; and (3) the Markov property holds. The second one is usually fatal on its own: backgammon, for example, has about $10^{20}$ states, so solving its Bellman equation for $V^*$ directly would take thousands of years even on today's fastest computers.

And that is why this post is titled "the awkward Bellman optimality equation"~
