Reinforcement Learning: value function approximation

Introduction

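The previous section covered sampling-based methods for problems where the state and action spaces are fairly small. When the state and action spaces are very large, we need approximation instead: generalize from the states we have seen, so that states we have never visited also get a value.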

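Value function approximation fits a function with weights w, either to state values, $\hat{v}(s, \mathbf{w}) \approx v_\pi(s)$, or to action values, $\hat{q}(s, a, \mathbf{w}) \approx q_\pi(s, a)$.
[Figure: forms of value function approximation]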

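The approximator can be a linear combination of features, a neural network, a decision tree, nearest neighbours (KNN), a Fourier/wavelet basis, and so on. Because the weights w have to be updated, a differentiable approximator is usually chosen.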

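Gradient Descent: use gradient descent to update the weights w. This is the same as in linear regression: find the w that minimizes the mean-squared error between the true value and the approximation, $J(\mathbf{w}) = \mathbb{E}_\pi[(v_\pi(S) - \hat{v}(S, \mathbf{w}))^2]$. Stochastic GD estimates the gradient from samples, giving the update $\Delta\mathbf{w} = \alpha\,(v_\pi(S) - \hat{v}(S, \mathbf{w}))\,\nabla_{\mathbf{w}} \hat{v}(S, \mathbf{w})$.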
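As a minimal sketch of the stochastic gradient step for a linear approximator, the snippet below fits weights to sampled targets. The feature vectors, the "true" weights `true_w`, and the step size are all made up for illustration; in real RL the true value is not available and gets replaced by an estimate.

```python
import random

def v_hat(x, w):
    """Linear value estimate: dot product of features and weights."""
    return sum(xi * wi for xi, wi in zip(x, w))

def sgd_step(w, x, target, alpha):
    """One SGD step on 0.5 * (target - v_hat)^2.

    For a linear model the gradient of v_hat with respect to w is just x.
    """
    error = target - v_hat(x, w)
    return [wi + alpha * error * xi for wi, xi in zip(w, x)]

# Hypothetical data: the value to fit is 2*x0 - 1*x1, seen one sample at a time.
random.seed(0)
true_w = [2.0, -1.0]
w = [0.0, 0.0]
for _ in range(2000):
    x = [random.uniform(-1, 1), random.uniform(-1, 1)]
    w = sgd_step(w, x, v_hat(x, true_w), alpha=0.1)
```

After the loop, `w` should be very close to `true_w`, since the targets here are noise-free and exactly linear in the features.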

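One case here is fitting the state-value function with a linear model, $\hat{v}(S, \mathbf{w}) = \mathbf{x}(S)^\top \mathbf{w}$. It is very simple: the gradient is just the feature vector $\mathbf{x}(S)$, so the weights are updated exactly as in linear regression.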

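Table Lookup Features: this is just a lookup table. The feature vector is a one-hot indicator of the current state, so with n states there are n features. w then also has exactly n entries, one per state, and each entry is the value of that state.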
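To make the table-lookup case concrete, here is a small sketch with one-hot features (the three-state setup and the numbers are only for illustration). With these features the linear update touches exactly one weight, which is the table entry of the visited state.

```python
def one_hot(state, n_states):
    """Table-lookup features: 1 at the index of the current state, 0 elsewhere."""
    return [1.0 if i == state else 0.0 for i in range(n_states)]

def v_hat(state, w):
    """Linear estimate; with one-hot features this is simply w[state]."""
    x = one_hot(state, len(w))
    return sum(xi * wi for xi, wi in zip(x, w))

def update(w, state, target, alpha):
    """Gradient step; only the weight of the visited state changes."""
    x = one_hot(state, len(w))
    error = target - v_hat(state, w)
    return [wi + alpha * error * xi for wi, xi in zip(w, x)]

w = [0.0, 0.0, 0.0]
w = update(w, state=1, target=10.0, alpha=0.5)
print(w)  # [0.0, 5.0, 0.0]
```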

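Incremental prediction algorithms: this part is fairly simple. In the linear-regression weight update above, replace the true value with the estimate each method produces: the return $G_t$ for MC, the TD target $R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w})$ for TD(0), and the λ-return $G_t^\lambda$ for TD(λ).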
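The sketch below shows the same linear update driven by two different targets, the MC return and the TD(0) target. The environment is a made-up deterministic chain (0 -> 1 -> 2 -> 3 -> terminal, reward 1 per step), and the features `[1, s]` are an assumption chosen so that the true values `4 - s` are exactly representable.

```python
GAMMA = 1.0
TERMINAL = 4

def features(s):
    """Hypothetical features: a bias term and the state index."""
    return [1.0, float(s)]

def v_hat(s, w):
    return sum(xi * wi for xi, wi in zip(features(s), w))

def episode():
    """One pass through the chain as (state, reward, next_state) tuples."""
    return [(s, 1.0, s + 1) for s in range(4)]

def step_towards(w, s, target, alpha):
    """Move w so that v_hat(s, w) gets closer to the given target."""
    error = target - v_hat(s, w)
    return [wi + alpha * error * xi for wi, xi in zip(w, features(s))]

def mc_prediction(n_episodes=2000, alpha=0.02):
    w = [0.0, 0.0]
    for _ in range(n_episodes):
        g = 0.0
        for s, r, _ in reversed(episode()):
            g = r + GAMMA * g                      # MC target: the return G_t
            w = step_towards(w, s, g, alpha)
    return w

def td0_prediction(n_episodes=2000, alpha=0.02):
    w = [0.0, 0.0]
    for _ in range(n_episodes):
        for s, r, s_next in episode():
            v_next = 0.0 if s_next == TERMINAL else v_hat(s_next, w)
            # TD(0) target: reward plus discounted estimate of the next state
            w = step_towards(w, s, r + GAMMA * v_next, alpha)
    return w

w_mc = mc_prediction()
w_td = td0_prediction()
```

In this toy setting both runs should end up near `w = [4, -1]`, i.e. `v_hat(s) = 4 - s`. The only difference between the two functions is the target passed to `step_towards`.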


For control, the quantity to approximate is the action-value function:
$$\hat{q}(S, A, \mathbf{w}) \approx q_\pi(S, A)$$

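Optimization again works as in linear regression. Replacing the true action value with the estimate from each method gives the update rule for that method:
MC: $\Delta\mathbf{w} = \alpha\,(G_t - \hat{q}(S_t, A_t, \mathbf{w}))\,\nabla_{\mathbf{w}} \hat{q}(S_t, A_t, \mathbf{w})$
TD(0): $\Delta\mathbf{w} = \alpha\,(R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}) - \hat{q}(S_t, A_t, \mathbf{w}))\,\nabla_{\mathbf{w}} \hat{q}(S_t, A_t, \mathbf{w})$
TD(λ), forward view: $\Delta\mathbf{w} = \alpha\,(q_t^\lambda - \hat{q}(S_t, A_t, \mathbf{w}))\,\nabla_{\mathbf{w}} \hat{q}(S_t, A_t, \mathbf{w})$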
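A minimal sketch of the TD(0) control update (SARSA) with a linear action-value function follows. The corridor environment, the one-hot state-action features, and all constants are assumptions made for illustration; with one-hot features the model reduces to a table, but richer features plug into the same `sarsa_update`.

```python
import random

N_STATES, ACTIONS, GOAL = 3, (0, 1), 3     # corridor; action 0 = left, 1 = right
GAMMA, ALPHA, EPSILON = 1.0, 0.05, 0.1

def features(s, a):
    """Hypothetical one-hot features over (state, action) pairs."""
    x = [0.0] * (N_STATES * len(ACTIONS))
    x[s * len(ACTIONS) + a] = 1.0
    return x

def q_hat(s, a, w):
    return sum(xi * wi for xi, wi in zip(features(s, a), w))

def epsilon_greedy(s, w, rng):
    if rng.random() < EPSILON:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q_hat(s, a, w))

def env_step(s, a):
    """Reward -1 per step; moving left at state 0 stays in place."""
    s_next = max(0, s - 1) if a == 0 else s + 1
    return -1.0, s_next

def sarsa_update(w, s, a, r, s_next, a_next):
    """TD(0) target uses the action actually chosen in the next state."""
    q_next = 0.0 if s_next == GOAL else q_hat(s_next, a_next, w)
    error = r + GAMMA * q_next - q_hat(s, a, w)
    return [wi + ALPHA * error * xi for wi, xi in zip(w, features(s, a))]

def train(n_episodes=3000, seed=0):
    rng = random.Random(seed)
    w = [0.0] * (N_STATES * len(ACTIONS))
    for _ in range(n_episodes):
        s = 0
        a = epsilon_greedy(s, w, rng)
        while s != GOAL:
            r, s_next = env_step(s, a)
            a_next = epsilon_greedy(s_next, w, rng) if s_next != GOAL else None
            w = sarsa_update(w, s, a, r, s_next, a_next)
            s, a = s_next, a_next
    return w

w = train()
greedy = [max(ACTIONS, key=lambda a: q_hat(s, a, w)) for s in range(N_STATES)]
```

In this toy corridor the greedy policy read off from `w` should move right in every state.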

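Batch methods process a large set of data at once to update the weights. For example, with T samples processed together, the optimization objective is least squares:
$$LS(\mathbf{w}) = \sum_{t=1}^{T} \left(v_t^\pi - \hat{v}(s_t, \mathbf{w})\right)^2$$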

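For prediction with linear least squares there is a closed-form solution, so w can be computed directly:
$$\mathbf{w} = \left(\sum_{t=1}^{T} \mathbf{x}(s_t)\,\mathbf{x}(s_t)^\top\right)^{-1} \sum_{t=1}^{T} \mathbf{x}(s_t)\,v_t^\pi$$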
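The closed-form solution can be sketched directly from the normal equations. The batch of (state, value) samples and the features `[1, s]` below are made up for illustration, and the small Gauss-Jordan solver is only there to keep the example dependency-free; in practice a linear algebra library would do this step.

```python
def features(s):
    """Hypothetical features: a bias term and the state index."""
    return [1.0, float(s)]

def solve(a, b):
    """Solve a @ w = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]      # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [mr - factor * mc for mr, mc in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def least_squares_weights(samples):
    """w = (sum x x^T)^-1 (sum x v) for a batch of (state, value) samples."""
    n = len(features(samples[0][0]))
    a = [[0.0] * n for _ in range(n)]
    b = [0.0] * n
    for s, v in samples:
        x = features(s)
        for i in range(n):
            b[i] += x[i] * v
            for j in range(n):
                a[i][j] += x[i] * x[j]
    return solve(a, b)

# Hypothetical batch: observed values 4 - s for states 0..3.
batch = [(0, 4.0), (1, 3.0), (2, 2.0), (3, 1.0)]
w = least_squares_weights(batch)
```

For this batch the result should be approximately `[4, -1]`, obtained in one shot with no step size and no iteration.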

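DQN: the steps are as follows (I still need to think these through more carefully). DQN combines experience replay with fixed Q-targets:
1. Take action $a_t$ according to an ε-greedy policy.
2. Store the transition $(s_t, a_t, r_{t+1}, s_{t+1})$ in replay memory D.
3. Sample a random mini-batch of transitions $(s, a, r, s')$ from D.
4. Compute the Q-learning targets using old, fixed parameters $\mathbf{w}^-$.
5. Minimize the mean-squared error between the Q-network and the Q-learning targets with a variant of stochastic gradient descent.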
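The two ingredients that set DQN apart, experience replay and fixed Q-targets, can be sketched without a deep network. The snippet below is not DQN itself: it swaps the Q-network for a table of weights (one per state-action pair) on a made-up corridor environment, and every name and constant in it is an assumption for illustration. The structure of the loop is the part that carries over.

```python
import random
from collections import deque

N_STATES, ACTIONS, GOAL = 3, (0, 1), 3     # corridor; action 0 = left, 1 = right
GAMMA, ALPHA, EPSILON = 1.0, 0.1, 0.1

def q_hat(s, a, w):
    """Stand-in for the Q-network: one weight per (state, action) pair."""
    return w[s * len(ACTIONS) + a]

def env_step(s, a):
    """Reward -1 per step; moving left at state 0 stays in place."""
    return -1.0, (max(0, s - 1) if a == 0 else s + 1)

def train(n_episodes=1000, batch_size=8, sync_every=20, seed=0):
    rng = random.Random(seed)
    w = [0.0] * (N_STATES * len(ACTIONS))   # online weights
    w_target = list(w)                      # frozen copy used for targets
    memory = deque(maxlen=1000)             # replay memory D
    steps = 0
    for _ in range(n_episodes):
        s = 0
        while s != GOAL:
            # Act epsilon-greedily with the online weights.
            if rng.random() < EPSILON:
                a = rng.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda act: q_hat(s, act, w))
            r, s_next = env_step(s, a)
            memory.append((s, a, r, s_next))            # store the transition
            # Sample a random mini-batch from the replay memory.
            batch = rng.sample(list(memory), min(batch_size, len(memory)))
            for bs, ba, br, bs_next in batch:
                # Q-learning target computed from the frozen weights.
                if bs_next == GOAL:
                    target = br
                else:
                    target = br + GAMMA * max(q_hat(bs_next, b, w_target) for b in ACTIONS)
                # Gradient step on the squared error for this sample.
                i = bs * len(ACTIONS) + ba
                w[i] += ALPHA * (target - w[i])
            steps += 1
            if steps % sync_every == 0:
                w_target = list(w)                      # refresh the frozen copy
            s = s_next
    return w

w = train()
greedy = [max(ACTIONS, key=lambda a: q_hat(s, a, w)) for s in range(N_STATES)]
```

Sampling from `memory` breaks the correlation between consecutive transitions, and computing targets from `w_target` keeps the regression target still while `w` moves between syncs.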

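The slides then work through a series of calculations showing how to compute w for prediction and for control under the different algorithms. You only really get a feel for these once you actually write the code. I skimmed through this set of slides quickly this time; going through it a second time should give a different understanding.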