Lecture1: Introduction to Reinforcement Learning

Course video: https://www.bilibili.com/video/av45357759
Course materials: http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html

Admin

Textbooks

An Introduction to Reinforcement Learning, Sutton and Barto
The "bible" of RL; relatively theory-heavy

Algorithms for Reinforcement Learning, Szepesvari
A shorter book

About Reinforcement Learning

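Characteristics of reinforcement learning

There is no supervisor, only a reward signal
Feedback may be delayed
Time really matters: the data is sequential, not i.i.d.
The agent's actions affect the subsequent data it receives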

The Reinforcement Learning Problem

Reward

The reward $R_t$ is a scalar feedback signal that indicates how well the agent is doing at step t. The agent's job is to maximise cumulative reward.

Definition (Reward Hypothesis)
All goals can be described by the maximisation of expected cumulative reward

Environments

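The agent executes an action based on the observation and reward it receives.
The interaction between the agent and the environment proceeds as follows.

At each step t, the agent:
- Receives observation $O_t$
- Receives scalar reward $R_t$
- Executes action $A_t$
The environment:
- Receives the action $A_t$ executed by the agent
- Emits observation $O_{t+1}$
- Emits scalar reward $R_{t+1}$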
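
The step-by-step exchange above can be sketched as a plain loop. This is a minimal illustration, not code from the course: `CoinFlipEnv`, `run_episode`, and the convention that the first reward is 0.0 are all assumptions made up for the example.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: the agent guesses a biased coin flip."""

    def __init__(self, p_heads=0.8, seed=0):
        self.p_heads = p_heads
        self.rng = random.Random(seed)

    def reset(self):
        # First observation O_1; it carries no information in this toy setup.
        return 0

    def step(self, action):
        # Receive A_t, then emit O_{t+1} and R_{t+1}.
        outcome = 1 if self.rng.random() < self.p_heads else 0
        reward = 1.0 if action == outcome else 0.0
        return outcome, reward


def run_episode(env, choose_action, num_steps):
    """Run the agent-environment loop and return (total reward, history)."""
    observation, reward = env.reset(), 0.0  # assumed: R_1 = 0.0
    history = []  # one (O_t, R_t, A_t) tuple per step
    total_reward = 0.0
    for _ in range(num_steps):
        action = choose_action(observation, reward)  # agent: O_t, R_t -> A_t
        history.append((observation, reward, action))
        observation, reward = env.step(action)  # environment: A_t -> O_{t+1}, R_{t+1}
        total_reward += reward
    return total_reward, history
```

For example, `run_episode(CoinFlipEnv(), lambda o, r: 1, 100)` runs an agent that always guesses heads for 100 steps.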

History

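The history is the sequence of observations, actions, and rewards:
$H_t = O_1,R_1,A_1,\dots,A_{t-1},O_t,R_t$
The agent selects its next action based on the history, and the environment responds with observations and rewards.
The history can be viewed as the agent's experience; RL learns from this stream of data.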

State

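The history is usually enormous and impractical to compute with, so we normally work with the state instead. The state stands in for the history: it contains the information we need and determines the next action.
The state is a function of the history: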
$S_t = f(H_t)$

Environment State $S_t^e$

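The environment state determines what the environment does next, i.e. which observation and reward it emits.
$S_t^e$ is usually not visible to the agent
$S_t^e$ may contain a lot of irrelevant information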

Agent State $S_t^a$

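$S_t^a$ determines which action the agent takes next
$S_t^a$ contains the information used by the RL algorithm
It can be any function of the history: $S_t^a = f(H_t)$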

Information State (Markov State)

An information state contains all the useful information from the history.

Markov property
$\mathbb{P}[S_{t+1}|S_t] = \mathbb{P}[S_{t+1}|S_1,\dots,S_t]$
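The future is independent of the past given the present: the next state depends only on the current state, not on earlier ones.

If the state $S_t$ is Markov, then once the current state is known the history can be discarded, because the state can stand in for the entire history.
$H_{1:t} \rightarrow S_t \rightarrow H_{t+1:\infty}$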

Fully Observable Environment

$O_t = S_t^a = S_t^e$

agent state = environment state = information state
Formally, this is a Markov decision process (MDP)

Partially Observable Environment

The agent observes only part of the environment, through its observations

agent state ≠ environment state
Formally, this is a partially observable Markov decision process (POMDP)
The agent must construct its own state representation $S_t^a$, e.g.
- Complete history: $S_t^a = H_t$
- Beliefs about the environment state: $S_t^a = (\mathbb{P}[S_t^e=s^1],\cdots,\mathbb{P}[S_t^e=s^n])$
- Recurrent neural network: $S_t^a = \sigma(S_{t-1}^aW_s+O_tW_o)$
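
As a rough illustration of the recurrent option in the last bullet, the sketch below computes one state update in pure Python: the new agent state is a sigmoid of the previous state times a state weight matrix plus the observation times an observation weight matrix. The function names, the row-vector convention, and the list-of-lists weight layout are assumptions for this example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def rnn_state_update(prev_state, observation, w_state, w_obs):
    """One recurrent update: sigma(prev_state @ w_state + observation @ w_obs).

    prev_state and observation are row vectors (lists of floats);
    w_state has shape [len(prev_state)][n], w_obs has shape [len(observation)][n].
    """
    n = len(w_state[0])
    new_state = []
    for j in range(n):
        pre = sum(prev_state[i] * w_state[i][j] for i in range(len(prev_state)))
        pre += sum(observation[k] * w_obs[k][j] for k in range(len(observation)))
        new_state.append(sigmoid(pre))
    return new_state
```

Calling the function once per time step, feeding each result back in as `prev_state`, gives an agent state that summarises the observations seen so far without storing the full history.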

Inside An RL Agent

Policy: agent’s behaviour function

A map from state to action.

Deterministic policy

$a = \pi (s)$

Stochastic policy

$\pi(a|s) = \mathbb{P}[A_t=a|S_t=s]$
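
The difference between the two kinds of policy can be shown with a small sketch. The states (`low_battery`, `high_battery`), the actions, and the probabilities are hypothetical values chosen only for illustration.

```python
import random

# Deterministic policy: a = pi(s), one fixed action per state.
DETERMINISTIC_POLICY = {"low_battery": "recharge", "high_battery": "search"}

# Stochastic policy: pi(a|s), a probability distribution over actions per state.
STOCHASTIC_POLICY = {
    "low_battery": {"recharge": 0.9, "search": 0.1},
    "high_battery": {"recharge": 0.2, "search": 0.8},
}


def deterministic_policy(state):
    return DETERMINISTIC_POLICY[state]


def sample_action(policy, state, rng):
    """Draw an action A_t ~ pi(.|S_t = state)."""
    actions = list(policy[state])
    weights = [policy[state][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]
```

With `rng = random.Random(0)`, repeated calls to `sample_action(STOCHASTIC_POLICY, "low_battery", rng)` return `"recharge"` most of the time and `"search"` occasionally, while `deterministic_policy("low_battery")` always returns `"recharge"`.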

Value function: how good is each state and/or action

A value function is a prediction of future reward
$v_{\pi}(s) = \mathbb{E}_{\pi}[R_{t+1}+\gamma R_{t+2} + \gamma ^2 R_{t+3} + \cdots | S_t = s]$
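
The quantity inside the expectation is the discounted return. A minimal sketch of computing it for one sampled reward sequence, and of approximating the expectation by averaging over several sampled sequences, is given below. The function names are made up for this example, and averaging sampled returns is just one simple way to estimate the expectation.

```python
def discounted_return(rewards, gamma):
    """Return R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ...

    rewards is the list [R_{t+1}, R_{t+2}, ...] observed after time t.
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g  # fold from the end: G = r + gamma * G_next
    return g


def monte_carlo_value(reward_sequences, gamma):
    """Estimate v_pi(s) as the mean return over episodes that start in s."""
    returns = [discounted_return(rs, gamma) for rs in reward_sequences]
    return sum(returns) / len(returns)
```

For instance, with rewards `[1, 1, 1]` and `gamma = 0.5` the return is 1 + 0.5 + 0.25 = 1.75.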

Model: agent’s representation of the environment

A model is not the environment itself; it learns how the environment behaves and can predict what the environment will do next

Transition model

Predicts the next state
$P_{ss'}^a = \mathbb{P}[S_{t+1} = s' | S_t = s,A_t = a]$

Reward model

Predicts the next reward
$R_{s}^a = \mathbb{E}[R_{t+1} | S_t = s,A_t = a]$
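
One simple way an agent could obtain such a model is to count what it has experienced. The sketch below estimates the transition probabilities and expected rewards from a list of observed transitions. It assumes a small, discrete state and action space; `fit_tabular_model` and the `(s, a, r, s_next)` tuple layout are hypothetical choices for this example.

```python
from collections import defaultdict


def fit_tabular_model(transitions):
    """Estimate a transition model P and a reward model R from experience.

    transitions: iterable of (s, a, r, s_next) tuples.
    Returns P[(s, a)][s_next] ~ P[S_{t+1} = s_next | S_t = s, A_t = a]
    and     R[(s, a)]         ~ E[R_{t+1} | S_t = s, A_t = a].
    """
    next_counts = defaultdict(lambda: defaultdict(int))
    reward_sums = defaultdict(float)
    visits = defaultdict(int)
    for s, a, r, s_next in transitions:
        next_counts[(s, a)][s_next] += 1
        reward_sums[(s, a)] += r
        visits[(s, a)] += 1
    P = {
        sa: {s2: c / visits[sa] for s2, c in counts.items()}
        for sa, counts in next_counts.items()
    }
    R = {sa: reward_sums[sa] / visits[sa] for sa in visits}
    return P, R
```

The estimates are empirical frequencies and sample means, so they are only as good as the coverage of the experience they are built from.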

Categorizing RL agents (1)

Value Based
- No Policy(Implicit)
- Value Function
Policy Based
- Policy
- No Value Function
Actor Critic
- Policy
- Value Function
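Categorizing RL agents (2)
- Model Free
  Model-free methods do not need to know how the environment changes; they work only with a policy and/or a value function.
  - Policy and/or Value Function
  - No Model
- Model Based
  Model-based methods usually start by building a model that predicts how the environment changes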
  - Policy and/or Value Function
  - Model
    ![Figure](https://img-blog.csdnimg.cn/2020031009480599.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L0RyZWFtX3hk,size_5 ,color_FFFFFF,t_70)

Problems with Reinforcement Learning

Learning and Planning

Sequential decision making comes in two fundamental forms

Reinforcement Learning

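The environment is initially unknown
The agent interacts with the environment
The agent improves its policy through this interaction
Atari games are an example of RL: the rules of the game are unknown to the agent, which works out how the emulator behaves through repeated trial and error.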

Planning

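A model of the environment is known; the agent knows the rules of the game
The agent usually does not need to interact with the environment; it only needs some time to compute, because it can decide its next action directly from the model
The agent improves its policy
In planning, the agent is told the rules of the game. It can use them to search over the rewards that follow each action and then pick the action that maximises reward.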
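
A minimal sketch of this kind of search, assuming a known model stored as a dictionary, is a depth-limited lookahead: expand every action to a fixed depth, add up the rewards along the way, and choose the action with the highest total. The model layout `model[state][action] = [(prob, next_state, reward), ...]`, the four-state example, and the function names are assumptions for illustration.

```python
# Hypothetical known model: model[state][action] = [(prob, next_state, reward), ...]
MODEL = {
    "start": {"left": [(1.0, "pit", 1.0)], "right": [(1.0, "hall", 0.0)]},
    "pit": {"stay": [(1.0, "pit", 0.0)]},
    "hall": {"forward": [(1.0, "goal", 10.0)]},
    "goal": {"stay": [(1.0, "goal", 0.0)]},
}


def action_value(model, state, action, depth, gamma):
    """Expected reward of taking `action`, then acting greedily for depth - 1 more steps."""
    total = 0.0
    for prob, next_state, reward in model[state][action]:
        total += prob * (reward + gamma * state_value(model, next_state, depth - 1, gamma))
    return total


def state_value(model, state, depth, gamma):
    """Best achievable expected reward from `state` within `depth` steps."""
    if depth == 0:
        return 0.0
    return max(action_value(model, state, a, depth, gamma) for a in model[state])


def plan(model, state, depth, gamma=1.0):
    """Pick the action that maximises reward over a `depth`-step lookahead."""
    return max(model[state], key=lambda a: action_value(model, state, a, depth, gamma))
```

In this toy model a one-step search from `start` prefers `left` (immediate reward 1), while a two-step search prefers `right`, because it sees the reward of 10 one step further on. No interaction with a real environment is needed; everything is computed from the model.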

Exploration and Exploitation

Exploration finds out more information about the environment
Exploitation makes better use of the information already known to maximise reward
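
One common way to balance the two is epsilon-greedy action selection, which these notes do not cover; it is used here only as an illustration. With probability epsilon the agent picks a random action (exploration), otherwise it picks the action with the highest estimated reward (exploitation). The two-armed Bernoulli bandit and all names below are assumptions for this sketch.

```python
import random


def epsilon_greedy(q_estimates, epsilon, rng):
    """Return an action index: random with probability epsilon, else greedy."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_estimates))  # explore
    return max(range(len(q_estimates)), key=lambda a: q_estimates[a])  # exploit


def run_bandit(true_means, epsilon, steps, seed=0):
    """Play a Bernoulli bandit; return (reward estimates, pull counts, total reward)."""
    rng = random.Random(seed)
    q = [0.0] * len(true_means)  # estimated mean reward per action
    n = [0] * len(true_means)    # number of times each action was tried
    total = 0.0
    for _ in range(steps):
        a = epsilon_greedy(q, epsilon, rng)
        reward = 1.0 if rng.random() < true_means[a] else 0.0
        n[a] += 1
        q[a] += (reward - q[a]) / n[a]  # incremental sample mean
        total += reward
    return q, n, total
```

With `epsilon = 0` the agent never explores and can stay stuck on whichever action looked good first; a small positive epsilon lets it keep sampling the other actions and correct its estimates.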

Prediction and Control

Prediction: given a policy, evaluate the future reward
Control: find the optimal policy to optimise the future reward
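
The contrast can be made concrete on a tiny known MDP. The sketch below uses iterative policy evaluation for prediction and value iteration for control; both are standard dynamic-programming methods that these notes do not introduce, so treat them as one possible choice. The two-state MDP, its rewards, and the function names are made up for this example.

```python
# Hypothetical MDP: MDP[state][action] = [(prob, next_state, reward), ...]
MDP = {
    "A": {"stay": [(1.0, "A", 0.0)], "move": [(1.0, "B", 1.0)]},
    "B": {"stay": [(1.0, "B", 2.0)], "move": [(1.0, "A", 0.0)]},
}


def backup(mdp, values, state, action, gamma):
    """Expected immediate reward plus discounted value of the next state."""
    return sum(p * (r + gamma * values[s2]) for p, s2, r in mdp[state][action])


def evaluate_policy(mdp, policy, gamma, sweeps=200):
    """Prediction: compute the value of each state under a fixed policy."""
    values = {s: 0.0 for s in mdp}
    for _ in range(sweeps):
        values = {s: backup(mdp, values, s, policy[s], gamma) for s in mdp}
    return values


def find_optimal_policy(mdp, gamma, sweeps=200):
    """Control: value iteration, then act greedily with respect to the result."""
    values = {s: 0.0 for s in mdp}
    for _ in range(sweeps):
        values = {s: max(backup(mdp, values, s, a, gamma) for a in mdp[s]) for s in mdp}
    policy = {s: max(mdp[s], key=lambda a: backup(mdp, values, s, a, gamma)) for s in mdp}
    return policy, values
```

With `gamma = 0.5`, evaluating the policy "always stay" gives values of 0 for A and 4 for B, while the control step finds that moving from A to B and then staying is better, raising the value of A to 3.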
