AI - Reinforcement

MDP (Markov Decision Process)

State: S
Action: A
Transition Function

$$T(s, a, s') = P(S_{t+1} = s' \mid S_t = s, A_t = a)$$

Reward Function

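$R(s)$, $R(s, a)$, or $R(s, a, s')$, depending on whether the reward depends on the state alone, on the state and action, or on the full transition.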
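As a rough illustration, the four components listed here could be bundled into a single Python structure. The names `MDP`, `states`, `actions`, `transition`, `reward` and `is_valid` are hypothetical, chosen for this sketch only, and the two-state example is made up purely to exercise the code. The reward is written in the R(s, a, s′) form; the simpler forms can ignore the unused arguments.

```python
from dataclasses import dataclass
from typing import Callable, Hashable, Sequence

State = Hashable
Action = Hashable


@dataclass(frozen=True)
class MDP:
    """Hypothetical container for the components of an MDP."""
    states: Sequence[State]
    actions: Sequence[Action]
    # transition(s, a, s_next) = P(S_{t+1} = s_next | S_t = s, A_t = a)
    transition: Callable[[State, Action, State], float]
    # reward(s, a, s_next); R(s) or R(s, a) just ignore the extra arguments
    reward: Callable[[State, Action, State], float]

    def is_valid(self, tol: float = 1e-9) -> bool:
        """Check that T(s, a, .) sums to 1 for every state-action pair."""
        return all(
            abs(sum(self.transition(s, a, s2) for s2 in self.states) - 1.0) <= tol
            for s in self.states
            for a in self.actions
        )


# Made-up two-state example, only to exercise the structure.
toy = MDP(
    states=[0, 1],
    actions=["stay", "flip"],
    # "stay" keeps the state with probability 0.9, "flip" changes it with 0.9
    transition=lambda s, a, s2: 0.9 if (s2 == s) == (a == "stay") else 0.1,
    reward=lambda s, a, s2: 1.0 if s2 == 1 else 0.0,
)
print(toy.is_valid())
```

The `is_valid` check reflects the fact that, for a fixed state and action, the transition function is a probability distribution over next states.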

If the initial state is taken as the root, an AND/OR tree can be used to represent the problem.

Example: suppose the occurrence probabilities for one kind of agent are known and given as follows (i: row; j: column):

$$
P_1(i, j) =
\begin{bmatrix}
0.3 & 0.2 & 0.2 & 0.1 & 0.2 \\
0.3 & 0.2 & 0.2 & 0.1 & 0.2 \\
0.3 & 0.2 & 0.2 & 0.1 & 0.2 \\
0.3 & 0.2 & 0.2 & 0.1 & 0.2 \\
0.3 & 0.2 & 0.2 & 0.1 & 0.2
\end{bmatrix}
$$

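From the probabilities above, together with the specific scenario, the transition function can be derived as follows, where rows and columns are indexed from 0 and n is the largest column index (n = 4 here):

$$
T_1(i, j) =
\begin{cases}
0, & i < j \\
P_1(i, i - j), & 0 < j \le i \\
\sum_{x = i}^{n} P_1(i, x), & j = 0
\end{cases}
$$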

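When j = 0, apply the last case of the formula: add up the highlighted (purple) region of the original figure, that is, the entries P1(i, x) with x ≥ i. This gives every value in the j = 0 column: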

T1(0,0) = 0.3+0.2+0.2+0.1+0.2 = 1
T1(1,0) = 0.2+0.2+0.1+0.2 = 0.7
T1(2,0) = 0.2+0.1+0.2 = 0.5
…

$$
T_1(i, j) =
\begin{bmatrix}
1 & 0 & 0 & 0 & 0 \\
0.7 & 0.3 & 0 & 0 & 0 \\
0.5 & 0.2 & 0.3 & 0 & 0 \\
0.3 & 0.2 & 0.2 & 0.3 & 0 \\
0.2 & 0.1 & 0.2 & 0.2 & 0.3
\end{bmatrix}
$$
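The construction of T1 can be sketched in a few lines of Python. This is only an illustration: the function name `build_transition` is hypothetical, indices are assumed to start at 0, and the middle case is read as P1(i, i − j), since that reading is the one that reproduces the T1 matrix shown here.

```python
# Every row of P1 holds the same distribution, as in the example.
P1 = [[0.3, 0.2, 0.2, 0.1, 0.2] for _ in range(5)]


def build_transition(P):
    """Build T(i, j) from P using the three-case rule."""
    size = len(P)
    n = size - 1  # largest column index
    T = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            if i < j:
                T[i][j] = 0.0  # states above i are unreachable
            elif j == 0:
                # all remaining probability mass lands on j = 0
                T[i][j] = sum(P[i][x] for x in range(i, n + 1))
            else:  # 0 < j <= i
                T[i][j] = P[i][i - j]
    return T


T1 = build_transition(P1)
for row in T1:
    print([round(v, 2) for v in row])
```

Each printed row should sum to 1, which is a quick sanity check that T1(i, ·) is a valid probability distribution over next states.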

Now suppose both P1 and P2 are present:
In current state s1, action a1 moves the agent to next state s1′.
In current state s2, action a2 moves the agent to next state s2′.

$$
\begin{aligned}
T(s, a, s') &= T\big((s_1, s_2), (a_1, a_2), (s_1', s_2')\big) \\
&= T_1(s_1 + a_1, s_1') \cdot T_2(s_2 + a_2, s_2')
\end{aligned}
$$

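Suppose state 1 is 2 and state 2 is 1 (s1 = 2, s2 = 1), the corresponding actions are 1 and 0 (a1 = 1, a2 = 0), and the next-step states are 1 and 0 (s1′ = 1, s2′ = 0):

$$
\begin{aligned}
T\big((2, 1), (1, 0), (1, 0)\big) &= T_1(2 + 1, 1) \cdot T_2(1 + 0, 0) \\
&= T_1(3, 1) \cdot T_2(1, 0) \\
&= 0.06
\end{aligned}
$$

In the T1 matrix, the entry at row i = 3, column j = 1 is 0.2. Assuming T2(1,0) = 0.3, the result for the example above is 0.2 ⋅ 0.3 = 0.06.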
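The same lookup can be written as a small Python sketch. The function name `joint_transition` is hypothetical, T1 is copied from the matrix derived earlier, and T2 contains only the single assumed entry T2(1, 0) = 0.3, because no other values of T2 are given. Note that the product 0.2 ⋅ 0.3 evaluates to 0.06.

```python
# T1 as derived from P1 (rows i, columns j, both starting at 0).
T1 = [
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.7, 0.3, 0.0, 0.0, 0.0],
    [0.5, 0.2, 0.3, 0.0, 0.0],
    [0.3, 0.2, 0.2, 0.3, 0.0],
    [0.2, 0.1, 0.2, 0.2, 0.3],
]

# Only T2(1, 0) = 0.3 is known, and even that is an assumption in the text.
# All other entries of T2 are unknown and deliberately left out.
T2 = {(1, 0): 0.3}


def joint_transition(s, a, s_next):
    """T((s1, s2), (a1, a2), (s1', s2')) = T1(s1 + a1, s1') * T2(s2 + a2, s2')."""
    (s1, s2), (a1, a2), (n1, n2) = s, a, s_next
    return T1[s1 + a1][n1] * T2[(s2 + a2, n2)]


p = joint_transition((2, 1), (1, 0), (1, 0))
print(round(p, 2))  # 0.06
```

Writing the joint transition as a product treats the two components as independent given their own state and action, which is what the factored formula expresses.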