【Posted】: 2018-07-07 09:50:11
【Question】:
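My teacher posed the following problem: Consider the following MDP with 3 states and rewards. There are two possible actions, RED and BLUE. The state transition probabilities are given on the edges, and S2 is a terminal state. Assume that the initial policy is: π(S0) = B; π(S1) = R.
We were asked for which values of γ (0 < γ < 1) the optimal policy would be: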
(a) π∗(S0) = R; π∗(S1) = B;
(b) π∗(S0) = B; π∗(S1) = R;
(c) π∗(S0) = R; π∗(S1) = R;
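I have shown that for (a) the answer is γ = 0.1, and I could not find γ values for (b) and (c). The teacher said that for (b) any γ > 0.98 would work, and for (c) γ = 0.5. I think he is wrong, and I have written the following python script, which follows the algorithm in the textbook (Russell and Norvig, AIMA); indeed, for every γ value I tried, the only policy I get is (a). However, the teacher says he is not wrong and that my script must be buggy. How can I definitively show that such policies are impossible?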
S0 = "S0"
S1 = "S1"
S2 = "S2"
BLUE = "blue"
RED = "red"
gamma = 0.5  # TODO MODIFY GAMMA HERE

# P(s'|s,a), keyed as (destination, start, action)
P_destination_start_action = {
    (S0, S0, BLUE): 0.5, (S0, S0, RED): 0.9, (S0, S1, BLUE): 0.8, (S0, S1, RED): 0,   (S0, S2, BLUE): 0, (S0, S2, RED): 0,
    (S1, S0, BLUE): 0.5, (S1, S0, RED): 0,   (S1, S1, BLUE): 0.2, (S1, S1, RED): 0.6, (S1, S2, BLUE): 0, (S1, S2, RED): 0,
    (S2, S0, BLUE): 0,   (S2, S0, RED): 0.1, (S2, S1, BLUE): 0,   (S2, S1, RED): 0.4, (S2, S2, BLUE): 1, (S2, S2, RED): 1,
}


class MDP:
    def __init__(self):
        self.states = [S0, S1, S2]
        self.actions = [BLUE, RED]
        self.P_dest_start_action = P_destination_start_action
        self.rewards = {S0: -2, S1: -5, S2: 0}


def POLICY_EVALUATION(policy_vec, utility_vec, mdp):
    # Applies one Bellman backup per state under the fixed policy
    # (a single sweep over the states, using the module-level gamma).
    new_utility_vector = {}
    for s in mdp.states:
        to_sum = [(mdp.P_dest_start_action[(s_tag, s, policy_vec[s])] * utility_vec[s_tag])
                  for s_tag in mdp.states]
        new_utility_vector[s] = mdp.rewards[s] + gamma * sum(to_sum)
    return new_utility_vector


def POLICY_ITERATION(mdp):
    utility_vector = {state: 0 for state in mdp.states}
    policy_vector = {S0: BLUE, S1: RED, S2: RED}
    unchanged = False
    while not unchanged:
        utility_vector = POLICY_EVALUATION(policy_vector, utility_vector, mdp)
        unchanged = True
        for s in mdp.states:
            # Expected next-state utility for each action from state s
            BLUE_sum = sum([(mdp.P_dest_start_action[(s_tag, s, BLUE)] * utility_vector[s_tag])
                            for s_tag in mdp.states])
            RED_sum = sum([(mdp.P_dest_start_action[(s_tag, s, RED)] * utility_vector[s_tag])
                           for s_tag in mdp.states])
            # Switch to the other action only if it is strictly better
            if policy_vector[s] == RED and BLUE_sum > RED_sum:
                policy_vector[s] = BLUE
                unchanged = False
            elif policy_vector[s] == BLUE and RED_sum > BLUE_sum:
                policy_vector[s] = RED
                unchanged = False
    return policy_vector


if __name__ == "__main__":
    Q2_mdp = MDP()
    new_policy_vec = POLICY_ITERATION(Q2_mdp)
    print("===========================END===============================")
    print("S_0 policy =", new_policy_vec[S0], " ,S_1 Policy =", new_policy_vec[S1])
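
One way to check the claim independently of the iteration loop is to evaluate each of the four deterministic policies exactly and test whether it is greedy with respect to its own utilities, which is the Bellman optimality condition for 0 < γ < 1. The following is only a sketch of that cross-check, not the textbook algorithm. It assumes the same transition probabilities and rewards as the script above, restated in a hypothetical `P` dict keyed by (start, action), and it assumes S2 is absorbing with utility 0. The names `evaluate`, `q_value`, and `optimal_policies` are illustrative.

```python
from itertools import product

ACTIONS = ("blue", "red")
# P[(start, action)] = {destination: probability}; assumed to match the table above.
P = {
    ("S0", "blue"): {"S0": 0.5, "S1": 0.5},
    ("S0", "red"): {"S0": 0.9, "S2": 0.1},
    ("S1", "blue"): {"S0": 0.8, "S1": 0.2},
    ("S1", "red"): {"S1": 0.6, "S2": 0.4},
}
R = {"S0": -2.0, "S1": -5.0, "S2": 0.0}


def evaluate(policy, gamma):
    """Exact utilities of a fixed policy: solve (I - gamma * P_pi) U = R for S0 and S1."""
    row0 = P[("S0", policy["S0"])]
    row1 = P[("S1", policy["S1"])]
    a = 1 - gamma * row0.get("S0", 0.0)
    b = -gamma * row0.get("S1", 0.0)
    c = -gamma * row1.get("S0", 0.0)
    d = 1 - gamma * row1.get("S1", 0.0)
    det = a * d - b * c
    # Cramer's rule for the 2x2 system; U(S2) is 0 because S2 is terminal with reward 0.
    u0 = (R["S0"] * d - b * R["S1"]) / det
    u1 = (a * R["S1"] - c * R["S0"]) / det
    return {"S0": u0, "S1": u1, "S2": 0.0}


def q_value(state, action, utilities, gamma):
    """R(s) + gamma * sum over s' of P(s'|s,a) * U(s')."""
    expected = sum(prob * utilities[dest] for dest, prob in P[(state, action)].items())
    return R[state] + gamma * expected


def optimal_policies(gamma, tol=1e-9):
    """Return every (S0 action, S1 action) pair that is greedy w.r.t. its own utilities."""
    found = []
    for a0, a1 in product(ACTIONS, repeat=2):
        policy = {"S0": a0, "S1": a1}
        u = evaluate(policy, gamma)
        if all(
            q_value(s, policy[s], u, gamma) >= q_value(s, other, u, gamma) - tol
            for s in policy
            for other in ACTIONS
        ):
            found.append((a0, a1))
    return found


if __name__ == "__main__":
    for g in (0.1, 0.5, 0.99):
        print("gamma =", g, "-> optimal (S0, S1) actions:", optimal_policies(g))
```

Because this solves the policy's linear system exactly instead of applying a single backup, its result for a given γ does not depend on the starting policy or the starting utilities, so it can be compared directly against the output of the script above.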
【Comments】:
Tags: python machine-learning reinforcement-learning markov