Author: hhh5460

Original post: https://www.cnblogs.com/hhh5460/p/10134855.html

Problem setting

A 2*2 maze with one entrance, one exit, and one trap, as shown in the figure.

(Figure: the 2*2 maze. Image source: https://jizhi.im/blog/post/intro_q_learning)

This is a two-dimensional problem, but we can flatten it into a one-dimensional one by numbering the cells (see the small sketch below).
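Concretely, number the four cells row by row: the cell at (row, col) becomes the single integer state row * 2 + col, which is exactly the 0..3 numbering used in the code further down. A minimal sketch of this mapping (the helper names to_state and to_cell are my own, not from the original post):

N_COLS = 2

def to_state(row, col):
    '''Flatten a 2-D cell (row, col) into a 1-D state index 0..3.'''
    return row * N_COLS + col

def to_cell(state):
    '''Recover (row, col) from a 1-D state index.'''
    return divmod(state, N_COLS)

# entrance (0,0) -> state 0, top-right (0,1) -> 1, trap (1,0) -> 2, exit (1,1) -> 3
assert to_state(1, 0) == 2 and to_state(1, 1) == 3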

Thanks to https://jizhi.im/blog/post/intro_q_learning. I had read countless articles and countless pieces of code online without ever getting the point! Only after seeing the three matrices in that post, reward, transition_matrix and valid_actions, did I truly understand how the q-learning algorithm operates and how to implement it!

Here is Kaiser's code up front; it will make the q-learning algorithm click in seconds. I have also polished it a little:

import numpy as np
import random

'''
A 2*2 maze
-----------------------
| entrance |          |
-----------------------
|   trap   |   exit   |
-----------------------
# Source: https://jizhi.im/blog/post/intro_q_learning

Each cell is one state, and every state has 5 actions: up, down, left, right, stay.

Task: learn a path from the entrance to the exit.
'''

gamma = 0.7

#                    u,   d,   l,  r,  n
reward = np.array([( 0, -10,   0, -1, -1), #0, state 0
                   ( 0,  10,  -1,  0, -1), #1
                   (-1,   0,   0, 10, -1), #2
                   (-1,   0, -10,  0, 10)],#3
                   dtype=[('u',float),('d',float),('l',float),('r',float),('n',float)])

q_matrix = np.zeros((4, ),
                    dtype=[('u',float),('d',float),('l',float),('r',float),('n',float)])

transition_matrix = np.array([(-1,  2, -1,  1, 0), # e.g. state:0, action:'d' --> next_state:2
                              (-1,  3,  0, -1, 1),
                              ( 0, -1, -1,  3, 2),
                              ( 1, -1,  2, -1, 3)],
                              dtype=[('u',int),('d',int),('l',int),('r',int),('n',int)])

valid_actions = np.array([['d', 'r', 'n'], #0, state 0
                          ['d', 'l', 'n'], #1
                          ['u', 'r', 'n'], #2
                          ['u', 'l', 'n']])#3


for i in range(1000):           # 1000 training episodes
    current_state = 0           # each episode starts at the entrance
    while current_state != 3:   # stop once the exit is reached
        current_action = random.choice(valid_actions[current_state]) # pure exploration, no exploitation
        
        next_state = transition_matrix[current_state][current_action]
        next_reward = reward[current_state][current_action]
        next_q_values = [q_matrix[next_state][next_action] for next_action in valid_actions[next_state]] # candidates for the max over the next state's actions
        
        q_matrix[current_state][current_action] = next_reward + gamma * max(next_q_values) # Bellman update (incomplete: no learning rate, which is fine for this deterministic maze)
        current_state = next_state

print('Final Q-table:')
print(q_matrix)
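After training, the Q-table can be read back to recover a path from the entrance to the exit by always taking the valid action with the highest Q-value. The snippet below is my own addition rather than part of Kaiser's code; it reuses the q_matrix, transition_matrix and valid_actions defined above:

# Follow the greedy policy implied by the learned Q-table.
state, path = 0, [0]
while state != 3:
    best_action = max(valid_actions[state], key=lambda a: q_matrix[state][a])
    state = int(transition_matrix[state][best_action])
    path.append(state)
print('Greedy path (states):', path)

With the reward table above and gamma = 0.7, the greedy path should come out as 0 -> 1 -> 3, i.e. entrance, top-right cell, exit.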
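As the comment in the training loop notes, actions are chosen purely at random, so the agent only explores and never exploits what it has already learned; that is acceptable for a 4-state maze but slow in general. If you want to mix in exploitation, a standard epsilon-greedy choice could replace the random.choice line. This is my own sketch, not part of the original post:

def choose_action(state, epsilon=0.1):
    '''With probability epsilon explore at random, otherwise exploit the current Q-table.'''
    if random.random() < epsilon:
        return random.choice(valid_actions[state])                      # explore
    return max(valid_actions[state], key=lambda a: q_matrix[state][a])  # exploit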
