本文作者:hhh5460
本文地址:https://www.cnblogs.com/hhh5460/p/10134855.html
Problem setting
A 2*2 maze with one entrance, one exit, and one trap, as shown in the figure.
(Image source: https://jizhi.im/blog/post/intro_q_learning)
This is a two-dimensional problem, but we can reduce the dimensionality and turn it into a one-dimensional one by numbering the four cells and treating each as a state.
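The flattening can be sketched with a pair of hypothetical helper functions (`to_state` / `to_cell` are my names, not part of the original code): a cell at (row, col) in the 2*2 grid maps to a single state index, and back.

```python
# Flatten a (row, col) cell of a 2x2 grid into a single state index,
# and recover the cell again -- this is the "dimensionality reduction"
# described above.
N_COLS = 2  # the maze has 2 columns

def to_state(row, col):
    """(row, col) -> state index, numbered row by row."""
    return row * N_COLS + col

def to_cell(state):
    """state index -> (row, col)."""
    return divmod(state, N_COLS)

# entrance (0,0) -> state 0, trap (1,0) -> state 2, exit (1,1) -> state 3
```

With this numbering, the entrance is state 0, the trap is state 2, and the exit is state 3, matching the matrices in the code below.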
Thanks to: https://jizhi.im/blog/post/intro_q_learning. I had read countless articles and code samples online without grasping the idea; only after seeing the three matrices in that post — reward, transition_matrix, and valid_actions — did I truly understand how the Q-learning algorithm operates and how to implement it.
Here is that post's code for a first look (lightly polished by me) — it should make the Q-learning algorithm click immediately:
import numpy as np
import random

'''
2*2 maze
-----------------
| start |       |
-----------------
| trap  | exit  |
-----------------
# Source: https://jizhi.im/blog/post/intro_q_learning
Each cell is a state; in every state there are 5 actions: up, down, left, right, stay.
Task: learn a path from the start to the exit.
'''

gamma = 0.7

#                     u    d    l   r   n   (up, down, left, right, stay)
reward = np.array([(  0, -10,   0, -1, -1),  # state 0
                   (  0,  10,  -1,  0, -1),  # state 1
                   ( -1,   0,   0, 10, -1),  # state 2
                   ( -1,   0, -10,  0, 10)], # state 3
                  dtype=[('u',float),('d',float),('l',float),('r',float),('n',float)])

q_matrix = np.zeros((4,),
                    dtype=[('u',float),('d',float),('l',float),('r',float),('n',float)])

transition_matrix = np.array([(-1,  2, -1,  1, 0),  # e.g. state 0, action 'd' --> next state 2
                              (-1,  3,  0, -1, 1),
                              ( 0, -1, -1,  3, 2),
                              ( 1, -1,  2, -1, 3)],
                             dtype=[('u',int),('d',int),('l',int),('r',int),('n',int)])

valid_actions = np.array([['d', 'r', 'n'],  # state 0
                          ['d', 'l', 'n'],  # state 1
                          ['u', 'r', 'n'],  # state 2
                          ['u', 'l', 'n']]) # state 3

for i in range(1000):
    current_state = 0
    while current_state != 3:
        current_action = random.choice(valid_actions[current_state])  # pure exploration, no exploitation
        next_state = transition_matrix[current_state][current_action]
        next_reward = reward[current_state][current_action]
        next_q_values = [q_matrix[next_state][next_action]
                         for next_action in valid_actions[next_state]]  # candidates for the max
        q_matrix[current_state][current_action] = next_reward + gamma * max(next_q_values)  # Bellman equation (simplified)
        current_state = next_state

print('Final Q-table:')
print(q_matrix)
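The update above is marked "simplified" because it overwrites Q(s, a) with the target directly. The full Q-learning update moves Q(s, a) toward the target by a learning rate alpha: Q(s,a) ← Q(s,a) + alpha * (r + gamma * max Q(s',·) − Q(s,a)). The sketch below shows this on the same maze, using plain dicts instead of structured arrays; the values of alpha and the epsilon-greedy exploration rate are my assumptions, not from the original post.

```python
# Full Q-learning update with a learning rate alpha and epsilon-greedy
# action selection, on the same 2*2 maze as above.
import random

gamma, alpha, epsilon = 0.7, 0.5, 0.2  # alpha and epsilon are assumed values

# state -> {action: (next_state, reward)}; mirrors reward / transition_matrix above
transitions = {
    0: {'d': (2, -10), 'r': (1,  -1), 'n': (0, -1)},
    1: {'d': (3,  10), 'l': (0,  -1), 'n': (1, -1)},
    2: {'u': (0,  -1), 'r': (3,  10), 'n': (2, -1)},
    3: {'u': (1,  -1), 'l': (2, -10), 'n': (3, 10)},
}

Q = {s: {a: 0.0 for a in acts} for s, acts in transitions.items()}

random.seed(0)
for episode in range(1000):
    s = 0
    while s != 3:
        acts = list(transitions[s])
        # epsilon-greedy: mostly exploit the best known action, sometimes explore
        if random.random() < epsilon:
            a = random.choice(acts)
        else:
            a = max(acts, key=lambda act: Q[s][act])
        s2, r = transitions[s][a]
        # full Bellman update: move Q(s, a) toward the target by a step of alpha
        Q[s][a] += alpha * (r + gamma * max(Q[s2].values()) - Q[s][a])
        s = s2

print(Q[0])  # 'r' (go right, then down) should end up with the highest value
```

Mixing exploitation into action selection makes the agent spend most of its time refining the values along the path it currently believes is best, instead of wandering uniformly as in the pure-exploration version above.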