Reinforcement algorithm seems to learn but script is getting stuck and agent is not resetting

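【Question title】: Reinforcement algorithm seems to learn but script is getting stuck and agent is not resetting
【Posted】: 2018-05-21 17:02:22
【Question】:

I am currently working on a reinforcement learning algorithm that uses a Q-table and turtle graphics. The agent sits in a grid of 6 squares and has to reach the rightmost square as its goal. I have built the environment and I run my algorithm so that the agent can learn. I am facing the following problems. The script eventually gets stuck, and as a result I only seem to get through one episode. The agent (the blue marker) flashes around the (0,0) coordinate even though I have set specific coordinates for it. Finally, the agent leaves a trail of all its steps behind. My logic seems fine, but I cannot pin down what is causing these problems.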

""" Basic Reinforcement Learning environment using Turtle Graphics """

#imported libraries required for this project
import turtle
import pandas as pd
import numpy as np
import time
#import numpy as np


""" Environment """

#initialise the screen using a turtle object
wn = turtle.Screen()
wn.bgcolor("black")
wn.title("Basic_Reinforcement_Learning_Environment")
#wn.bgpic("game_background.gif")

#this function initializes the 2D environment
def grid(size): 
    #this function creates one square
    def create_square(size,color="white"):
        greg.color(color)
        greg.pd()
        for i in range(4):
            greg.fd(size)
            greg.lt(90)
        greg.pu()
        greg.fd(size)
    #this function creates a row of squares based on simply one square
    def row(size,color="white"):
        for i in range(6):
            create_square(size)
        greg.hideturtle()

    row(size)       

greg = turtle.Turtle()
greg.speed(0)
greg.setposition(-150,0)
grid(50)


def player_set(S):
    player = turtle.Turtle()
    player.color("blue")
    player.shape("circle")
    player.penup()
    player.speed(0)
    player.setposition(S)
    player.setheading(90)

N_STATES = 6   # the length of the 1 dimensional world
ACTIONS = ['left', 'right']     # available actions
EPSILON = 0.9   # greedy policy: probability of taking the greedy action
ALPHA = 0.1     # learning rate
GAMMA = 0.9    # discount factor
MAX_EPISODES = 13   # maximum episodes
FRESH_TIME = 0.3    # fresh time for one move

#this function builds a Q-table and initializes all values to 0
def build_q_table(n_states, actions):
    table = pd.DataFrame(
        np.zeros((n_states, len(actions))),     # q_table initial values
        columns=actions,    # action names
    )
    # print(table)    # show table
    return table

def choose_action(state, q_table):
    # This is how to choose an action
    state_actions = q_table.iloc[state, :]
    # act non-greedy or state-action have no value
    if (np.random.uniform() > EPSILON) or ((state_actions == 0).all()): 
        action_name = np.random.choice(ACTIONS)
    else:   # act greedy
        # replace argmax to idxmax as argmax means a different function 
        action_name = state_actions.idxmax()    
    return action_name


def get_env_feedback(S, A):
    # This is how agent will interact with the environment
    if A == 'right':    # move right
        if S == N_STATES - 2:   # terminate
            S_ = 'terminal'
            R = 1
        else:
            S_ = S + 1
            R = 0
    else:   # move left
        R = 0
        if S == 0:
            S_ = S  # reach the wall
        else:
            S_ = S - 1
    return S_, R

def update_env(S, episode, step_counter):            
    coords = [(-125,25),(-75,25),(-25,25),(25,25),(75,25),(125,25)]

    if S == 'terminal':
        interaction = 'Episode %s: total_steps = %s' %(episode+1, step_counter)
        print('\r{}'.format(interaction), end='')
        time.sleep(2)
        print('\r', end='')
    else:
        player_set(coords[S])
        time.sleep(FRESH_TIME)


def rl():
    q_table = build_q_table(N_STATES, ACTIONS)
    for episode in range(MAX_EPISODES):
        step_counter = 0
        S = 0
        is_terminated = False
        update_env(S, episode, step_counter)
        while not is_terminated:
            A = choose_action(S, q_table)
            S_, R = get_env_feedback(S,A)
            q_predict = q_table.loc[S,A]
            if S_ != 'terminal':
                q_target = R + GAMMA * q_table.iloc[S_, :].max() 
            else:
                q_target = R
                is_terminated = True

            q_table.loc[S, A] += ALPHA * (q_target - q_predict)
            S = S_
            update_env(S, episode, step_counter+1)
            step_counter += 1
        return q_table

rl()

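Edit: I moved the return statement, and the algorithm now runs through all 13 episodes. However, I cannot get the player token (the agent) to stop leaving a trail of every step it has taken, and I would like it to reset after each episode. This may be a scoping issue.

Final solution: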

""" Basic Reinforcement Learning environment using Turtle Graphics """

#imported libraries required for this project
import turtle
import pandas as pd
import numpy as np
import time
#import numpy as np


""" Environment """

#initialise the screen using a turtle object
wn = turtle.Screen()
wn.bgcolor("black")
wn.title("Basic_Reinforcement_Learning_Environment")
#wn.bgpic("game_background.gif")

#this function initializes the 2D environment
def grid(size): 
    #this function creates one square
    def create_square(size,color="white"):
        greg.color(color)
        greg.pd()
        for i in range(4):
            greg.fd(size)
            greg.lt(90)
        greg.pu()
        greg.fd(size)
    #this function creates a row of squares based on simply one square
    def row(size,color="white"):
        for i in range(6):
            create_square(size)
        greg.hideturtle()

    row(size)       

greg = turtle.Turtle()
greg.speed(0)
greg.setposition(-150,0)
grid(50)

player = turtle.Turtle()
player.color("blue")
player.shape("circle")
player.penup()
player.speed(0)
player.setheading(90)

def player_set(S):
    player.setposition(S)



N_STATES = 6   # the length of the 1 dimensional world
ACTIONS = ['left', 'right']     # available actions
EPSILON = 0.9   # greedy policy: probability of taking the greedy action
ALPHA = 0.1     # learning rate
GAMMA = 0.9    # discount factor
MAX_EPISODES = 13   # maximum episodes
FRESH_TIME = 0.3    # fresh time for one move

#this function builds a Q-table and initializes all values to 0
def build_q_table(n_states, actions):
    table = pd.DataFrame(
        np.zeros((n_states, len(actions))),     # q_table initial values
        columns=actions,    # action names
    )
    # print(table)    # show table
    return table

def choose_action(state, q_table):
    # This is how to choose an action
    state_actions = q_table.iloc[state, :]
    # act non-greedy or state-action have no value
    if (np.random.uniform() > EPSILON) or ((state_actions == 0).all()): 
        action_name = np.random.choice(ACTIONS)
    else:   # act greedy
        # replace argmax to idxmax as argmax means a different function 
        action_name = state_actions.idxmax()    
    return action_name


def get_env_feedback(S, A):
    # This is how agent will interact with the environment
    if A == 'right':    # move right
        if S == N_STATES - 2:   # terminate
            S_ = 'terminal'
            R = 1
        else:
            S_ = S + 1
            R = 0
    else:   # move left
        R = 0
        if S == 0:
            S_ = S  # reach the wall
        else:
            S_ = S - 1
    return S_, R

def update_env(S, episode, step_counter):            
    coords = [(-125,25),(-75,25),(-25,25),(25,25),(75,25),(125,25)]

    if S == 'terminal':
        interaction = 'Episode %s: total_steps = %s' %(episode+1, step_counter)
        print('\n{}'.format(interaction), end='')
        time.sleep(2)
        print('\r', end='')
    else:
        player_set(coords[S])
        time.sleep(FRESH_TIME)


def rl():
    q_table = build_q_table(N_STATES, ACTIONS)
    for episode in range(MAX_EPISODES):
        step_counter = 0
        S = 0
        is_terminated = False
        update_env(S, episode, step_counter)
        while not is_terminated:
            A = choose_action(S, q_table)
            S_, R = get_env_feedback(S,A)
            q_predict = q_table.loc[S,A]
            if S_ != 'terminal':
                q_target = R + GAMMA * q_table.iloc[S_, :].max() 
            else:
                q_target = R
                is_terminated = True

            q_table.loc[S, A] += ALPHA * (q_target - q_predict)
            S = S_
            update_env(S, episode, step_counter+1)
            step_counter += 1
    return q_table

rl()

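【Comments】:

Ha, I thought this code looked familiar. Hope you get it working in the end!
Yes, the previous version felt too complicated, so I started over with an agent that can only move left and right. But I am having trouble resetting the agent!

Tags: python algorithm turtle-graphics reinforcement-learning

【Answer 1】:

In the following code snippet, copied from your question: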

def rl():
    q_table = build_q_table(N_STATES, ACTIONS)
    for episode in range(MAX_EPISODES):
        step_counter = 0
        S = 0
        is_terminated = False
        update_env(S, episode, step_counter)
        while not is_terminated:
            # ...
            # <snip> 
            # ...
        return q_table

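Your rl() function has its return statement after the while loop that steps through a single episode, but still inside the for loop that iterates over episodes. This means the function effectively completes only one episode and then returns (which terminates rl()) before it has a chance to start the second episode.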
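To make the effect of the indentation concrete, here is a minimal sketch that is separate from the question's code. The function names and the `steps_left` counter are made up for illustration; the counter simply stands in for the per-episode while loop.

```python
def run_episodes_early_return(max_episodes):
    """The return sits inside the for loop, as in the question."""
    completed = 0
    for episode in range(max_episodes):
        steps_left = 3  # stand-in for the time steps of one episode
        while steps_left > 0:
            steps_left -= 1
        completed += 1
        return completed  # leaves the function during the first iteration


def run_episodes(max_episodes):
    """The return is dedented one level, so it runs after the for loop."""
    completed = 0
    for episode in range(max_episodes):
        steps_left = 3
        while steps_left > 0:
            steps_left -= 1
        completed += 1
    return completed  # reached only once every episode has finished


print(run_episodes_early_return(13))  # 1
print(run_episodes(13))               # 13
```

The only difference between the two functions is the indentation level of the return line.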

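Regarding the update to the question:

Edit: I moved the return statement, and the algorithm now runs through all 13 episodes. However, I cannot get the player token (the agent) to stop leaving a trail of every step it has taken, and I would like it to reset after each episode. This may be a scoping issue.

I am not 100% sure, because I am not familiar with the turtle-graphics framework. However, I did notice that update_env() calls player_set(coords[S]) whenever the player's position needs to be updated. That function is implemented as follows: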

def player_set(S):
    player = turtle.Turtle()
    player.color("blue")
    player.shape("circle")
    player.penup()
    player.speed(0)
    player.setposition(S)
    player.setheading(90)

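It looks to me like this function creates a brand-new player object at the new position every time it is called, instead of updating the position of the player object that already exists. So whenever the state changes, a completely new player object appears while the old one stays where it was. A solution would probably involve creating the player object only once, and then writing a separate function that updates its position without creating a new object.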
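The difference between the two patterns can be sketched without opening a turtle window. The `Marker` class below is a hypothetical stand-in for turtle.Turtle that only counts how many objects have been created; it is not part of the turtle module. A real turtle also starts at the origin when it is created, which would likely explain the flicker around (0,0) as well.

```python
class Marker:
    """Hypothetical stand-in for turtle.Turtle; counts created objects."""

    created = 0

    def __init__(self):
        Marker.created += 1
        self.position = (0, 0)  # like a new turtle, it starts at the origin

    def setposition(self, xy):
        self.position = xy


def player_set_new_each_time(xy):
    # Mirrors the player_set() from the question: a fresh object per call.
    marker = Marker()
    marker.setposition(xy)
    return marker


coords = [(-125, 25), (-75, 25), (-25, 25)]

# Pattern from the question: one object per step, old ones stay behind.
leftover = [player_set_new_each_time(xy) for xy in coords]
created_per_call = Marker.created

# Suggested pattern: create the object once, then only move it.
Marker.created = 0
player = Marker()


def player_set(xy):
    player.setposition(xy)


for xy in coords:
    player_set(xy)
created_once = Marker.created

print(created_per_call, created_once)  # 3 1
```

With the real library, the same idea would mean calling turtle.Turtle() once at module level and keeping only player.setposition(...) inside player_set().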

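【Comments】:

Thank you so much for spotting that! I have changed it now. Any ideas about the rest of the problem, which I have updated in the original question? Thanks
@HimansuOdedra I have edited what I suspect is causing that problem into the answer. I am not sure though, since I am personally not familiar with turtle-graphics.
Great, I think that is the problem! I will try to find a working solution based on your logic.
Hey, I used what you said and updated the original question with the solution. Finally, thank you!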