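Monte Carlo Tree Search: Implementation for Tic-Tac-Toe (answers)

【Question title】: Monte Carlo Tree Search: Implementation for Tic-Tac-Toe
【Posted】: 2014-07-11 06:37:36
【Question description】:

Edit: I have uploaded the full source code in case you want to see whether you can make the AI perform better: https://www.dropbox.com/s/ous72hidygbnqv6/MCTS_TTT.rar

Edit: The search space is searched, and the moves that lead to a loss are found. But because of the UCT algorithm, the moves that lead to a loss are not visited very often.

To learn about MCTS (Monte Carlo Tree Search), I used the algorithm to build an AI for the classic game of tic-tac-toe. I implemented the algorithm with the following design:

The tree policy is based on UCT, and the default policy is to play random moves until the game ends. What I have observed with my implementation is that the computer sometimes makes bad moves because it fails to "see" that a particular move leads directly to a loss.

For example: notice how action 6 (red square) is valued slightly higher than the blue square, so the computer marks that spot. I think this is because the default policy is based on random moves, so there is a good chance that the human will not put a "2" in the blue box. And if the player does not put a 2 in the blue box, the computer is guaranteed a win.

My questions

1) Is this a known issue with MCTS, or is it the result of a faulty implementation?

2) What are possible solutions? I am thinking about restricting the moves in the selection phase, but I am not sure :-)

The code for the core MCTS: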

    //THE EXECUTING FUNCTION
    public unsafe byte GetBestMove(Game game, int player, TreeView tv)
    {

        //Setup root and initial variables
        Node root = new Node(null, 0, Opponent(player));
        int startPlayer = player;

        helper.CopyBytes(root.state, game.board);

        //four phases: descent, roll-out, update and growth done iteratively X times
        //-----------------------------------------------------------------------------------------------------
        for (int iteration = 0; iteration < 1000; iteration++)
        {
            Node current = Selection(root, game);
            int value = Rollout(current, game, startPlayer);
            Update(current, value);
        }

        //Restore game state and return move with highest value
        helper.CopyBytes(game.board, root.state);

        //Draw tree
        DrawTree(tv, root);

        //return root.children.Aggregate((i1, i2) => i1.visits > i2.visits ? i1 : i2).action;
        return BestChildUCB(root, 0).action;
    }

    //#1. Select a node if 1: we have more valid feasible moves or 2: it is terminal 
    public Node Selection(Node current, Game game)
    {
        while (!game.IsTerminal(current.state))
        {
            List<byte> validMoves = game.GetValidMoves(current.state);

            if (validMoves.Count > current.children.Count)
                return Expand(current, game);
            else
                current = BestChildUCB(current, 1.44);
        }

        return current;
    }

    //#1. Helper
    public Node BestChildUCB(Node current, double C)
    {
        Node bestChild = null;
        double best = double.NegativeInfinity;

        foreach (Node child in current.children)
        {
            double UCB1 = ((double)child.value / (double)child.visits) + C * Math.Sqrt((2.0 * Math.Log((double)current.visits)) / (double)child.visits);

            if (UCB1 > best)
            {
                bestChild = child;
                best = UCB1;
            }
        }

        return bestChild;
    }

    //#2. Expand a node by creating a new move and returning the node
    public Node Expand(Node current, Game game)
    {
        //Copy current state to the game
        helper.CopyBytes(game.board, current.state);

        List<byte> validMoves = game.GetValidMoves(current.state);

        for (int i = 0; i < validMoves.Count; i++)
        {
            //We already have evaluated this move
            if (current.children.Exists(a => a.action == validMoves[i]))
                continue;

            int playerActing = Opponent(current.PlayerTookAction);

            Node node = new Node(current, validMoves[i], playerActing);
            current.children.Add(node);

            //Do the move in the game and save it to the child node
            game.Mark(playerActing, validMoves[i]);
            helper.CopyBytes(node.state, game.board);

            //Return to the previous game state
            helper.CopyBytes(game.board, current.state);

            return node;
        }

        throw new Exception("Error");
    }

    //#3. Roll-out. Simulate a game with a given policy and return the value
    public int Rollout(Node current, Game game, int startPlayer)
    {
        Random r = new Random(1337);
        helper.CopyBytes(game.board, current.state);
        int player = Opponent(current.PlayerTookAction);

        //Do the policy until a winner is found for the first (change?) node added
        while (game.GetWinner() == 0)
        {
            //Random
            List<byte> moves = game.GetValidMoves();
            byte move = moves[r.Next(0, moves.Count)];
            game.Mark(player, move);
            player = Opponent(player);
        }

        if (game.GetWinner() == startPlayer)
            return 1;

        return 0;
    }

    //#4. Update
    public unsafe void Update(Node current, int value)
    {
        do
        {
            current.visits++;
            current.value += value;
            current = current.parent;
        }
        while (current != null);
    }

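【Question comments】:

I don't understand the rationale for adding C * Math.Sqrt((2.0 * Math.Log((double)current.visits)) / (double)child.visits) to the UCB line. What is this term for? What happens if you just remove that part?
It is coded according to cameronius.com/cv/mcts-survey-master.pdf (page 9), BestChild. If I remove it, the AI still makes "stupid" moves.
The paper mentions that the algorithm is suitable for "depth-limited minimax search". In minimax, you apply the same scoring heuristic to your own moves and to the opponent's. I have never heard of an AI that assumes it is playing against an opponent that moves randomly.
@Groo: If I understand it correctly, Monte Carlo Tree Search does not use heuristics (it can be used in games such as Go, where domain knowledge is hard to specify). In the roll-out phase, a specific policy is used to simulate the game, and that policy is often (again, if I understand the algorithm correctly) random moves.
Is this on GitHub anywhere?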
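
On the exploration term asked about in the comments above: the square-root part is an exploration bonus that shrinks as a child is visited more often, so rarely visited children still get tried. With `C = 0` only the mean reward is left, which is what the question's `BestChildUCB(root, 0)` uses for the final move. The helper below is a minimal sketch of the same expression, not the asker's code; the class and method names are made up for illustration, and treating unvisited children as infinitely attractive is an assumption.

```csharp
using System;

// Hypothetical helper that isolates the two parts of the UCB1 score
// used in the question's BestChildUCB.
public static class UcbSketch
{
    public static double Score(double value, int visits, int parentVisits, double c)
    {
        // Assumption: an unvisited child is always tried first.
        if (visits == 0) return double.PositiveInfinity;

        double exploitation = value / visits;  // mean reward seen so far
        double exploration = c * Math.Sqrt(2.0 * Math.Log(parentVisits) / visits);
        return exploitation + exploration;
    }
}
```

For example, with 100 parent visits, a child with 1 win in 2 visits outscores a child with 50 wins in 90 visits when `c = 1.44`, but loses to it when `c = 0`.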

Tags: c# algorithm artificial-intelligence tic-tac-toe montecarlo

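【Solution 1】:

I don't think your own answer should be marked as accepted. For tic-tac-toe the search space is relatively small, and the optimal action should be found within a reasonable number of iterations.

It looks like your update function (backpropagation) adds the same reward to nodes at every level of the tree. That is not correct, because the player to move is different at different tree levels.

I suggest you take a look at the backpropagation step of the UCT method in this example: http://mcts.ai/code/python.html

You should update each node's total reward with the reward as seen by the player who just moved at that level (node.playerJustMoved in the example).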
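
A minimal sketch of what such a perspective-aware update could look like in C#. It is not the asker's code: `TreeNode` is a hypothetical stand-in for the question's `Node`, the roll-out is assumed to return the winner (1, 2, or 0 for a draw) instead of a 0/1 flag, and scoring a draw as 0.5 is an assumption.

```csharp
using System;

// Hypothetical stand-in for the question's Node class.
public class TreeNode
{
    public TreeNode parent;
    public int PlayerTookAction;  // the player whose move led to this node
    public int visits;
    public double value;          // total reward from PlayerTookAction's point of view

    public TreeNode(TreeNode parent, int playerTookAction)
    {
        this.parent = parent;
        this.PlayerTookAction = playerTookAction;
    }
}

public static class BackpropSketch
{
    // winner: 1 or 2 for a win, 0 for a draw (assumed roll-out result).
    public static void Update(TreeNode current, int winner)
    {
        while (current != null)
        {
            current.visits++;
            if (winner == 0)
                current.value += 0.5;   // assumption: a draw counts as half a win
            else if (winner == current.PlayerTookAction)
                current.value += 1.0;   // reward only the player who moved into this node
            current = current.parent;
        }
    }
}
```

With rewards stored this way, maximizing `value / visits` at a node picks the move that is best for the player choosing at that node, so the opponent's levels are no longer scored from the AI's point of view.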

【Comments】:

【Solution 2】:

OK, I solved the problem by adding this code:

        //If this move is terminal and the opponent wins, this means we have 
        //previously made a move where the opponent can always find a move to win.. not good
        if (game.GetWinner() == Opponent(startPlayer))
        {
            current.parent.value = int.MinValue;
            return 0;
        }

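I think the problem was that the search space was too small. This code ensures that even if selection picks a move that is actually terminal, that move is never chosen and the resources are spent exploring other moves instead :).

Now AI vs. AI always ends in a draw, and a human cannot beat the AI :-)

【Comments】:

The link at the top of this page is dead. Could you upload the whole project somewhere and share a new link? I am planning to study your example and then extend it to build an AI for a card game.
You can download it here: drive.google.com/file/d/0B6Fm7aj1SzBlWGI4bXRzZXBJNTA/… (I switched from Dropbox to Google Drive a while ago). Even though I got the AI working, I am not sure I followed MCTS to the letter. If you make progress on the AI, I would appreciate it if you shared it (or pointed out my mistakes :-)).
The AI doesn't seem to perform well. Consider the following board: 1-0-2 2-0-0 1-0-0. "Compute P1" should now yield a 1 in the bottom-right cell, but it suggests the middle-right cell instead. Is this something you already know about? Or is it a result of not implementing MCTS correctly, as you suggest?
I haven't touched the code since 2014, so I'm afraid I don't remember it and don't have time to look at it. But if you do create a better AI, please let me know and I will update this post.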

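【Solution 3】:

With any random-based heuristic, it is possible that you simply do not search a representative sample of the game space. For example, you could in theory sample exactly the same sequence 100 times at random and completely ignore the neighboring branch that loses. This sets it apart from more typical search algorithms, which attempt to examine every move.

However, it is much more likely that this is a faulty implementation. The game tree of tic-tac-toe is not very large: at most 9! = 362,880 move sequences from the first move, and it shrinks rapidly. So it is improbable that the tree search fails to visit every move within a reasonable number of iterations, and it should therefore find an optimal move.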
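
To put a number on the tree size, the sketch below brute-forces every tic-tac-toe game, stopping a sequence as soon as someone wins or the board is full. The class and method names are made up for illustration. Because games that end early are not played out to nine moves, the count comes out below 9!: the well-known figure is 255,168 complete games.

```csharp
using System;

// Hypothetical brute-force counter for complete tic-tac-toe games.
public static class GameCountSketch
{
    static readonly int[][] Lines =
    {
        new[] {0, 1, 2}, new[] {3, 4, 5}, new[] {6, 7, 8},  // rows
        new[] {0, 3, 6}, new[] {1, 4, 7}, new[] {2, 5, 8},  // columns
        new[] {0, 4, 8}, new[] {2, 4, 6}                    // diagonals
    };

    static bool HasWinner(int[] board)
    {
        foreach (int[] l in Lines)
            if (board[l[0]] != 0 && board[l[0]] == board[l[1]] && board[l[1]] == board[l[2]])
                return true;
        return false;
    }

    // Counts every distinct move sequence that ends in a win or a full board.
    // board holds 0 for empty, 1 or 2 for the players; player is the side to move.
    public static int CountGames(int[] board, int player)
    {
        if (HasWinner(board)) return 1;

        int count = 0;
        bool anyMove = false;
        for (int i = 0; i < 9; i++)
        {
            if (board[i] != 0) continue;
            anyMove = true;
            board[i] = player;
            count += CountGames(board, 3 - player);
            board[i] = 0;  // undo the move
        }
        return anyMove ? count : 1;  // full board without a winner: one drawn game
    }
}
```

Calling `GameCountSketch.CountGames(new int[9], 1)` enumerates the whole tree from the empty board, which is small enough to search exhaustively.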

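Without your code, I can't really comment further.

If I had to guess, I would say that perhaps you are choosing moves by the largest number of wins rather than the largest fraction of wins, which generally biases the choice toward the moves that have been searched most often.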
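
The difference between the two criteria can be sketched as follows. `ChildStats` and the method names are hypothetical, not taken from the question. Note that the question's final `BestChildUCB(root, 0)` call already reduces to the win fraction, so this guess may not apply to the posted code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical summary of one root child after the search.
public class ChildStats
{
    public byte action;
    public int visits;
    public double value;  // accumulated wins
}

public static class FinalChoiceSketch
{
    // Raw win count: favors whichever child was simulated most.
    public static byte ByWinCount(List<ChildStats> children) =>
        children.OrderByDescending(c => c.value).First().action;

    // Win fraction: compares children independently of how often they were tried.
    public static byte ByWinFraction(List<ChildStats> children) =>
        children.Where(c => c.visits > 0)
                .OrderByDescending(c => c.value / c.visits)
                .First().action;
}
```

For instance, a child with 40 wins in 100 visits beats a child with 9 wins in 10 visits on win count, but loses to it on win fraction.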

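【Comments】:

Thanks for your reply. I have added the code to the post if you want to take a look. The search space (including the moves that may lead to a loss) is identified in the tree, but because of the UCT algorithm used for selection, those moves are not visited very often. The expanded tree for the earlier example can be seen here: dropbox.com/s/muwew62f7edaszw/ttt2.png. Taking action 3 can lead to the human choosing action 2, which yields a value of 0. But it can also lead to actions 5, 6 or 8, which yield more value. Note that action 2 was visited only 10 times.

【Solution 4】:

My first guess is that, the way your algorithm works, it chooses the step that is most likely to win the match (the one with the most wins in the end nodes).

If I am correct, your example showing the AI "failing" is therefore not a "bug". This way of valuing moves follows from the opponent's moves being random. The logic fails because it is obvious to the human player which single move wins the match.

Therefore, you should erase every node that has a child node in which the human player wins.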
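
One way to read this suggestion is a one-ply safety filter: before searching, drop every move after which the opponent has an immediate winning reply. The sketch below is one possible interpretation, not the answerer's code. It assumes a 9-cell board array with 0 for empty and 1 or 2 for the players, and all names are made up for illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical one-ply lookahead filter for tic-tac-toe.
public static class SafeMoveSketch
{
    static readonly int[][] Lines =
    {
        new[] {0, 1, 2}, new[] {3, 4, 5}, new[] {6, 7, 8},
        new[] {0, 3, 6}, new[] {1, 4, 7}, new[] {2, 5, 8},
        new[] {0, 4, 8}, new[] {2, 4, 6}
    };

    // Returns 1 or 2 if that player has three in a row, otherwise 0.
    static int Winner(int[] b)
    {
        foreach (int[] l in Lines)
            if (b[l[0]] != 0 && b[l[0]] == b[l[1]] && b[l[1]] == b[l[2]])
                return b[l[0]];
        return 0;
    }

    // Moves for 'player' that do not hand the opponent an immediate win.
    public static List<int> Filter(int[] board, int player)
    {
        int opponent = 3 - player;
        var safe = new List<int>();
        for (int m = 0; m < 9; m++)
        {
            if (board[m] != 0) continue;
            board[m] = player;
            bool opponentCanWin = false;
            if (Winner(board) == 0)  // if we just won, no reply matters
            {
                for (int r = 0; r < 9 && !opponentCanWin; r++)
                {
                    if (board[r] != 0) continue;
                    board[r] = opponent;
                    opponentCanWin = Winner(board) == opponent;
                    board[r] = 0;  // undo the reply
                }
            }
            board[m] = 0;  // undo our move
            if (!opponentCanWin) safe.Add(m);
        }
        return safe;
    }
}
```

If the filter returns an empty list, every move loses to correct play, and the caller would have to fall back to the unfiltered move list.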

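Maybe I am wrong; this was just a first guess...

【Comments】:

Thanks for your reply. So, if I understand correctly, your solution is to eliminate all moves that could let the (human) player win on the next turn. I have thought about that too, but I was hoping for something more elegant :-)
I'm usually not the type to get too theoretical, but I will think about it :) It's a very interesting problem!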