Boost Spirit X3 tokenizer with annotation does not work: answers

【Question title】: Boost Spirit X3 tokenizer with annotation does not work
【Posted】: 2021-01-07 14:56:02
【Question】:

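I recently tried to implement the simplest possible tokenizer with Boost Spirit X3. The challenge I am facing now is retrieving the position of each token in the input stream.

The official site has a good annotation tutorial: https://www.boost.org/doc/libs/develop/libs/spirit/doc/x3/html/spirit_x3/tutorials/annotation.html. However, it has a limitation: it parses a list of identical (homogeneous) entities, which is usually not the case in real life.

So I tried to create a tokenizer with 2 kinds of entities: whitespace (a sequence of whitespace characters) and single-line comment (starts with // and continues until the end of the line).

See the minimal example code at the end of the question.

However, I get an error when I try to retrieve the position of any token. After some debugging, I found that the annotate_position::on_success handler deduces the type T as boost::spirit::x3::unused_type, but I do not know why.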

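So, I have several questions:

1. What am I doing wrong? (I know this is not Stack Overflow style, but I have been struggling with this for several days and have no one to consult.) I have been trying to store the actual comment text as a string in the SingleLineComment and Whitespace classes, without success. I suspect this is because the comment and whitespace strings are omitted in the parser. Is there a way around that?
2. What is the best-practice approach to parsing heterogeneous structures?
3. Should I use some specialized library for this task (i.e. should I use the grammar class or spirit::lex, even though there is no such thing in the X3 version)?
4. Are there any examples of tokenizers (I was looking at "Getting started guide for Boost.Spirit?", but it is somewhat outdated)? As it stands, I think the documentation is not extensive enough to start writing something right away, and I am considering writing the tokenizer by hand. Advertised as a simple "get set go" library, Spirit turned out to be a pile of complex, barely documented templates that I do not fully understand.

Here is a minimal example: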

#include <string>
#include <iostream>
#include <functional>
#include <vector>
#include <optional>
#include <variant>

#include <boost/spirit/home/x3.hpp>
#include <boost/spirit/home/x3/support/ast/position_tagged.hpp>

using namespace std;
namespace x3 = boost::spirit::x3;

struct position_cache_tag;

// copy paste from boost documentation example
struct annotate_position
{
    template <typename T, typename Iterator, typename Context>
    inline void on_success(Iterator const &first, Iterator const &last, T &ast, Context const &context)
    {
        auto &position_cache = x3::get<position_cache_tag>(context).get();
        position_cache.annotate(ast, first, last);
    }
};

struct SingleLineComment : public x3::position_tagged
{
    // no need to store actual comment string,
    // since it is position tagged and
    // we can then find the corresponding
    // iterators afterwards, is this right?
};
struct Whitespace : public x3::position_tagged
{
    // same reasoning
};
// here can be another token types (e.g. MultilineComment, integer, identifier etc.)

struct Token : public x3::position_tagged
{
    // unites SingleLineComment and Whitespace
    // into a single Token class

    enum class Type
    {
        SingleLineComment,
        Whitespace
    };

    std::optional<Type> type; // type field should be set by semantic action
    // std::optional acts as a safeguard that type will be set

    std::optional<std::variant<SingleLineComment, Whitespace>> data;
    // same reasoning for std::optional
    // this field might be needed for more complex
    // tokens, which hold additional data
};

// unique on success hook classes
struct SingleLineCommentHook : public annotate_position
{
};
struct WhitespaceHook : public annotate_position
{
};
struct TokenHook : public annotate_position
{
};

// rules
const x3::rule<SingleLineCommentHook, SingleLineComment> singleLineComment = "single line comment";
const x3::rule<WhitespaceHook, Whitespace> whitespace = "whitespace";
const x3::rule<TokenHook, Token> token = "token";

// rule definitions
const auto singleLineComment_def = x3::lit("//") >> x3::omit[*(x3::char_ - '\n')];
const auto whitespace_def = x3::omit[+x3::ascii::space];

BOOST_SPIRIT_DEFINE(singleLineComment, whitespace);

auto _setSingleLineComment = [](const auto &context) {
    x3::_val(context).type = Token::Type::SingleLineComment;
    x3::_val(context).data = x3::_attr(context);
};
auto _setWhitespace = [](const auto &context) {
    x3::_val(context).type = Token::Type::Whitespace;
    x3::_val(context).data = x3::_attr(context);
};

const auto token_def = (singleLineComment[_setSingleLineComment] | whitespace[_setWhitespace]);

BOOST_SPIRIT_DEFINE(token);

int main()
{
    // copy paste from boost documentation example
    using iterator_type = std::string::const_iterator;
    using position_cache = boost::spirit::x3::position_cache<std::vector<iterator_type>>;

    std::string content = R"(// first single line comment

// second single line comment

    )";
    // expect 4 tokens: comment -> whitespace -> comment -> whitespace
    position_cache positions{content.cbegin(), content.cend()};

    std::vector<Token> tokens;
    const auto parser = x3::with<position_cache_tag>(std::ref(positions))[*token];

    auto start = content.cbegin();
    auto success = x3::phrase_parse(start, content.cend(), parser, x3::eps(false), tokens);
    success &= (start == content.cend());

    cout << boolalpha << success << endl;
    cout << "Found " << tokens.size() << " tokens" << endl;

    for (auto &token : tokens)
        cout << (token.type.value() == Token::Type::SingleLineComment ? "comment" : "space") << endl;

    // all good till this point

    // now I want to get a position
    // the following throws
    auto pos = positions.position_of(tokens.front());
}

Thanks for reading, looking forward to any replies!

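【Comments】:

Spirit has never been "get set go". IMO it can be a productivity boost if you know where the sweet spot is. Do not feel bad about hand-writing the tokenization. Spirit Lex was never very popular, and even with Qi it was hard to stay in the "sweet spot". In theory it was meant to improve performance by reducing backtracking, but it usually complicated the rules to the point where that no longer mattered. If you need any help getting over the initial hurdles, I am here. I can also do a quick review to predict whether X3 fits your use case.
Just remembered that I did some documentation/digging earlier on error handling versus position tagging: stackoverflow.com/a/61732124/85371.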

Tags: c++ boost boost-spirit boost-spirit-x3

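【Solution 1】:

The on_success annotation does not seem to happen when semantic actions are involved.

Actually, you are tagging both the AST nodes and the variant, which is redundant.

You can already get the correct result for the first token with, for example: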

auto pos = positions.position_of(
    std::get<SingleLineComment>(tokens.front().data.value()));

This is obviously not very convenient, because it requires switching on the static type.

Here is a much simplified version:

Live On Compiler Explorer

#include <iostream>
#include <iomanip>
#include <string>
#include <string_view>
#include <typeinfo>
#include <variant>
#include <vector>
#include <boost/spirit/home/x3.hpp>
#include <boost/spirit/home/x3/support/ast/position_tagged.hpp>
namespace x3 = boost::spirit::x3;

struct SingleLineComment{};
struct Whitespace       {};

using Variant = std::variant<SingleLineComment, Whitespace>;

struct Token : Variant, x3::position_tagged {
    using Variant::Variant;
};

namespace {
    struct position_cache_tag;
    namespace Parser {
        struct annotate_position {
            template <typename T, typename Iterator, typename Context>
                inline void on_success(Iterator first, Iterator last, T &ast, Context const &context) const {
                    auto &position_cache = x3::get<position_cache_tag>(context);
                    position_cache.annotate(ast, first, last);
                }
        };

        // unique on success hook classes
        template <typename> struct Hook {}; // no annotate_position mix-in
        template <> struct Hook<Token> : annotate_position   {};

        template <typename T>
        static auto constexpr as = [](auto p, char const* name = typeid(decltype(p)).name()) {
            return x3::rule<Hook<T>, T> {name} = p;
        };

        // rule definitions
        auto singleLineComment = as<SingleLineComment>("//" >> x3::omit[*(x3::char_ - x3::eol)]);
        auto whitespace        = as<Whitespace>       (x3::omit[+x3::ascii::space]);
        auto token             = as<Token>            (singleLineComment | whitespace, "token");
    }
}

int main() {
    using It             = std::string::const_iterator;
    using position_cache = x3::position_cache<std::vector<It>>;

    std::string const content = R"(// first single line comment

// second single line comment

    )";
    position_cache positions{content.begin(), content.end()};

    auto parser = x3::with<position_cache_tag>(positions)[*Parser::token];

    std::vector<Token> tokens;
    if (parse(begin(content), end(content), parser >> x3::eoi, tokens)) {
        std::cout << "Found " << tokens.size() << " tokens" << std::endl;

        for (auto& token : tokens) {
            auto pos = positions.position_of(token);
            std::cout
                << (token.index() ? "space" : "comment") << "\t"
                << std::quoted(std::string_view(&*pos.begin(), pos.size()))
                << std::endl;
        }
    }
}

Prints

Found 4 tokens
comment "// first single line comment"
space   "

"
comment "// second single line comment"
space   "

    "

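【Comments】:

Note the very subtle removal of std::ref with with<>. It is a quiet improvement (around 1.61?). Luckily, reference semantics are now the default.
Thanks, now I know what went wrong. I had seen your other answer on position tagging before, but noticed that it lacked semantic actions and asked again. Thanks for your help and time :)
I have one more question: is the as function required, as opposed to defining the rule for each class manually? I am trying to rewrite it that way and get a compile error, namely "BOOST_SPIRIT_DEFINE undefined for this rule", when compiling the token rule. @sehe
as<> is not required. I am just "lazy" and prefer it this way. I did not use BOOST_SPIRIT_DEFINE anyway. If you want to, you can: still lazy, of course.
If you still want to understand why you got "BOOST_SPIRIT_DEFINE undefined for this rule", we can continue this in chat and you can show the code. Or you can ask another question :)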