Using lexer token attributes in grammar rules with Lex and Qi from Boost.Spirit

【Question title】: Using lexer token attributes in grammar rules with Lex and Qi from Boost.Spirit
【Posted】: 2016-09-13 11:29:59
【Question】:

Let us consider the following code:

#include <boost/phoenix.hpp>
#include <boost/spirit/include/lex_lexertl.hpp>
#include <boost/spirit/include/qi.hpp>
#include <algorithm>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

namespace lex = boost::spirit::lex;
namespace qi = boost::spirit::qi;
namespace phoenix = boost::phoenix;

struct operation
{
    enum type
    {
        add,
        sub,
        mul,
        div
    };
};

template<typename Lexer>
class expression_lexer
    : public lex::lexer<Lexer>
{
public:
    typedef lex::token_def<operation::type> operator_token_type;
    typedef lex::token_def<double> value_token_type;
    typedef lex::token_def<std::string> variable_token_type;
    typedef lex::token_def<lex::omit> parenthesis_token_type;
    typedef std::pair<parenthesis_token_type, parenthesis_token_type> parenthesis_token_pair_type;
    typedef lex::token_def<lex::omit> whitespace_token_type;

    expression_lexer()
        : operator_add('+'),
          operator_sub('-'),
          operator_mul("[x*]"),
          operator_div("[:/]"),
          value("\\d+(\\.\\d+)?"),
          variable("%(\\w+)"),
          parenthesis({
            std::make_pair(parenthesis_token_type('('), parenthesis_token_type(')')),
            std::make_pair(parenthesis_token_type('['), parenthesis_token_type(']'))
          }),
          whitespace("[ \\t]+")
    {
        this->self
            += operator_add [lex::_val = operation::add]
            | operator_sub [lex::_val = operation::sub]
            | operator_mul [lex::_val = operation::mul]
            | operator_div [lex::_val = operation::div]
            | value
            | variable [lex::_val = phoenix::construct<std::string>(lex::_start + 1, lex::_end)]
            | whitespace [lex::_pass = lex::pass_flags::pass_ignore]
            ;

        std::for_each(parenthesis.cbegin(), parenthesis.cend(),
            [&](parenthesis_token_pair_type const& token_pair)
            {
                this->self += token_pair.first | token_pair.second;
            }
        );
    }

    operator_token_type operator_add;
    operator_token_type operator_sub;
    operator_token_type operator_mul;
    operator_token_type operator_div;

    value_token_type value;
    variable_token_type variable;

    std::vector<parenthesis_token_pair_type> parenthesis;

    whitespace_token_type whitespace;
};

template<typename Iterator>
class expression_grammar
    : public qi::grammar<Iterator>
{
public:
    template<typename Tokens>
    explicit expression_grammar(Tokens const& tokens)
        : expression_grammar::base_type(start)
    {
        start                     %= expression >> qi::eoi;

        expression                %= sum_operand >> -(sum_operator >> expression);
        sum_operator              %= tokens.operator_add | tokens.operator_sub;
        sum_operand               %= fac_operand >> -(fac_operator >> sum_operand);
        fac_operator              %= tokens.operator_mul | tokens.operator_div;

        if(!tokens.parenthesis.empty())
            fac_operand           %= parenthesised | terminal;
        else
            fac_operand           %= terminal;

        terminal                  %= tokens.value | tokens.variable;

        if(!tokens.parenthesis.empty())
        {
            parenthesised         %= tokens.parenthesis.front().first >> expression >> tokens.parenthesis.front().second;
            std::for_each(tokens.parenthesis.cbegin() + 1, tokens.parenthesis.cend(),
                [&](typename Tokens::parenthesis_token_pair_type const& token_pair)
                {
                    parenthesised %= parenthesised.copy() | (token_pair.first >> expression >> token_pair.second);
                }
            );
        }
    }

private:
    qi::rule<Iterator> start;
    qi::rule<Iterator> expression;
    qi::rule<Iterator> sum_operand;
    qi::rule<Iterator> sum_operator;
    qi::rule<Iterator> fac_operand;
    qi::rule<Iterator> fac_operator;
    qi::rule<Iterator> terminal;
    qi::rule<Iterator> parenthesised;
};


int main()
{
    typedef lex::lexertl::token<std::string::const_iterator, boost::mpl::vector<operation::type, double, std::string>> token_type;
    typedef expression_lexer<lex::lexertl::actor_lexer<token_type>> expression_lexer_type;
    typedef expression_lexer_type::iterator_type expression_lexer_iterator_type;
    typedef expression_grammar<expression_lexer_iterator_type> expression_grammar_type;

    expression_lexer_type lexer;
    expression_grammar_type grammar(lexer);

    while(std::cin)
    {
        std::string line;
        std::getline(std::cin, line);

        std::string::const_iterator first = line.begin();
        std::string::const_iterator const last = line.end();

        bool const result = lex::tokenize_and_parse(first, last, lexer, grammar);
        if(!result)
            std::cout << "Parsing failed! Remainder: >" << std::string(first, last) << "<" << std::endl;
        else
        {
            if(first != last)
                std::cout << "Parsing succeeded! Remainder: >" << std::string(first, last) << "<" << std::endl;
            else
                std::cout << "Parsing succeeded!" << std::endl;
        }
    }
}

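It is a simple parser for arithmetic expressions with values and variables. It is built by using expression_lexer to extract tokens and then expression_grammar to parse those tokens.

Using a lexer for such a small case might look like overkill, and it probably is. But that is the cost of a simplified example. Also note that a lexer makes it easy to define tokens with regular expressions, and equally easy to define them from external code (user-provided configuration in particular). With the example provided, it would be no problem at all to read the token definitions from an external config file and, for example, let the user change variables from %name to $name.

The code seems to work fine (checked on Visual Studio 2013 with Boost 1.61).

expression_lexer has attributes attached to its tokens. I guess they work, since they compile. But I don't really know how to check.

Ultimately, I would like the grammar to build me a std::vector holding the expression in reverse Polish notation. (Each element would be a boost::variant over operation::type, double, or std::string.)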
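For illustration only, the target data structure described above might look like the following sketch. It assumes C++17 and uses std::variant as a stand-in for boost::variant; the names op_type, rpn_cell, rpn_program, cell_to_string, and program_to_string are hypothetical and do not appear in the code above.

```cpp
#include <sstream>
#include <string>
#include <variant>
#include <vector>

// Mirrors operation::type from the question (renamed here; an assumption).
enum class op_type { add, sub, mul, div };

// One RPN element: an operator, a numeric value, or a variable name.
// std::variant stands in for boost::variant in this sketch.
using rpn_cell = std::variant<op_type, double, std::string>;
using rpn_program = std::vector<rpn_cell>;

// Hypothetical helper: renders a single cell as text.
inline std::string cell_to_string(rpn_cell const& cell) {
    struct printer {
        std::string operator()(op_type op) const {
            switch (op) {
                case op_type::add: return "+";
                case op_type::sub: return "-";
                case op_type::mul: return "*";
                case op_type::div: return "/";
            }
            return "?";
        }
        std::string operator()(double value) const {
            std::ostringstream os;
            os << value;
            return os.str();
        }
        // Variables are printed with the % prefix used by the lexer above.
        std::string operator()(std::string const& name) const { return "%" + name; }
    };
    return std::visit(printer{}, cell);
}

// Hypothetical helper: renders a whole program, cells separated by spaces.
inline std::string program_to_string(rpn_program const& program) {
    std::string out;
    for (auto const& cell : program) {
        if (!out.empty()) out += ' ';
        out += cell_to_string(cell);
    }
    return out;
}
```

With this layout, an input such as `3 + %x` would be expected to end up as the three cells `3`, `x`, `+`, in that order.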

The problem, however, is that I failed to use the token attributes in my expression_grammar. For example, if you try to change sum_operator in the following way:

qi::rule<Iterator, operation::type ()> sum_operator;

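you will get a compilation error. I expected this to work, since operation::type is the attribute of both operator_add and operator_sub, and therefore also of their alternative. And still it does not compile. Judging from the error in assign_to_attribute_from_iterators, the parser seems to try to build the attribute value directly from the input stream range. That would mean it ignores the [lex::_val = operation::add] I specified in my lexer.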

Changing that to

qi::rule<Iterator, operation::type (operation::type)> sum_operator;

did not help either.

I also tried changing the definition to

sum_operator %= (tokens.operator_add | tokens.operator_sub) [qi::_val = qi::_1];

That did not help either.

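How can I work around this? I know I could use symbols from Qi. But I want to keep the lexer so that the regular expressions for the tokens stay easy to configure. I could also extend assign_to_attribute_from_iterators as described in the documentation, but that doubles the work. I guess I could also drop the attributes on the lexer and have them only in the grammar. But that again does not fit well with the flexibility of the variable token (in my actual case there is slightly more logic there, so that the part of the token that forms the actual name of the variable is configurable too, whereas here it is fixed to skipping the first character). Anything else?

Also a side question, in case someone knows. Is there a way to access the capture groups of the token's regular expression from the token action? So that instead of having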

variable [lex::_val = phoenix::construct<std::string>(lex::_start + 1, lex::_end)]

I could build the string from the capture group, and so easily handle formats like $var$.
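One possible workaround that does not rely on capture groups is to strip a configured prefix and suffix from the matched text by hand. The sketch below assumes the decoration around the name is fixed text known from configuration; variable_format and extract_variable_name are hypothetical names, not part of Boost.Spirit.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical configuration: the fixed text around a variable name,
// e.g. {"%", ""} for %name or {"$", "$"} for $var$.
struct variable_format {
    std::string prefix;
    std::string suffix;
};

// Returns the name part of a matched token, or std::nullopt if the token
// does not carry the configured prefix and suffix around a non-empty name.
inline std::optional<std::string>
extract_variable_name(std::string_view token, variable_format const& fmt) {
    std::size_t const decoration = fmt.prefix.size() + fmt.suffix.size();
    if (token.size() <= decoration) return std::nullopt;
    if (token.substr(0, fmt.prefix.size()) != fmt.prefix) return std::nullopt;
    if (token.substr(token.size() - fmt.suffix.size()) != fmt.suffix) return std::nullopt;
    return std::string(token.substr(fmt.prefix.size(), token.size() - decoration));
}
```

A helper like this could be applied to the matched range at the point where the lexer action above currently constructs the string from lex::_start + 1 and lex::_end. How best to wire it into a Phoenix action is left open here.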

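Edited! I have improved the whitespace skipping, following the conclusions from "Whitespace skipper when using Boost.Spirit Qi and Lex". It is a simplification that does not affect the questions asked here.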

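【Comments】:

Regarding the "side question [...] capture groups [...] of the token": you can't. They are not regular expressions. The syntax closely resembles a subset of one, yes.
I'd parse into an AST and then transform that into RPN. I'll finish an example when I have time; for now, here it is: coliru.stacked-crooked.com/a/b01dfd4898103ba5
@sehe Based on your comment above and some experience I have gathered since, I was finally able to answer my own earlier question in this area: stackoverflow.com/a/39510064/422489. You might want to take a look and see whether there is anything to add or expand on.

Tags: c++ boost boost-spirit boost-spirit-qi boost-spirit-lex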

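【Solution 1】:

OK, here is my take on the RPN "requirement". I strongly favor natural (automatic) attribute propagation over semantic actions (see Boost Spirit: "Semantic actions are evil"?)

I consider the other options to be (uglifying) optimizations. You might do them if you are happy with the overall design and don't mind making it harder to maintain :)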

Live On Coliru

Beyond the sample from my comment, which you have already studied, I added the RPN transformation step:

namespace RPN {
    using cell      = boost::variant<AST::operation, AST::value, AST::variable>;
    using rpn_stack = std::vector<cell>;

    struct transform : boost::static_visitor<> {
        void operator()(rpn_stack& stack, AST::expression const& e) const {
            boost::apply_visitor(boost::bind(*this, boost::ref(stack), ::_1), e);
        }
        void operator()(rpn_stack& stack, AST::bin_expr const& e) const {
            (*this)(stack, e.lhs);
            (*this)(stack, e.rhs);
            stack.push_back(e.op);
        }
        void operator()(rpn_stack& stack, AST::value    const& v) const { stack.push_back(v); }
        void operator()(rpn_stack& stack, AST::variable const& v) const { stack.push_back(v); }
    };
}

That's all! Use it like so, for example:

RPN::transform compiler;
RPN::rpn_stack program;
compiler(program, expr);

for (auto& instr : program) {
    std::cout << instr << " ";
}

Output:

Parsing success: (3 + (8 * 9))
3 8 9 * +
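To illustrate how such an RPN program might be consumed, here is a minimal stack-based evaluator sketch. It assumes C++17 and uses std::variant in place of boost::variant; op_type, rpn_cell, rpn_program, environment, and evaluate are hypothetical names that are not part of the listing in this answer.

```cpp
#include <map>
#include <optional>
#include <string>
#include <variant>
#include <vector>

// Stand-ins for AST::operation and RPN::cell (assumptions for this sketch).
enum class op_type { add, sub, mul, div };
using rpn_cell = std::variant<op_type, double, std::string>;
using rpn_program = std::vector<rpn_cell>;

// Hypothetical variable bindings, keyed by variable name.
using environment = std::map<std::string, double>;

// Evaluates an RPN program with a value stack. Returns std::nullopt for
// malformed programs (too few operands, leftover values) or unknown variables.
inline std::optional<double> evaluate(rpn_program const& program, environment const& env) {
    std::vector<double> stack;
    for (auto const& cell : program) {
        if (auto const* value = std::get_if<double>(&cell)) {
            stack.push_back(*value);
        } else if (auto const* name = std::get_if<std::string>(&cell)) {
            auto const found = env.find(*name);
            if (found == env.end()) return std::nullopt;
            stack.push_back(found->second);
        } else {
            // Operator: pop the right operand first, then the left one.
            if (stack.size() < 2) return std::nullopt;
            double const rhs = stack.back(); stack.pop_back();
            double const lhs = stack.back(); stack.pop_back();
            switch (std::get<op_type>(cell)) {
                case op_type::add: stack.push_back(lhs + rhs); break;
                case op_type::sub: stack.push_back(lhs - rhs); break;
                case op_type::mul: stack.push_back(lhs * rhs); break;
                case op_type::div: stack.push_back(lhs / rhs); break;
            }
        }
    }
    if (stack.size() != 1) return std::nullopt;
    return stack.back();
}
```

For the program `3 8 9 * +` shown above, the evaluator first multiplies 8 by 9 and then adds 3, giving 75.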

Full listing

//#define BOOST_SPIRIT_DEBUG
#include <boost/phoenix.hpp>
#include <boost/bind.hpp>
#include <boost/fusion/adapted/struct.hpp>
#include <boost/spirit/include/lex_lexertl.hpp>
#include <boost/spirit/include/qi.hpp>
#include <algorithm>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

namespace lex     = boost::spirit::lex;
namespace qi      = boost::spirit::qi;
namespace phoenix = boost::phoenix;

struct operation
{
    enum type
    {
        add,
        sub,
        mul,
        div
    };

    friend std::ostream& operator<<(std::ostream& os, type op) {
        switch (op) {
            case type::add: return os << "+";
            case type::sub: return os << "-";
            case type::mul: return os << "*";
            case type::div: return os << "/";
        }
        return os << "<" << static_cast<int>(op) << ">";
    }
};

template<typename Lexer>
class expression_lexer
    : public lex::lexer<Lexer>
{
public:
    //typedef lex::token_def<operation::type> operator_token_type;
    typedef lex::token_def<lex::omit> operator_token_type;
    typedef lex::token_def<double> value_token_type;
    typedef lex::token_def<std::string> variable_token_type;

    typedef lex::token_def<lex::omit> parenthesis_token_type;
    typedef std::pair<parenthesis_token_type, parenthesis_token_type> parenthesis_token_pair_type;
    typedef lex::token_def<lex::omit> whitespace_token_type;

    expression_lexer()
        : operator_add('+'),
          operator_sub('-'),
          operator_mul("[x*]"),
          operator_div("[:/]"),
          value("\\d+(\\.\\d+)?"),
          variable("%(\\w+)"),
          parenthesis({
            std::make_pair(parenthesis_token_type('('), parenthesis_token_type(')')),
            std::make_pair(parenthesis_token_type('['), parenthesis_token_type(']'))
          }),
          whitespace("[ \\t]+")
    {
        this->self
            += operator_add [lex::_val = operation::add]
             | operator_sub [lex::_val = operation::sub]
             | operator_mul [lex::_val = operation::mul]
             | operator_div [lex::_val = operation::div]
             | value
             | variable [lex::_val = phoenix::construct<std::string>(lex::_start + 1, lex::_end)]
             | whitespace [lex::_pass = lex::pass_flags::pass_ignore]
             ;

        std::for_each(parenthesis.cbegin(), parenthesis.cend(),
            [&](parenthesis_token_pair_type const& token_pair)
            {
                this->self += token_pair.first | token_pair.second;
            }
        );
    }

    operator_token_type operator_add;
    operator_token_type operator_sub;
    operator_token_type operator_mul;
    operator_token_type operator_div;

    value_token_type value;
    variable_token_type variable;

    std::vector<parenthesis_token_pair_type> parenthesis;

    whitespace_token_type whitespace;
};

namespace AST {
    using operation = operation::type;

    using value     = double;
    using variable  = std::string;

    struct bin_expr;
    using expression = boost::variant<value, variable, boost::recursive_wrapper<bin_expr> >;

    struct bin_expr {
        expression lhs, rhs;
        operation op;

        friend std::ostream& operator<<(std::ostream& os, bin_expr const& be) {
            return os << "(" << be.lhs << " " << be.op << " " << be.rhs << ")";
        }
    };
}

BOOST_FUSION_ADAPT_STRUCT(AST::bin_expr, lhs, op, rhs)

template<typename Iterator>
class expression_grammar : public qi::grammar<Iterator, AST::expression()>
{
public:
    template<typename Tokens>
    explicit expression_grammar(Tokens const& tokens)
        : expression_grammar::base_type(start)
    {
        start                     = expression >> qi::eoi;

        bin_sum_expr              = sum_operand >> sum_operator >> expression;
        bin_fac_expr              = fac_operand >> fac_operator >> sum_operand;

        expression                = bin_sum_expr | sum_operand;
        sum_operand               = bin_fac_expr | fac_operand;

        sum_operator              = tokens.operator_add >> qi::attr(AST::operation::add) | tokens.operator_sub >> qi::attr(AST::operation::sub);
        fac_operator              = tokens.operator_mul >> qi::attr(AST::operation::mul) | tokens.operator_div >> qi::attr(AST::operation::div);

        if(tokens.parenthesis.empty()) {
            fac_operand           = terminal;
        }
        else {
            fac_operand           = parenthesised | terminal;

            parenthesised         = tokens.parenthesis.front().first >> expression >> tokens.parenthesis.front().second;
            std::for_each(tokens.parenthesis.cbegin() + 1, tokens.parenthesis.cend(),
                    [&](typename Tokens::parenthesis_token_pair_type const& token_pair)
                    {
                        parenthesised = parenthesised.copy() | (token_pair.first >> expression >> token_pair.second);
                    });
        }

        terminal                  = tokens.value | tokens.variable;

        BOOST_SPIRIT_DEBUG_NODES(
                (start) (expression) (bin_sum_expr) (bin_fac_expr)
                (fac_operand) (terminal) (parenthesised) (sum_operand)
                (sum_operator) (fac_operator)
            );
    }

private:
    qi::rule<Iterator, AST::expression()> start;
    qi::rule<Iterator, AST::expression()> expression;
    qi::rule<Iterator, AST::expression()> sum_operand;
    qi::rule<Iterator, AST::expression()> fac_operand;
    qi::rule<Iterator, AST::expression()> terminal;
    qi::rule<Iterator, AST::expression()> parenthesised;

    qi::rule<Iterator, int()> sum_operator;
    qi::rule<Iterator, int()> fac_operator;

    // extra rules to help with AST creation
    qi::rule<Iterator, AST::bin_expr()> bin_sum_expr;
    qi::rule<Iterator, AST::bin_expr()> bin_fac_expr;
};

namespace RPN {
    using cell      = boost::variant<AST::operation, AST::value, AST::variable>;
    using rpn_stack = std::vector<cell>;

    struct transform : boost::static_visitor<> {
        void operator()(rpn_stack& stack, AST::expression const& e) const {
            boost::apply_visitor(boost::bind(*this, boost::ref(stack), ::_1), e);
        }
        void operator()(rpn_stack& stack, AST::bin_expr const& e) const {
            (*this)(stack, e.lhs);
            (*this)(stack, e.rhs);
            stack.push_back(e.op);
        }
        void operator()(rpn_stack& stack, AST::value    const& v) const { stack.push_back(v); }
        void operator()(rpn_stack& stack, AST::variable const& v) const { stack.push_back(v); }
    };
}

int main()
{
    typedef lex::lexertl::token<std::string::const_iterator, boost::mpl::vector<operation::type, double, std::string>> token_type;
    typedef expression_lexer<lex::lexertl::actor_lexer<token_type>> expression_lexer_type;
    typedef expression_lexer_type::iterator_type expression_lexer_iterator_type;
    typedef expression_grammar<expression_lexer_iterator_type> expression_grammar_type;

    expression_lexer_type lexer;
    expression_grammar_type grammar(lexer);
    RPN::transform compiler;

    std::string line;
    while(std::getline(std::cin, line) && !line.empty())
    {
        std::string::const_iterator first = line.begin();
        std::string::const_iterator const last = line.end();

        AST::expression expr;
        bool const result = lex::tokenize_and_parse(first, last, lexer, grammar, expr);
        if(!result)
            std::cout << "Parsing failed!\n";
        else
        {
            std::cout << "Parsing success: " << expr << "\n";

            RPN::rpn_stack program;
            compiler(program, expr);

            for (auto& instr : program) {
                std::cout << instr << " ";
            }
        }

        if(first != last)
            std::cout << "Remainder: >" << std::string(first, last) << "<\n";
    }
}

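【Discussion】:

The point was how to use Lex token attributes in Qi rules. You don't seem to answer that directly. Indirectly, I notice that you dropped the operation::type attribute from the operator token type and instead (I guess this is how it works) artificially inject that attribute from the grammar itself (using qi::attr). While this works, you did not explain why you did it. (In particular, you kept the lexer actions for the operators. Did you forget to remove them?) I will think about it a bit more before accepting your answer. Thanks!
Recently I also noticed that, contrary to what I thought before, the attribute (double) of the value token is not magically constructed from the token's matched range. Instead, the token's value is the matched range, even though that type is not listed in the mpl::vector of token_type; it seems to be added by "template magic". This makes the double attribute on the value token pointless... (Unless, I guess, I write a proper semantic action for it...)
Would you also care to explain BOOST_FUSION_ADAPT_STRUCT and why it lists the members in a different order? Are the bin_sum_expr and bin_fac_expr rules there to let Qi synthesize the attribute automatically?
This is (unsurprisingly) a really great answer. The only thing I find missing is a warning about cases like this one.
So far my investigation leads me to think that the problem is narrower than I originally thought. Lexer token attributes are generally handled fine in Qi, except for enum types. So far I don't know why that is. It probably makes your qi::attr trick necessary. But why is that so?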