【Question Title】: Boost Spirit Signals Successful Parsing Despite Token Being Incomplete
【Posted】: 2012-10-12 18:24:42
【Question】:

I have a very simple path construct that I am trying to parse with Boost Spirit.Lex.

We have the following grammar:

token := [a-z]+
path := (token : path) | (token)

So we are only talking about colon-separated lowercase ASCII strings here.
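
For reference, the intended behavior of this grammar can be written down by hand in plain C++. This is only a sketch of what "valid" means here, not Spirit code; the helper names `consume_token` and `is_valid_path` are made up for illustration.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: consumes one token, i.e. one or more characters
// in [a-z], starting at pos. Returns false and leaves pos unchanged if
// there is no such character at pos.
bool consume_token(const std::string& s, std::size_t& pos) {
    std::size_t start = pos;
    while (pos < s.size() && s[pos] >= 'a' && s[pos] <= 'z') ++pos;
    return pos > start;
}

// Returns true only if the WHOLE string is a valid path:
//   path := (token : path) | (token)
bool is_valid_path(const std::string& s) {
    std::size_t pos = 0;
    if (!consume_token(s, pos)) return false;      // a path starts with a token
    while (pos < s.size()) {
        if (s[pos] != ':') return false;           // only a separator may follow a token
        ++pos;
        if (!consume_token(s, pos)) return false;  // a separator must be followed by a token
    }
    return true;                                   // nothing left over
}
```

The important detail is the final condition: the function only reports success once the end of the string has been reached, which is the check the Spirit version below is missing.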

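I have three examples: "xyz", "abc:xyz", and "abc:xyz:".

The first two should be considered valid. The third has a trailing colon and should not be considered valid. Unfortunately, the parser I have accepts all three. The grammar should not allow an empty token, but apparently Spirit is doing just that. What am I missing to get the third one rejected?

Also, if you read the code below, the comments contain another version of the parser that requires every path to end with a semicolon. When I activate those lines I get the proper behavior (i.e. "abc:xyz:;" is rejected), but that is not really what I want.

Does anyone have any ideas?

Thanks.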

#include <boost/config/warning_disable.hpp>
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/lex_lexertl.hpp>
#include <boost/spirit/include/phoenix_operator.hpp>

#include <iostream>
#include <string>
#include <vector>

using namespace boost::spirit;
using boost::phoenix::val;

template<typename Lexer>
struct PathTokens : boost::spirit::lex::lexer<Lexer>
{
      PathTokens()
      {
         identifier = "[a-z]+";
         separator = ":";

         this->self.add
            (identifier)
            (separator)
            (';')
            ;
      }
      boost::spirit::lex::token_def<std::string> identifier, separator;
};


template <typename Iterator>
struct PathGrammar 
   : boost::spirit::qi::grammar<Iterator> 
{
      template <typename TokenDef>
      PathGrammar(TokenDef const& tok)
         : PathGrammar::base_type(path)
      {
         using boost::spirit::_val;
         path
            = 
            (token >> tok.separator >> path)[std::cerr << _1 << "\n"]
            |
            //(token >> ';')[std::cerr << _1 << "\n"]
            (token)[std::cerr << _1 << "\n"]
             ; 

          token 
             = (tok.identifier) [_val=_1]
          ;

      }
      boost::spirit::qi::rule<Iterator> path;
      boost::spirit::qi::rule<Iterator, std::string()> token;
};


int main()
{
   typedef std::string::iterator BaseIteratorType;
   typedef boost::spirit::lex::lexertl::token<BaseIteratorType, boost::mpl::vector<std::string> > TokenType;
   typedef boost::spirit::lex::lexertl::lexer<TokenType> LexerType;
   typedef PathTokens<LexerType>::iterator_type TokensIterator;
   typedef std::vector<std::string> Tests;

   Tests paths;
   paths.push_back("abc");
   paths.push_back("abc:xyz");
   paths.push_back("abc:xyz:");
   /*
     paths.clear();
     paths.push_back("abc;");
     paths.push_back("abc:xyz;");
     paths.push_back("abc:xyz:;");
   */
   for ( Tests::iterator iter = paths.begin(); iter != paths.end(); ++iter )
   {
      std::string str = *iter;
      std::cerr << "*****" << str << "*****\n";

      PathTokens<LexerType> tokens;
      PathGrammar<TokensIterator> grammar(tokens);

      BaseIteratorType first = str.begin();
      BaseIteratorType last = str.end();

      bool r = boost::spirit::lex::tokenize_and_parse(first, last, tokens, grammar);

      std::cerr << r << " " << (first==last) << "\n";
   }
}

【Comments】:

    Tags: c++ boost boost-spirit boost-spirit-lex


    【Solution 1】:

    In addition to what llonesmiz has already said, here is a trick using qi::eoi that I sometimes use:

    path = (
               (token >> tok.separator >> path) [std::cerr << _1 << "\n"]
             | token                           [std::cerr << _1 << "\n"]
        ) >> eoi;
    

    This makes the grammar require eoi (end of input) at the end of a successful match. That leads to the desired result:

    http://liveworkspace.org/code/23a7adb11889bbb2825097d7c553f71d

    *****abc*****
    abc
    1 1
    *****abc:xyz*****
    xyz
    abc
    1 1
    *****abc:xyz:*****
    xyz
    abc
    0 1
    

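    【Comments】:

      【Solution 2】:

      The problem lies in what first and last mean after your call to tokenize_and_parse. The check first==last tells you whether your string has been completely tokenized; you cannot infer anything about the grammar from it. If you isolate the parsing like this, you get the expected result: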

        PathTokens<LexerType> tokens;
        PathGrammar<TokensIterator> grammar(tokens);
      
        BaseIteratorType first = str.begin();
        BaseIteratorType last = str.end();
      
        LexerType::iterator_type lexfirst = tokens.begin(first,last);
        LexerType::iterator_type lexlast = tokens.end();
      
      
        bool r = parse(lexfirst, lexlast, grammar);
      
        std::cerr << r << " " << (lexfirst==lexlast) << "\n";
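
The difference between "the string was fully tokenized" and "the token stream was fully parsed" can be modeled in plain C++. This is a hand-written illustration, not how Spirit is implemented; `TokenKind`, `tokenize`, `parse_path`, `Result` and `check` are hypothetical names.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hand-written model: ID stands for [a-z]+ and SEP for ':'.
enum TokenKind { ID, SEP };

// Stage 1: split the input into tokens. Returns false if a character
// cannot be tokenized. Reaching the end of the string here says nothing
// about whether the tokens form a valid path.
bool tokenize(const std::string& s, std::vector<TokenKind>& out) {
    std::size_t i = 0;
    while (i < s.size()) {
        if (s[i] == ':') {
            out.push_back(SEP);
            ++i;
        } else if (s[i] >= 'a' && s[i] <= 'z') {
            while (i < s.size() && s[i] >= 'a' && s[i] <= 'z') ++i;
            out.push_back(ID);
        } else {
            return false;
        }
    }
    return true;
}

// Stage 2: match the longest prefix of the token stream that forms a
// path and report how many tokens were consumed. Returns false only if
// the stream does not start with an ID.
bool parse_path(const std::vector<TokenKind>& toks, std::size_t& consumed) {
    consumed = 0;
    if (toks.empty() || toks[0] != ID) return false;
    consumed = 1;
    while (consumed + 1 < toks.size() &&
           toks[consumed] == SEP && toks[consumed + 1] == ID) {
        consumed += 2;
    }
    return true;
}

struct Result {
    bool tokenized;            // every character became part of a token
    bool parsed;               // the start rule matched some prefix
    bool all_tokens_consumed;  // the match covered the whole token stream
};

Result check(const std::string& s) {
    Result r = { false, false, false };
    std::vector<TokenKind> toks;
    r.tokenized = tokenize(s, toks);
    if (!r.tokenized) return r;
    std::size_t consumed = 0;
    r.parsed = parse_path(toks, consumed);
    r.all_tokens_consumed = r.parsed && consumed == toks.size();
    return r;
}
```

For "abc:xyz:" this model reports `tokenized` and `parsed` as true but `all_tokens_consumed` as false, which corresponds to comparing the lexer iterators instead of the string iterators.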
      

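      【Comments】:

      • I plugged in your code and the lexer's iterators are not equal, so at least the problem is detectable. But is there any reason why "r" should not be false? If I give the parser just ":", it should return false.
      • In the documentation you can see that the parse function "returns true if none of the involved parser components failed, and false otherwise". The way I understand it, if the grammar can match your "start rule" (path in your example), it returns true regardless of how much of the string was parsed. That is why you need to check first==last to make sure your entire text has been parsed.
      • That makes sense. Since writing the original post I have been experimenting with operator ">" instead of operator ">>". When I do that, I get an exception when the trailing colon is encountered. This seems a bit inconsistent with the behavior of the ">>" operator, but in any case I definitely have at least one way forward. Thanks for your help.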
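      【Solution 3】:

      This is what I ended up with. It uses the suggestions from @sehe and @llonesmiz. Note the conversion to std::wstring and the use of actions in the grammar definition, neither of which was present in the original post.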

      #include <boost/config/warning_disable.hpp>
      #include <boost/spirit/include/qi.hpp>
      #include <boost/spirit/include/lex_lexertl.hpp>
      #include <boost/spirit/include/phoenix_operator.hpp>
      #include <boost/bind.hpp>
      
      #include <iostream>
      #include <string>
      #include <vector>
      
      //
      // This example uses boost spirit to parse a simple
      // colon-delimited grammar.
      //
      // The grammar we want to recognize is:
      //    identifier := [a-z]+
      //    separator = :
      //    path= (identifier separator path) | identifier
      //
      // From the boost spirit perspective this example shows
      // a few things I found hard to come by when building my
      // first parser.
      //    1. How to flag an incomplete token at the end of input
      //       as an error. (use of boost::spirit::eoi)
      //    2. How to bind an action on an instance of an object
      //       that is taken as input to the parser.
      //    3. Use of std::wstring.
      //    4. Use of the lexer iterator.
      //
      
      // This using directive will cause issues with boost::bind
      // when referencing placeholders such as _1.
      // using namespace boost::spirit;
      
      //! A class that tokenizes our input.
      template<typename Lexer>
      struct Tokens : boost::spirit::lex::lexer<Lexer>
      {
            Tokens()
            {
               identifier = L"[a-z]+";
               separator = L":";
      
               this->self.add
                  (identifier)
                  (separator)
                  ;
            }
            boost::spirit::lex::token_def<std::wstring, wchar_t> identifier, separator;
      };
      
      //! This class provides a callback that echoes strings to stderr.
      struct Echo
      {
            void echo(boost::fusion::vector<std::wstring> const& t) const
            {
               using namespace boost::fusion;
               std::wcerr << at_c<0>(t) << "\n";
            }
      };
      
      
      //! The definition of our grammar, as described above.
      template <typename Iterator>
      struct Grammar : boost::spirit::qi::grammar<Iterator> 
      {
            template <typename TokenDef>
            Grammar(TokenDef const& tok, Echo const& e)
               : Grammar::base_type(path)
            {
               using boost::spirit::_val;
               path
                  = 
                  ((token >> tok.separator >> path)[boost::bind(&Echo::echo, e,::_1)]
                   |
                   (token)[boost::bind(&Echo::echo, &e, ::_1)]
                   ) >> boost::spirit::eoi; // Look for end of input.
      
                token 
                   = (tok.identifier) [_val=boost::spirit::qi::_1]
                ;
      
            }
            boost::spirit::qi::rule<Iterator> path;
            boost::spirit::qi::rule<Iterator, std::wstring()> token;
      };
      
      
      int main()
      {
         // A set of typedefs to make things a little clearer. This stuff is
         // well described in the boost spirit documentation/examples.
         typedef std::wstring::iterator BaseIteratorType;
         typedef boost::spirit::lex::lexertl::token<BaseIteratorType, boost::mpl::vector<std::wstring> > TokenType;
         typedef boost::spirit::lex::lexertl::lexer<TokenType> LexerType;
         typedef Tokens<LexerType>::iterator_type TokensIterator;
         typedef LexerType::iterator_type LexerIterator;
      
         // Define some paths to parse.
         typedef std::vector<std::wstring> Tests;
         Tests paths;
         paths.push_back(L"abc");
         paths.push_back(L"abc:xyz");
         paths.push_back(L"abc:xyz:");
         paths.push_back(L":");
      
         // Parse 'em.
         for ( Tests::iterator iter = paths.begin(); iter != paths.end(); ++iter )
         {
            std::wstring str = *iter;
            std::wcerr << L"*****" << str << L"*****\n";
      
            Echo e;
            Tokens<LexerType> tokens;
            Grammar<TokensIterator> grammar(tokens, e);
      
            BaseIteratorType first = str.begin();
            BaseIteratorType last = str.end();
      
            // Have the lexer consume our string.
            LexerIterator lexFirst = tokens.begin(first, last);
            LexerIterator lexLast = tokens.end();
      
            // Have the parser consume the output of the lexer.
            bool r = boost::spirit::qi::parse(lexFirst, lexLast, grammar);
      
            // Print the status and whether or not all output of the lexer
            // was processed.
            std::wcerr << r << L" " << (lexFirst==lexLast) << L"\n";
         }
      }
      

      【Comments】:
