How to split a sentence with an escaped whitespace? Answers

【Question title】: How to split a sentence with an escaped whitespace?
【Posted】: 2015-04-01 00:33:26
【Question】:

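I want to split my sentence using whitespace as the delimiter, except where the whitespace is escaped. How can I do this with boost::split and a regular expression? If that is not possible, what else would work?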

Example:

std::string sentence = "My dog Fluffy\\ Cake likes to jump";

Result:
My
dog
Fluffy\ Cake
likes
to
jump

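【Comments】:

You can do this with std::stringstream (stackoverflow.com/a/236803/4603670) or with a regular expression (regexr.com).
@BarmakShemirani How would you handle the escaped space?
@sehe, you could use Boost Spirit, Boost Regex, or a handwritten parser.
@BarmakShemirani LOL. I'll take that as a compliment :)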

Tags: c++ boost split whitespace delimiter

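【Solution 1】:

Three ways to implement this:

With Boost Spirit
Using Boost Regex
Handwritten parser

With Boost Spirit

Here is how I would do it with Boost Spirit. It may look like overkill, but experience tells me that once you are splitting input text, you will probably need more parsing logic before long.

Boost Spirit shines when you grow from "just splitting tokens" into a real grammar with production rules.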

#include <boost/spirit/include/qi.hpp>
#include <iostream>
#include <string>
#include <vector>
namespace qi = boost::spirit::qi;

int main() {
    std::string const sentence = "My dog Fluffy\\ Cake likes to jump";
    using It = std::string::const_iterator;
    It f = sentence.begin(), l = sentence.end();

    std::vector<std::string> words;

    bool ok = qi::phrase_parse(f, l,
            *qi::lexeme [ +('\\' >> qi::char_ | qi::graph) ], // words
            qi::space - "\\ ", // skipper
            words);

    if (ok) {
        std::cout << "Parsed:\n";
        for (auto& w : words)
            std::cout << "\t'" << w << "'\n";
    } else {
        std::cout << "Parse failed\n";
    }

    if (f != l)
        std::cout << "Remaining unparsed: '" << std::string(f,l) << "'\n";
}

Using Boost Regex

This looks neat, but:

it requires linking to boost_regex
it uses "black magic", namely a negative lookbehind assertion: http://www.regular-expressions.info/lookaround.html

#include <iostream>
#include <boost/regex.hpp>
#include <boost/algorithm/string_regex.hpp>
#include <string>
#include <vector>

int main() {
    std::string const sentence = "My dog Fluffy\\ Cake likes to jump";

    std::vector<std::string> words;
    boost::algorithm::split_regex(words, sentence, boost::regex("(?<!\\\\)\\s"), boost::match_default);

    for (auto& w : words)
        std::cout << " '" << w << "'\n";
}

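With C++11 raw string literals you can write the regex a little less obscurely: boost::regex(R"((?<!\\)\s)"), which means "any whitespace character that is not preceded by a backslash".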
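
For readers who would rather avoid the boost_regex link dependency, here is a rough sketch using std::regex instead. This is a swapped-in technique, not the approach above: the default ECMAScript grammar of std::regex has no lookbehind, so the sketch matches tokens rather than delimiters. The names split_escaped and unescape_spaces are hypothetical helpers made up for illustration. Like the lookbehind split, it leaves the backslash in the token, so a separate unescaping pass is shown as well.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collect tokens instead of splitting on delimiters.
// A token is one or more of: a backslash followed by any character, or a
// single character that is neither whitespace nor a backslash.
// Assumption: a lone trailing backslash is simply dropped.
std::vector<std::string> split_escaped(std::string const& input) {
    static std::regex const token(R"((?:\\.|[^\s\\])+)");
    std::vector<std::string> words;
    for (std::sregex_iterator it(input.begin(), input.end(), token), end; it != end; ++it)
        words.push_back(it->str());
    return words;
}

// Hypothetical helper: remove the backslash from every "\ " pair.
// Backslashes that precede anything other than a space are kept.
std::string unescape_spaces(std::string const& word) {
    std::string out;
    for (std::size_t i = 0; i < word.size(); ++i) {
        if (word[i] == '\\' && i + 1 < word.size() && word[i + 1] == ' ')
            continue; // skip the backslash; the space is copied next iteration
        out += word[i];
    }
    return out;
}
```

Applied to the sentence from the question, split_escaped should yield six tokens, the third still carrying its backslash until unescape_spaces is run over it.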

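Handwritten parser

This is somewhat more tedious, but like the Spirit grammar it is fully generic, and it can give good performance.

However, it will not scale as gracefully as the Spirit approach once you start adding complexity to the grammar. One advantage is that it compiles faster than the Spirit version.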

#include <iostream>
#include <iterator>
#include <string>
#include <vector>

template <typename It, typename Out>
Out tokens(It f, It l, Out out) {
    std::string accum;
    auto flush = [&] { 
        if (!accum.empty()) {
            *out++ = accum;
            accum.resize(0);
        }
    };

    while (f!=l) {
        switch(*f) {
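            case '\\': 
                // An escaped space joins two words: drop the backslash, keep the space.
                if (++f!=l && *f==' ')
                    accum += *f++; // consume the space so it is not treated as a delimiter
                else
                    accum += '\\';
                break;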
            case ' ': case '\t': case '\r': case '\n':
                ++f;
                flush();
                break;
            default:
                accum += *f++;
        }
    }
    flush();
    return out;
}

int main() {
    std::string const sentence = "My dog Fluffy\\ Cake likes to jump";

    std::vector<std::string> words;

    tokens(sentence.begin(), sentence.end(), back_inserter(words));

    for (auto& w : words)
        std::cout << "\t'" << w << "'\n";
}
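
The result listed in the question keeps the backslash in the token (Fluffy\ Cake), whereas the parser above drops it. If the escape should be preserved, a variant along the following lines might do. This is only a sketch under that assumption; split_keep_escapes is a hypothetical name, and it works on std::string directly rather than on iterators to keep the example short.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical variant: split on unescaped whitespace, but copy every
// escape sequence through unchanged, so "Fluffy\ Cake" stays one token
// and keeps its backslash.
std::vector<std::string> split_keep_escapes(std::string const& input) {
    std::vector<std::string> words;
    std::string accum;
    for (std::size_t i = 0; i < input.size(); ++i) {
        char const c = input[i];
        if (c == '\\' && i + 1 < input.size()) {
            accum += c;          // keep the backslash
            accum += input[++i]; // and the escaped character, whatever it is
        } else if (c == ' ' || c == '\t' || c == '\r' || c == '\n') {
            // unescaped whitespace ends the current word, if there is one
            if (!accum.empty()) {
                words.push_back(accum);
                accum.clear();
            }
        } else {
            accum += c;          // ordinary character (or a lone trailing backslash)
        }
    }
    if (!accum.empty())
        words.push_back(accum);
    return words;
}
```

Unlike the parser above, this treats a backslash before any character as an escape, which also protects an escaped tab or newline from being split on.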

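【Discussion】:

I used the Boost regex you provided and it works well. Thank you very much.
@AppleJuice You realize you picked the ugly stepchild, right? :) It is the only one with a link dependency, it requires a waiver on your life insurance, and it still requires you to remove the backslash manually even after parsing :) (Luckily, it doesn't require a virgin sacrifice to compile, like #1, and #3 induces C envy.) Cheers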