How to add a custom delimiter when reading a string word by word: answers

【Question title】: How to add a custom delimiter when reading a string word by word
【Posted】: 2021-11-07 19:08:03
【Question description】:

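I am reading a file word by word, but sometimes a long dash (an em-dash) sits between two words, and I would like to treat it as an additional delimiter (besides the standard whitespace).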

ifstream file;
file.open("example.txt");
string word;

while (file >> word)
{
    cout << word << endl;
}

For example, the phrase "He was young—perhaps from twenty-eight to thirty—tall slender" prints the following words:

He
was
young—perhaps
from
twenty-eight
to
thirty—tall
slender

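"twenty-eight" is fine, but "young" and "perhaps" (and likewise "thirty" and "tall") are two different words, and I would like them to be read that way.

How can I add the custom delimiter "—"?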

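【Question comments】:

I recommend reading the text line by line with std::readline and parsing it with a proper parsing technique. >> is not one.
@n.1.8e9-where's-my-sharem. Do you mean std::getline? I cannot specify more than one delimiter with that function.
You don't need to. You read a line of text. Lines are delimited by newlines. Then you look for the delimiters within those lines.
@n.1.8e9-where's-my-sharem. Could you post an answer showing exactly how this is done? I can't seem to figure out how it works.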
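The line-by-line approach suggested in these comments can be sketched roughly as follows. This is only an illustration, not code from the thread: the helper name `splitLine` and the constant `kEmDash` are made up here, and the input is assumed to be UTF-8, where the em-dash is the three bytes e2 80 94.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// UTF-8 encoding of the em-dash (U+2014). Assumption: the input is UTF-8.
const std::string kEmDash = "\xe2\x80\x94";

// Hypothetical helper: splits one line on whitespace and on the em-dash.
std::vector<std::string> splitLine(const std::string& line)
{
    std::vector<std::string> words;
    std::string current;
    std::size_t i = 0;
    while (i < line.size()) {
        // Length of the delimiter starting at position i, or 0 if there is none.
        std::size_t step = 0;
        if (std::isspace(static_cast<unsigned char>(line[i]))) {
            step = 1;
        } else if (line.compare(i, kEmDash.size(), kEmDash) == 0) {
            step = kEmDash.size();
        }

        if (step != 0) {
            // A delimiter ends the current word (if any) and is skipped.
            if (!current.empty()) {
                words.push_back(current);
                current.clear();
            }
            i += step;
        } else {
            current += line[i];
            ++i;
        }
    }
    if (!current.empty()) {
        words.push_back(current);
    }
    return words;
}
```

A caller would read each line with `std::getline(file, line)` and print the elements of `splitLine(line)`. The ordinary hyphen is not treated as a delimiter here, so "twenty-eight" stays in one piece.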

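Tags: c++ string file

【Solution 1】:

Yes, this can be done.

If you want a single character (such as the ordinary dash '-') to be treated as whitespace, I would use the ctype facet. This facet specifies how the locale classifies characters. In this case, we can tell the facet that '-' is a kind of whitespace.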

#include <locale>
#include <fstream>
#include <iostream>
#include <string>
#include <sstream>

// This is my facet:
// It adds the '-' character to the set of characters treated like a space.
class DashSepFacet: public std::ctype<char>
{
    public:
        typedef std::ctype<char>   base;
        typedef base::char_type    char_type;

    DashSepFacet(std::locale const& l) : base(table)
    {
        // Get the ctype facet of the current locale
        std::ctype<char> const&  defaultCType = std::use_facet<std::ctype<char> >(l);

        // Copy the default flags for each character from the current facet
        static char data[256];
        for(int loop = 0; loop < 256; ++loop) {data[loop] = loop;}
        defaultCType.is(data, data+256, table);

        // Add the '-' as a space
        table['-'] |= base::space;
    }
    private:
        base::mask table[256];
};

int main()
{
    // Create a stream (Create the locale) then imbue the stream.
    std::fstream data;
    data.imbue(std::locale(data.getloc(), new DashSepFacet(data.getloc())));
    data.open("X3");

    // Now you can use the stream like normal; your locale defines what
    // is whitespace, so the operator `>>` will split on dash.
    std::string   word;
    while(data >> word)
    {
        std::cout << "Word(" << word << ")\n";
    }
}

Now we get:

> ./a.out
Word(He)
Word(was)
Word(young—perhaps)
Word(from)
Word(twenty)
Word(eight)
Word(to)
Word(thirty—tall)
Word(slender)

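Unfortunately, the em-dash is a Unicode code point (U+2014) that UTF-8 represents as three bytes (three char values), so the technique above does not work: the ctype table classifies one char at a time. Instead, you can use the codecvt facet, which tells the locale how to convert character sequences (it is normally used to convert between encodings). In this case, we write a version that converts the em-dash into a literal space character.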
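The three-byte claim is easy to check. The sketch below is not part of the original answer; `hexBytes` is a hypothetical helper that renders each char of a string as hex. The em-dash is written with escape sequences so the result does not depend on the source file's encoding.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: renders each byte of a string as two hex digits,
// separated by spaces.
std::string hexBytes(const std::string& s)
{
    std::string out;
    char buf[4];
    for (unsigned char c : s) {
        std::snprintf(buf, sizeof buf, "%02x", static_cast<unsigned>(c));
        if (!out.empty()) {
            out += ' ';
        }
        out += buf;
    }
    return out;
}
```

For the UTF-8 em-dash, `hexBytes("\xe2\x80\x94")` gives "e2 80 94", while `hexBytes("-")` gives the single byte "2d". A `std::ctype<char>` table has one entry per byte value, so it can mark '-' as whitespace but has no way to mark a three-byte sequence. The answer's codecvt-based program follows.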

#include <locale>
#include <fstream>
#include <iostream>
#include <string>

class PunctRemove: public std::codecvt<char,char,std::char_traits<char>::state_type>
{
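    bool do_always_noconv() const noexcept override { return false; }
    // 0 means variable width: most bytes map 1:1, but the 3-byte em-dash maps to 1 char.
    int do_encoding()       const noexcept override { return 0; }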

    typedef std::codecvt<char,char,std::char_traits<char>::state_type> MyType;
    typedef MyType::state_type          state_type;
    typedef MyType::result              result;

    virtual result  do_in(state_type& s,
            const char* from,const char* from_end,const char*& from_next,
            char* to,        char* to_limit,      char*& to_next  ) const
    {
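        // The UTF-8 encoding of the em-dash (U+2014) is the byte sequence
        // e2  80  94
        // Number of em-dash bytes matched so far. Because it is static, it is
        // shared by every stream using this facet and is not thread-safe.
        static int emdashpos = 0;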

        /*
         * This function is used to filter the input
         */
        for(from_next = from, to_next = to;from_next != from_end;++from_next)
        {
            // Match the em-dash one byte at a time, because the multi-byte
            // sequence may be split across buffer boundaries.
            if (emdashpos == 0 && *from_next == '\xe2') {
                ++emdashpos;
                continue;
            }
            else if (emdashpos == 1 && *from_next == '\x80') {
                ++emdashpos;
                continue;
            }
            else if (emdashpos == 2 && *from_next == '\x94') {
                *to_next = ' ';
                ++to_next;
                emdashpos=0;
                continue;
            }
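            // --- A partial match failed: back up and copy the held-back bytes as-is.
            // Caveat: backing up is only valid if the partial match began in the
            // current buffer; if it began in an earlier call, this steps before `from`.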
            if (emdashpos != 0) {
                from_next -= emdashpos;
                emdashpos = 0;
            }

            // Normal processing.
            *to_next = *from_next;
            ++to_next;
        }
        return ok;
    }

    /*
     * This function is used to filter the output
     */
    virtual result do_out(state_type& state,
            const char* from, const char* from_end, const char*& from_next,
            char* to,         char* to_limit,       char*& to_next  ) const
    { /* Write if you need it */ return ok;}
};


int main()
{
    // Create a stream (Create the locale) then imbue the stream.
    std::ifstream data;
    data.imbue(std::locale(data.getloc(), new PunctRemove()));
    data.open("X3");

    // Now you can use the stream like normal; your locale is replacing the em-dash
    // with a normal space.
    std::string   word;
    while(data >> word)
    {
        std::cout << "Word(" << word << ")\n";
    }
}

Now we get:

> ./a.out
Word(He)
Word(was)
Word(young)
Word(perhaps)
Word(from)
Word(twenty-eight)
Word(to)
Word(thirty)
Word(tall)
Word(slender)

【Comments】:

Magic numbers like \xe2 are not a good idea, and what about other punctuation marks? The code will get very unwieldy very quickly.
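One possible way to address this concern, sketched under assumptions rather than taken from the thread: give the byte sequences names and keep the delimiters in a list, so supporting another punctuation mark means adding one entry. The helper names and the choice of extra delimiters (en-dash, ellipsis) are illustrative only, and the text is assumed to be UTF-8.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Named UTF-8 byte sequences instead of bare magic numbers.
// Which marks count as delimiters is an assumption made for illustration.
const std::string kEmDash   = "\xe2\x80\x94";  // U+2014
const std::string kEnDash   = "\xe2\x80\x93";  // U+2013
const std::string kEllipsis = "\xe2\x80\xa6";  // U+2026

// Hypothetical helper: replaces every occurrence of each delimiter with a space.
std::string delimitersToSpaces(std::string text,
                               const std::vector<std::string>& delimiters)
{
    for (const std::string& d : delimiters) {
        if (d.empty()) {
            continue;  // an empty delimiter would match everywhere
        }
        for (std::size_t pos = text.find(d); pos != std::string::npos;
             pos = text.find(d, pos + 1)) {
            text.replace(pos, d.size(), " ");
        }
    }
    return text;
}

// Hypothetical helper: splits on whitespace plus the given delimiters,
// reusing operator>> for the whitespace handling.
std::vector<std::string> splitWords(const std::string& text,
                                    const std::vector<std::string>& delimiters)
{
    std::istringstream in(delimitersToSpaces(text, delimiters));
    std::vector<std::string> words;
    for (std::string word; in >> word;) {
        words.push_back(word);
    }
    return words;
}
```

Calling `splitWords(line, {kEmDash, kEnDash, kEllipsis})` on each line read with `std::getline` keeps the delimiter list in one place. The trade-off compared with the codecvt facet is that the splitting happens after reading rather than inside the stream.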