Cross-platform iteration of Unicode string (counting Graphemes using ICU): Answers

【Question title】: Cross-platform iteration of Unicode string (counting Graphemes using ICU)
【Posted】: 2011-01-02 16:11:44
【Question description】:

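I want to iterate over each character of a Unicode string, treating each surrogate pair and each combining character sequence as a single unit (one grapheme).

Example

The text "नमस्ते" consists of the code points U+0928, U+092E, U+0938, U+094D, U+0924, U+0947, of which U+094D and U+0947 are combining marks.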

using System;

class Program
{
    static void Main(string[] args)
    {
        const string s = "नमस्ते";

        Console.WriteLine(s.Length); // Outputs "6" (UTF-16 code units)

        var l = 0;
        var e = System.Globalization.StringInfo.GetTextElementEnumerator(s);
        while (e.MoveNext()) l++;
        Console.WriteLine(l); // Outputs "4" (text elements)
    }
}

So .NET has this covered. Win32 also offers CharNextW():

#include <Windows.h>
#include <iostream>
#include <string>

int main()
{
    const wchar_t * s = L"नमस्ते";

    std::cout << std::wstring(s).length() << std::endl; // Gives "6"

    int l = 0;
    while(CharNextW(s) != s)
    {
        s = CharNextW(s);
        ++l;
    }

    std::cout << l << std::endl; // Gives "4"

    return 0;
}

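Question

Both approaches I know of are Microsoft-specific. Is there a portable way to do this?

I have heard of ICU, but I couldn't quickly find anything relevant (UnicodeString(s).length() still gives 6). Pointing to the relevant function or module in ICU would be an acceptable answer.
C++ has no notion of Unicode, so a lightweight cross-platform library for dealing with these issues would also be an acceptable answer.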

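Edit: the correct answer using ICU

@McDowell pointed me to ICU's BreakIterator, and I think ICU can be regarded as the de facto cross-platform standard for handling Unicode. Here is some sample code demonstrating its use (since examples are surprisingly rare):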

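#include <unicode/unistr.h>
#include <unicode/schriter.h>
#include <unicode/brkiter.h>
#include <unicode/locid.h>

#include <iostream>
#include <cassert>
#include <memory>

// Recent ICU releases may not import the icu namespace by default.
using namespace icu;

int main()
{
    // A char16_t literal keeps this portable; wchar_t is not UTF-16 everywhere.
    const UnicodeString str(u"नमस्ते");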

    {
        // StringCharacterIterator doesn't seem to recognize graphemes
        StringCharacterIterator iter(str);
        int count = 0;
        while(iter.hasNext())
        {
            ++count;
            iter.next();
        }
        std::cout << count << std::endl; // Gives "6"
    }

    {
        // BreakIterator works!!
        UErrorCode err = U_ZERO_ERROR;
        std::unique_ptr<BreakIterator> iter(
            BreakIterator::createCharacterInstance(Locale::getDefault(), err));
        assert(U_SUCCESS(err));
        iter->setText(str);

        int count = 0;
        while(iter->next() != BreakIterator::DONE) ++count;
        std::cout << count << std::endl; // Gives "4"
    }

    return 0;
}

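【Question discussion】:

Your title should be: "Cross-platform iteration of a UTF-16 string"
@Chris: The question isn't specific to UTF-16, though, so it's more like "Unicode, non-UTF-32" :)
Surrogate pairs are a UTF-16 artifact. UTF-8 simply encodes the final code point.
This question involves combining marks. This feature allows text to contain composite glyphs in more than one natural way. Text may contain accented characters, and those graphemes can be represented either as an unaccented character followed by a combining mark that adds the accent to the preceding character, or as a single code point for the accented character (if one exists). This is entirely different from surrogates in UTF-16; combining marks apply equally to all Unicode encodings.
ICU's CharacterIterator, I think. Something like StringCharacterIterator.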
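
To make the distinction drawn in the comments above concrete, here is a minimal sketch using only the C++ standard library. The helper name count_code_points_utf16 and the sample strings are invented for illustration and are not part of any library. It suggests that a surrogate pair only affects the UTF-16 code unit count, while a combining mark remains a separate code point in every encoding.

```cpp
#include <cstddef>
#include <string>

// Counts code points in a UTF-16 string by skipping low (trailing)
// surrogates. Assumes the input is well-formed UTF-16.
std::size_t count_code_points_utf16(const std::u16string& s)
{
    std::size_t n = 0;
    for (char16_t u : s)
    {
        const bool low_surrogate = (u >= 0xDC00 && u <= 0xDFFF);
        if (!low_surrogate) ++n;
    }
    return n;
}

// "e" followed by U+0301 COMBINING ACUTE ACCENT:
// one grapheme, but two code points in any encoding.
const std::u16string decomposed16 = u"e\u0301";
const std::u32string decomposed32 = U"e\u0301";

// U+1F600 lies outside the BMP:
// one code point, but two code units in UTF-16.
const std::u16string astral16 = u"\U0001F600";
const std::u32string astral32 = U"\U0001F600";
```

Here astral16.size() is 2 while astral32.size() is 1, so moving to UTF-32 removes the surrogate issue. decomposed16.size() and decomposed32.size() are both 2, so the combining mark still has to be grouped by something that knows about grapheme boundaries.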

Tags: c++ unicode cross-platform icu

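【Solution 1】:

You should be able to use ICU's BreakIterator for this (the character instance, assuming it is functionally equivalent to the Java version).

【Discussion】:

Thanks! That's exactly what I was looking for!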

【Solution 2】:

Glib's ustring class gives you UTF-8 strings, if you can use UTF-8. It is designed to resemble std::string. Since UTF-8 is native on Linux, your task is quite simple:

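#include <glibmm/ustring.h>
#include <iostream>

int main()
{
    // ustring holds UTF-8 in char, so use a narrow literal, not L"..."
    // (assumes the source file is saved as UTF-8).
    Glib::ustring s = "नमस्ते";
    std::cout << s.size() << std::endl; // counts code points, not graphemes
}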

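You can also iterate over the string's characters as usual with Glib::ustring::iterator.

【Discussion】:

This gets rid of the surrogate pair problem, but it doesn't handle combining characters at all.
@aschepler: Could you explain what you mean by combining characters?
en.wikipedia.org/wiki/Combining_character . The sample string here has 6 code points (no surrogate pairs involved). 2 of them are combining characters, so the string has 4 graphemes. @kizzx2 wants to iterate over those graphemes.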
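
The arithmetic in the last comment (6 code points, 2 combining, 4 graphemes) can be sketched roughly as follows. This is a deliberately simplified approximation, not the Unicode segmentation algorithm (UAX #29): the function names is_combining_approx and count_graphemes_approx are made up for this sketch, and the two hard-coded ranges are an assumption that only covers the Combining Diacritical Marks block and the Devanagari dependent signs. Real code should rely on a library such as ICU, and the count a library reports can depend on which Unicode version it implements.

```cpp
#include <cstddef>
#include <string>

// Rough stand-in for the Unicode properties that extend a grapheme cluster.
// ASSUMPTION: only two hand-picked ranges are covered here; a real
// implementation needs the full property tables.
bool is_combining_approx(char32_t c)
{
    return (c >= 0x0300 && c <= 0x036F)   // Combining Diacritical Marks
        || (c >= 0x093E && c <= 0x094F);  // Devanagari vowel signs and virama
}

// Counts clusters over UTF-32 input: a code point starts a new cluster
// unless it is a combining mark attached to a preceding code point.
std::size_t count_graphemes_approx(const std::u32string& s)
{
    std::size_t n = 0;
    for (char32_t c : s)
    {
        // A mark at the very start has nothing to attach to, so it
        // opens a cluster of its own.
        if (n == 0 || !is_combining_approx(c)) ++n;
    }
    return n;
}
```

Called with the sample string written as code points, count_graphemes_approx(U"\u0928\u092E\u0938\u094D\u0924\u0947") returns 4, because U+094D and U+0947 fall inside the second range and attach to the preceding letters.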

【Solution 3】:

ICU's interface is quite dated; Boost.Locale offers a more modern one:

#include <iostream>
#include <string_view>

#include <boost/locale.hpp>

using namespace std::string_view_literals;

int main()
{
    boost::locale::generator gen;
    auto string = "noël ??"sv;
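    // csegment_index iterates over const char*, so pass raw pointers;
    // string_view iterators are not guaranteed to be pointers.
    boost::locale::boundary::csegment_index map{
        boost::locale::boundary::character, string.data(),
        string.data() + string.size(), gen("")};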
    for (const auto& i : map)
    {
        std::cout << i << '\n';
    }
}

The text is from here

【Discussion】: