Answers to: C++ "stock ticker" with UTF-8 strings

【Question title】: C++ "stock ticker" with UTF-8 strings
【Posted】: 2020-07-25 01:29:12
【Question】:

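Part of a project includes something similar to a scrolling "stock ticker", where a longer string "scrolls" across a fixed-width output string.

Using C++11 on Linux, the concept is clear when working with Latin characters. Something like this: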

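std::string inputString = "HELLO";
std::string outputString(mOutputTextWidth, ' ');  // start with a blank window

// Run past the end of the input so the text scrolls out to the left.
for (std::size_t inIdx = 0; inIdx < inputString.size() + mOutputTextWidth; inIdx++)
{
    // shift output one character left
    for (std::size_t i = 0; i + 1 < mOutputTextWidth; i++)
        outputString[i] = outputString[i + 1];

    // append the next input character (or padding) to the end of the output
    outputString[mOutputTextWidth - 1] =
        (inIdx < inputString.size()) ? inputString.at(inIdx) : ' ';
    sleep(1);
}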

You would get something like:

[           ]
[          H]
[         HE]
[        HEL]
[       HELL]
[      HELLO]
[     HELLO ]
[    HELLO  ]
[   HELLO   ]

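I need to make this work for non-Latin characters encoded in UTF-8. From what I have read, this is a complex subject. In particular, std::string::at and operator[] return a single char (one byte), which breaks on multi-byte UTF-8 characters.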
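To make the problem concrete, here is a minimal sketch of how the first byte of a UTF-8 sequence tells you how many bytes the character occupies. The helper names `utf8SequenceLength` and `firstSymbol` are hypothetical and not part of the question. The sketch assumes mostly well-formed input and does not reject overlong or otherwise invalid sequences.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: number of bytes in the UTF-8 sequence that starts
// with lead byte `c`. Returns 0 for a continuation byte (10xxxxxx) and for
// the bytes 0xF8..0xFF, none of which can start a sequence.
std::size_t utf8SequenceLength(unsigned char c)
{
    if (c < 0x80) return 1;            // 0xxxxxxx: ASCII
    if ((c & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((c & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((c & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;
}

// Hypothetical helper: the bytes that make up the first character of `s`.
std::string firstSymbol(const std::string &s)
{
    if (s.empty())
        return std::string();
    std::size_t n = utf8SequenceLength(static_cast<unsigned char>(s[0]));
    if (n == 0)
        n = 1;                         // malformed input: fall back to one byte
    return s.substr(0, n);             // substr clamps n to the string length
}
```

For example, "こ" (U+3053) is stored as the three bytes E3 81 93, so `size()` reports 3 and `at(0)` returns only the lead byte 0xE3, whereas `firstSymbol` returns all three bytes.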

What is the correct way to do this in C++?

For example, in Japanese:

[              ]
[            こ]
[          こん]
[        こんば]
[      こんばん]
[    こんばんは]
[  こんばんは  ]
[ こんばんは   ]

(I know that glyph widths vary by language, and that is fine. I just don't know how to manipulate UTF-8 strings.)

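【Question comments】:

I recently posted an answer to a similar question here. It may help in understanding how UTF-8 is represented in memory.
UTF-8 support in standard C++ is sketchy. The best course of action depends heavily on your platform and toolset. If you want portable code, you will probably want to use a third-party library.
Also, if you want even minimal Unicode support, you have no choice but to use a third-party library. C++ has no tools to determine the screen width of a string, or to check whether a given character is regular, zero-width, double-width or combining.
n. 'pronouns' m: Do you have any recommendations for a third-party library?

Tags: c++ utf-8 stdstring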

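【Solution 1】:

After being warned against wchar several times, I implemented a solution based on a comment from rustyx that referenced this post. This approach may have holes, but so far it has worked for me when tested with English/Latin and Japanese input.

(I believe the code below works only for UTF-8; I am not sure about legacy encodings such as EUC-JP, SHIFT_JIS, etc.)

Note that symbolLength() returns the number of code points present, which is not the same as the screen width, because code points can have different widths (or zero width!).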
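The gap between code points and screen columns can be shown with a short sketch. It uses a different technique from the skip counter in the answer's code: it counts every byte that is not a continuation byte. The name `countCodePoints` is hypothetical, and the input is assumed to be valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a UTF-8 string by counting
// every byte that is not a continuation byte (10xxxxxx).
// Assumption: `s` is valid UTF-8; malformed input is not detected.
std::size_t countCodePoints(const std::string &s)
{
    std::size_t count = 0;
    for (unsigned char c : s)
    {
        if ((c & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

With this helper, "e" followed by U+0301 (combining acute accent, bytes CC 81) counts as two code points even though a terminal typically draws it in one column, so a code point count alone cannot be used to size the ticker window.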

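// Note: the class declaration is not shown in this answer. The calls to
// assign(), size(), at(), append() and data() suggest that TqString derives
// publicly from std::string.
TqString::TqString(const std::string &s) { assign(s); }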

TqString::TqString(const char *cs)
{
    std::string s(cs);
    assign(s);
}

TqString::TqString(size_t n, char c)
{
    std::string s(n, c);
    assign(s);
}

TqString &TqString::operator=(const std::string &s)
{
    assign(s);
    return *this;
}

// Unlike size(), this returns the number of UTF-8 code points
// in the input string

size_t
TqString::symbolLength() const
{
    size_t symCount = 0;
    int skipCount = 0;

    for (size_t i = 0; i < size(); i++)
    {
        unsigned char c = at(i);
        if (skipCount == 0)
        {
            if (c >= 0xF0)
                skipCount = 3;
            else if (c >= 0xE0)
                skipCount = 2;
            else if (c >= 0xC0)
                skipCount = 1;
        }
        else
        {
            --skipCount;
        }

        if (skipCount > 0)
            continue;

        symCount++;
    }
    return symCount;
}

// Scan the input string and return the symbol (code point) at index 'n' as a
// byte string. If 'n' is past the end, the last symbol is returned.

std::string
TqString::symbolAt(off_t n) const
{
    std::string outString;
    int skipCount = 0;
    off_t symCount = 0;

    for (size_t i = 0; i < size(); i++)
    {
        unsigned char c = at(i);
        if (skipCount == 0)
        {
            outString = c;
            if (c >= 0xF0)
                skipCount = 3;
            else if (c >= 0xE0)
                skipCount = 2;
            else if (c >= 0xC0)
                skipCount = 1;
        }
        else
        {
            outString += c;
            --skipCount;
        }

        if (skipCount > 0)
            continue;


        if (symCount == n)
            break;

        symCount++;
    }

    return outString;
}

void
TqString::shiftLeft()
{
    // Nothing to shift in an empty string.
    if (empty())
        return;

    // Drop the first symbol by erasing exactly the bytes it occupies,
    // instead of rebuilding the string one symbol at a time.
    erase(0, symbolAt(0).size());
}

// shift then append 's' to the end
void
TqString::shiftLeft(const TqString &s)
{
    shiftLeft();
    append(s);
}

std::string
TqString::str() const
{
    // Construct from pointer and length so embedded NUL bytes are preserved.
    std::string ret(data(), size());
    return ret;
}
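As a complement to the class above, the whole ticker can also be sketched without a custom string type by first splitting the text into symbols. This is only an illustration under stated assumptions: the helpers `splitSymbols` and `tickerFrames` are hypothetical, the input is assumed to be valid UTF-8, and "symbol" means one code point, so combining sequences and double-width glyphs are not handled.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: splits UTF-8 text into one std::string per code point.
std::vector<std::string> splitSymbols(const std::string &s)
{
    std::vector<std::string> symbols;
    for (unsigned char c : s)
    {
        // Any byte that is not a continuation byte starts a new symbol.
        if ((c & 0xC0) != 0x80 || symbols.empty())
            symbols.emplace_back();
        symbols.back() += static_cast<char>(c);
    }
    return symbols;
}

// Hypothetical helper: builds every frame of a ticker that is `width`
// symbols wide, padding with `pad` (a single symbol such as " ").
std::vector<std::string> tickerFrames(const std::string &text,
                                      std::size_t width,
                                      const std::string &pad)
{
    std::vector<std::string> frames;
    if (width == 0)
        return frames;

    const std::vector<std::string> symbols = splitSymbols(text);
    std::vector<std::string> window(width, pad);

    // One step per input symbol, then `width` more steps to scroll it out.
    for (std::size_t step = 0; step < symbols.size() + width; ++step)
    {
        window.erase(window.begin());                            // shift left
        window.push_back(step < symbols.size() ? symbols[step] : pad);

        std::string frame;
        for (const std::string &sym : window)
            frame += sym;
        frames.push_back(frame);
    }
    return frames;
}
```

Calling `tickerFrames("HI", 3, " ")` yields the five frames "  H", " HI", "HI ", "I  " and "   ". For Japanese text, passing the ideographic space U+3000 (bytes E3 80 80) as `pad` keeps the columns aligned in a monospaced CJK font.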

【Comments】:

【Solution 2】:

On systems with native Unicode support (including Linux)¹, you can simply use the standard C++ multibyte support and the wchar_t type to process one Unicode code point at a time.

For example:

#include <algorithm>
#include <clocale>
#include <cstdlib>
#include <iostream>
#include <locale>
#include <string>
#include <vector>

int main()
{
    std::string inputUTF8 = "こんばんは！"; // assuming this source is stored in UTF-8

    std::setlocale(LC_ALL, "en_US.utf8"); // tell mbstowcs we want UTF-8->wchar_t conversion
    std::wcout.imbue(std::locale("en_US.utf8")); // tell std::wcout we want wchar_t->UTF-8 output

    std::vector<wchar_t> buf(inputUTF8.size() + 1); // reserve space
    int len = (int)std::mbstowcs(buf.data(), inputUTF8.c_str(), buf.size()); // convert to wchar_t
    if (len == -1) {
        std::cerr << "Invalid UTF-8 input\n"; // mbstowcs can fail
        return 1;
    }
    std::wstring out;
    for (int i = 0; i < len * 2; i++)
    {
        out.assign(std::max(0, len - i), L'　'); // fill with ideographic space (U+3000) before

        out.append(buf.data(), std::max(0, i - len), std::min(len, i) - std::max(0, i - len));

        out.append(std::max(0, i - len), L'　'); // fill with ideographic space after

        std::wcout << L"[" << out << L"]\n";
    }
}

Output:

[　　　　　　]
[　　　　　こ]
[　　　　こん]
[　　　こんば]
[　　こんばん]
[　こんばんは]
[こんばんは！]
[んばんは！　]
[ばんは！　　]
[んは！　　　]
[は！　　　　]
[！　　　　　]

Note that mbstowcs and the other locale facilities are not thread-safe.
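If the locale dependency is a concern, one way to avoid it is to decode UTF-8 by hand into `std::u32string`. The following is a sketch of that alternative, not what the program above does. The function name `decodeUtf8` is hypothetical, and as a simplification it does not reject overlong encodings or surrogate code points.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decodes UTF-8 into UTF-32 code points without using
// the C or C++ locale. Returns false on a bad lead byte, a truncated
// sequence, or a wrong continuation byte.
bool decodeUtf8(const std::string &in, std::u32string &out)
{
    out.clear();
    std::size_t i = 0;
    while (i < in.size())
    {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::size_t extra = 0;   // number of continuation bytes expected
        char32_t cp = 0;

        if (lead < 0x80)                { extra = 0; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; }
        else return false;       // stray continuation byte or invalid lead

        if (in.size() - i <= extra)
            return false;        // sequence is cut off at the end of input

        for (std::size_t k = 1; k <= extra; ++k)
        {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                return false;    // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return true;
}
```

Because the function touches no global state, it can be called from several threads at once. Each element of the result is one code point, so the scrolling loop above could index it the same way it indexes the wchar_t buffer.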

Another possibility is to use a library such as iconv.

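¹ Unfortunately, Windows Unicode support is crippled; its wchar_t is 16 bits long and actually represents UTF-16, so this program works only for basic plane code points (which still include typical CJK symbols, but not the CJK Unified Ideographs extensions or other symbols above U+FFFF). This can still be solved by taking UTF-16 into account.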
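As a rough illustration of what "taking UTF-16 into account" can mean, the sketch below counts code points in a UTF-16 string by treating a surrogate pair as a single symbol. It uses `std::u16string` rather than wchar_t so that it behaves the same on every platform, and the name `countCodePointsUtf16` is hypothetical.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a UTF-16 string. A high
// surrogate (0xD800..0xDBFF) followed by a low surrogate (0xDC00..0xDFFF)
// encodes one code point above U+FFFF.
std::size_t countCodePointsUtf16(const std::u16string &s)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i)
    {
        const bool high = s[i] >= 0xD800 && s[i] <= 0xDBFF;
        const bool nextLow = i + 1 < s.size() &&
                             s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF;
        if (high && nextLow)
            ++i;          // skip the low half of the pair
        ++count;          // unpaired surrogates are counted as one unit
    }
    return count;
}
```

A scrolling loop on Windows would need the same pairing logic when it shifts or appends symbols, so that a pair is never split across the window edge.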

【Comments】: