Boost locale normalize strips characters but no accents

【Question title】: Boost locale normalize strips characters but no accents
【Posted】: 2020-10-12 12:44:01
【Question description】:

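I am trying to strip accents from a string using the Boost.Locale library.

The normalize function removes the whole accented character; I only want to remove the accent.

For example: è -> e

Here is my code: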

std::string hello(u8"élève");
boost::locale::generator gen;
std::string str = boost::locale::normalize(hello,boost::locale::norm_nfd,gen(""));

Expected output: eleve

My output: lve

Please help.

【Comments】:

Tags: c++ c++11 boost boost-locale

【Solution 1】:

That is not what normalize does. With nfd, it performs "canonical decomposition". You need to THEN remove the combining-character code points.
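To make the "then remove the combining code points" step concrete, here is a minimal sketch with no Boost dependency. It assumes the input is valid UTF-8 that has already been brought into NFD by some other means, and it treats only the Combining Diacritical Marks block (U+0300 to U+036F) as accents. The names `is_combining_mark` and `strip_combining_marks` are hypothetical helpers, not part of Boost.Locale.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Assumption: only the Combining Diacritical Marks block counts as an "accent".
static bool is_combining_mark(std::uint32_t cp) {
    return cp >= 0x0300 && cp <= 0x036F;
}

// Hypothetical helper: drops U+0300..U+036F from a UTF-8 string that is
// assumed to be valid and already in NFD. Everything else is copied unchanged.
static std::string strip_combining_marks(std::string const& nfd) {
    std::string out;
    for (std::size_t i = 0; i < nfd.size();) {
        unsigned char lead = static_cast<unsigned char>(nfd[i]);
        // Sequence length from the lead byte; stray continuation bytes count as 1.
        std::size_t len = lead < 0xC0 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
        if (i + len > nfd.size()) len = nfd.size() - i; // truncated tail: copy as-is

        // Every code point in the block is a two-byte sequence, so only those
        // need to be decoded.
        bool skip = false;
        if (len == 2) {
            unsigned char next = static_cast<unsigned char>(nfd[i + 1]);
            std::uint32_t cp = ((lead & 0x1Fu) << 6) | (next & 0x3Fu);
            skip = is_combining_mark(cp);
        }
        if (!skip) out.append(nfd, i, len);
        i += len;
    }
    return out;
}
```

Called on the decomposed bytes `"e\xcc\x81l" "e\xcc\x80ve"` this returns `"eleve"`. Called on the precomposed `"\xc3\xa9"` it returns the input unchanged, which is why the decomposition has to come first.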

Update: added a loose implementation gleaned from the UTF-8 tables; most combining characters seem to lead with 0xcc or 0xcd:

Live On Wandbox

// also liable to strip some greek characters that lead with 0xcd
template <typename Str>
static Str try_strip_diacritics(
    Str const& input,
    std::locale const& loc = std::locale())
{
    using Ch = typename Str::value_type;
    using T = boost::locale::utf::utf_traits<Ch>;

    auto tmp = boost::locale::normalize(
                input, boost::locale::norm_nfd, loc);

    auto f = tmp.begin(), l = tmp.end(), out = f;

    while (f!=l) {
        switch(*f) {
            case '\xcc':
            case '\xcd': // TODO find more
                T::decode(f, l);
                break; // skip
            default:
                out = T::encode(T::decode(f, l), out);
                break;
        }
    }
    tmp.erase(out, l);
    return tmp;
}

Prints (on my box!):

Before: "élève"  0xc3 0xa9 0x6c 0xc3 0xa8 0x76 0x65
all-in-one: "eleve"  0x65 0x6c 0x65 0x76 0x65
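The caveat about Greek characters in the code comment can be narrowed by also looking at the byte after the lead byte. In UTF-8, U+0300 to U+033F encode as 0xCC 0x80..0xBF and U+0340 to U+036F as 0xCD 0x80..0xAF, while 0xCD 0xB0..0xBF are U+0370 to U+037F in the Greek and Coptic block. The predicate below is a sketch under the assumption that only the Combining Diacritical Marks block should be stripped; `is_combining_diacritic` is a hypothetical name, not a Boost function.

```cpp
// Hypothetical predicate: true if the two-byte UTF-8 sequence (lead, next)
// encodes a code point in U+0300..U+036F (Combining Diacritical Marks).
static bool is_combining_diacritic(unsigned char lead, unsigned char next) {
    // 0xCC 0x80..0xBF covers U+0300..U+033F.
    if (lead == 0xCC) return next >= 0x80 && next <= 0xBF;
    // 0xCD 0x80..0xAF covers U+0340..U+036F; 0xCD 0xB0..0xBF is
    // U+0370..U+037F (Greek and Coptic) and is deliberately rejected.
    if (lead == 0xCD) return next >= 0x80 && next <= 0xAF;
    return false;
}
```

A check like this could replace the lead-byte-only `switch` when the two bytes are available, at the cost of peeking one byte ahead before deciding whether to skip.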

Old answer text/analysis:

#include <boost/locale.hpp>
#include <cstdint>   // uint8_t
#include <iomanip>
#include <iostream>
#include <string>

static void dump(std::string const& s) {
    std::cout << std::hex << std::showbase << std::setfill('0');
    for (uint8_t ch : s)
        std::cout << " " << std::setw(4) << int(ch);
    std::cout << std::endl;
}

int main() {
    boost::locale::generator gen;

    std::string const pupil(u8"élève");

    std::string const str =
        boost::locale::normalize(
            pupil,
            boost::locale::norm_nfd,
            gen(""));

    std::cout << "Before: "; dump(pupil);
    std::cout << "After:  "; dump(str);
}

Prints, on my box:

Before:  0xc3 0xa9 0x6c 0xc3 0xa8 0x76 0x65
After:   0x65 0xcc 0x81 0x6c 0x65 0xcc 0x80 0x76 0x65

However, on Coliru it makes no difference. This suggests it depends on the available/system locales.

The documentation says: https://www.boost.org/doc/libs/1_72_0/libs/locale/doc/html/conversions.html#conversions_normalization

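Unicode normalization is the process of converting strings to a standard form, suitable for text processing and comparison. For example, the character "ü" can be represented by a single code point or by a combination of the character "u" and the diaeresis "¨". Normalization is an important part of Unicode text processing.

Unicode defines four normalization forms. Each specific form is selected by a flag passed to the normalize function:

NFD - Canonical decomposition - boost::locale::norm_nfd

NFC - Canonical decomposition followed by canonical composition - boost::locale::norm_nfc or boost::locale::norm_default

NFKD - Compatibility decomposition - boost::locale::norm_nfkd

NFKC - Compatibility decomposition followed by canonical composition - boost::locale::norm_nfkc

For more details on normalization forms, read [this article][1].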
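The "ü" example from the quoted documentation can be spelled out at the byte level. The sketch below hard-codes both UTF-8 encodings, so it makes no Boost calls; the names `u_composed`, `u_decomposed` and `same_bytes` are hypothetical and exist only for illustration.

```cpp
#include <string>

// "ü" as one precomposed code point, U+00FC (what NFC would produce).
static std::string const u_composed = "\xc3\xbc";
// "ü" as "u" followed by U+0308 COMBINING DIAERESIS (what NFD would produce).
static std::string const u_decomposed = "u\xcc\x88";

// Byte-wise comparison, which is all std::string::operator== does.
static bool same_bytes(std::string const& a, std::string const& b) {
    return a == b;
}
```

Here `same_bytes(u_composed, u_decomposed)` is false even though both strings render as the same character, which is why text is usually normalized to one form before it is compared or filtered.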

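What you can do

It seems you can get somewhere by doing only the decomposition (so NFD) and then removing any code points that are not alpha.

~~This is cheating, because it assumes every code point is a single code unit, which is not true in general, but it does work for the sample:~~

See the improved version above, which iterates over code points instead of bytes.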

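【Discussion】:

I just found a more accurate way to approximate this behavior, using the decomposed normal form and the UTF traits from the locale library, and added
Added an improved version that iterates over code points instead of bytes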