How to detect "â€‹" (a combination of Unicode characters) in a C++ string: answers

【Question title】: how to detect "â€‹" (combination of unicode) in c++ string
【Posted】: 2018-12-15 01:58:15
【Question】:

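I am trying to detect certain combinations of Unicode characters (such as â€‹) in order to clean up strings. A single Unicode character is detected, but a combination of them is not.

The strings are used to build an HTML page from another HTML page, and they need cleaning. I only want to clean strings made up entirely of such characters, which are not even visible when the HTML page is shown in a browser.

Here is the sample code: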

void detect_Unicode(string& str) {
    if (!str.empty() && str.find_first_not_of(" \t\n\r\f\v\u00A0\u00C2\u00E2\u20AC\u2039") == string::npos)
        str.assign(" ");
}

Input strings:

1. " â€‹    â€‹ " ;
2. "are Â Â there is something Â Â Â â€‹ combination    â€‹"  
3. " Â Â "   
4. "â€‹  Â Â â€‹" 
5 . "Â Â â â"

Expected output:

1. " "  
2. "are Â Â there is something Â Â Â â€‹ combination    â€‹"   
3. " "  
4. " "  
5. " "

Please suggest another approach.

【Question comments】:

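Use std::wstring if you can.
std::string does not contain Unicode characters but "encoded" bytes (probably UTF-8). So for multi-byte characters you have to use std::search instead of find_first_not_of.
@PaulSanders: wchar_t is not guaranteed to be 2 bytes, and even when it is, a Unicode character may need more than one wchar_t.
@Jarod42 Can you explain how I can use std::search with string?
@Jarod42 "wchar_t is not guaranteed to be 2": I don't think I ever claimed it was.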
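
The std::search suggestion in the comments above could be sketched roughly as follows. This is only an illustration under stated assumptions: it treats the std::string as raw bytes and looks for one complete encoded sequence, rather than for any single byte out of a set as find_first_not_of does. The helper names contains_bytes and remove_bytes are hypothetical, and the choice of U+200B (zero width space, UTF-8 bytes E2 80 8B, which render as "â€‹" when misread as Windows-1252) is a guess at what the input contains.

```cpp
#include <algorithm>
#include <string>

// Returns true if the byte sequence `needle` occurs anywhere in `haystack`.
// Both strings are treated as raw bytes; nothing is decoded.
bool contains_bytes(const std::string& haystack, const std::string& needle)
{
    return std::search(haystack.begin(), haystack.end(),
                       needle.begin(), needle.end()) != haystack.end();
}

// Returns a copy of `text` with every occurrence of `needle` removed.
// An empty needle leaves the text unchanged.
std::string remove_bytes(std::string text, const std::string& needle)
{
    if (needle.empty())
        return text;
    auto it = text.begin();
    while ((it = std::search(it, text.end(),
                             needle.begin(), needle.end())) != text.end())
        it = text.erase(it, it + needle.size());  // erase returns the next valid position
    return text;
}

// Assumption: the invisible character is U+200B, stored as UTF-8 bytes E2 80 8B.
const std::string kZeroWidthSpace = "\xE2\x80\x8B";
```

Calling remove_bytes(input, kZeroWidthSpace) before the whitespace check would strip the invisible sequence as a unit. If the input holds a different byte sequence, only the constant needs to change.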

Tags: html c++ string unicode character-encoding

【Solution 1】:

OK, going by the comments above, I am assuming the input string is most likely UTF-8 (after all, in an HTML context, what else would it be?).

On that basis, I humbly submit:

#include <string>
#include <codecvt>
#include <locale>

std::string narrow (const std::wstring& ws)
{
    std::wstring_convert <std::codecvt_utf8 <wchar_t>, wchar_t> convert;
    return convert.to_bytes (ws);
}

std::wstring widen (const std::string& s)
{
    std::wstring_convert <std::codecvt_utf8 <wchar_t>, wchar_t> convert;
    return convert.from_bytes (s);
}

std::string detect_Unicode (const std::string& s)
{ 
    std::wstring ws = widen (s);
    if (ws.empty() || ws.find_first_not_of (L" \t\n\r\f\v\u00A0\u00C2\u00E2\u20AC\u2039") != std::wstring::npos)
        return " ";
    return s;
}

#include <iostream>

int main ()
{
    std::cout << narrow (L"\u00A0 \u00C2 \u00E2 \u20AC \u2039\n\n");
    std::cout << "0.\t\"" << detect_Unicode (u8"abcde") << "\"\n";
    std::cout << "1.\t\"" << detect_Unicode (u8" â€‹    â€‹ ") << "\"\n";
    std::cout << "2.\t\"" << detect_Unicode (u8"are Â Â there is something Â Â Â â€‹ combination    â€‹") << "\"\n";
    std::cout << "3.\t\"" << detect_Unicode (u8" Â Â ") << "\"\n";
    std::cout << "4.\t\"" << detect_Unicode (u8"â€‹  Â Â â€‹") << "\"\n";
    std::cout << "5.\t\"" << detect_Unicode (u8"Â Â â â") << "\"\n";
}

Output:

  Â â € ‹

0.  " "
1.  " â€‹    â€‹ "
2.  " "
3.  " Â Â "
4.  "â€‹  Â Â â€‹"
5.  "Â Â â â"

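Now, this is not the output the OP expected, but I think that is only because the logic of detect_Unicode() (as opposed to its implementation) looks flawed. The point here is that converting the input string to a wide string means you can reliably use the standard basic_string operations on it, because the multi-byte problem no longer exists.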

Another, somewhat more aggressive implementation of detect_Unicode() might be:

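std::string detect_Unicode (const std::string& s)
{
    std::wstring ws = widen (s);    // widen() as defined above
    for (auto wide_char : ws)
    {
        if (wide_char > 0xff)       // any code point above U+00FF
            return " ";
    }
    return s;
}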

But seriously, now that you have a wide string to work with in detect_Unicode, anything is possible, so go wild, OP.
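
As one example of what becomes possible once each element is a whole character, here is a minimal sketch of the condition from the question (non-empty and consisting only of the listed characters) applied to a std::u32string. It is not the answer's code: the names kDebris and collapse_if_only_debris are hypothetical, and the character set is copied from the question on the assumption that it matches the real input.

```cpp
#include <string>

// Characters the question treats as invisible debris: ASCII whitespace,
// U+00A0, U+00C2, U+00E2, U+20AC and U+2039 (set copied from the question).
const std::u32string kDebris = U" \t\n\r\f\v\u00A0\u00C2\u00E2\u20AC\u2039";

// Returns a single space when `text` is non-empty and every character in it
// belongs to kDebris; otherwise returns `text` unchanged.
std::u32string collapse_if_only_debris(const std::u32string& text)
{
    if (!text.empty() && text.find_first_not_of(kDebris) == std::u32string::npos)
        return U" ";
    return text;
}
```

Because a char32_t holds a full code point, find_first_not_of compares whole characters here, which it cannot do on the bytes of a UTF-8 std::string.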

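Other notes:

std::wstring_convert and std::codecvt_utf8 (the <codecvt> header) have been deprecated since C++17, but as the standard library offers no other obvious option you might as well use them. You can always change the implementation of narrow and widen later if need be.
Depending on the platform, std::wstring may not be the best choice (wchar_t is not the same size everywhere), but it will probably be fine. You could also look at std::u16string and std::u32string.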
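
If the deprecated facilities are a concern, widen could be replaced by a small hand-written decoder that produces a std::u32string. The following is only a sketch and not part of the answer above: the name utf8_to_utf32 is hypothetical, and for brevity it does not reject overlong encodings or surrogate code points.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Decodes UTF-8 bytes into one char32_t per code point.
// Throws std::invalid_argument on a bad lead byte, a bad continuation byte,
// or a sequence that is cut off at the end of the input.
// Simplification: overlong forms and surrogates are not rejected.
std::u32string utf8_to_utf32(const std::string& bytes)
{
    std::u32string out;
    for (std::size_t i = 0; i < bytes.size();)
    {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        char32_t cp = 0;
        std::size_t len = 0;
        if (lead < 0x80)                { cp = lead;        len = 1; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + len > bytes.size())
            throw std::invalid_argument("truncated UTF-8 sequence");

        // Each continuation byte contributes its low six bits.
        for (std::size_t k = 1; k < len; ++k)
        {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

An exception from this function is also a quick signal that the input is not UTF-8 in the first place.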

Live demo.

Inspired by here.

【Comments】:

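This looks good to me, but it does not handle every case. For example, for the input std::cout << "1.\t\"" << detect_Unicode (u8" â€‹ â€‹ ") << "\"\n"; the output should be 1. " "
The other cases your code handles can be done simply with this condition: if(!str.empty() && str.find_first_not_of(" \t\n\r\f\v\u00A0\u00C2\u00E2\u20AC\u2039")==string::npos) str.assign(" ");
"...the output should be 1. " "": why? "...can be done simply with this condition": how does that improve things?
In your live demo it works as expected, but with my string it does not, so it seems my string is not UTF-8.
This "â€‹" is not detected for me.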
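
Since the last comments suggest the real input may not be encoded as assumed, one way to find out is to look at the raw bytes. The helper below is a hypothetical diagnostic sketch, not something from the thread. For reference, U+200B in UTF-8 is E2 80 8B, while the three visible characters "â", "€" and "‹" stored as UTF-8 text are C3 A2, E2 82 AC and E2 80 B9. Which of these, if either, the input really contains is exactly what the dump is meant to show.

```cpp
#include <cstddef>
#include <string>

// Formats every byte of `s` as two uppercase hex digits, separated by spaces.
std::string hex_dump(const std::string& s)
{
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (std::size_t i = 0; i < s.size(); ++i)
    {
        const unsigned char byte = static_cast<unsigned char>(s[i]);
        if (i != 0)
            out.push_back(' ');
        out.push_back(digits[byte >> 4]);    // high nibble
        out.push_back(digits[byte & 0x0F]);  // low nibble
    }
    return out;
}
```

Printing hex_dump(str) for one of the problem strings and comparing it with the byte sequences above should show which characters need to go into the search set.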