"Linguistic processing" in Unicode code point conversion? (Answers)

【Question Title】: "Linguistic processing" in Unicode code point conversion?
【Posted】: 2016-06-22 04:05:45
【Question】:

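The MSDN documentation for Char.ConvertFromUtf32 states:

A valid code point outside the Basic Multilingual Plane (BMP) always yields a valid surrogate pair. However, a valid code point within the BMP might not yield a valid result according to the Unicode standard because no linguistic processing is used in the conversion. For that reason, use the System.Text::UTF32Encoding class to convert bulk UTF-32 data into bulk UTF-16 data.

What is the "linguistic processing" mentioned above? For characters in the BMP, would a Char.ConvertFromUtf32(i)[0] call ever yield a different result than (char)i?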

【Question Comments】:

Tags: c# .net unicode char

【Answer 1】:

// Requires: using System; using System.Text;
// Walk the whole BMP (U+0000 through U+FFFF inclusive) and compare three conversions.
for (int i = 0; i <= 0xFFFF; i++)
{
    char ch1 = (char)i;

    // char.ConvertFromUtf32 throws for surrogate code points, so skip that range here.
    if (i < 0xD800 || i > 0xDFFF)
    {
        string str1 = char.ConvertFromUtf32(i);

        if (str1.Length != 1)
        {
            Console.WriteLine("U+{0:X4}: char.ConvertFromUtf32(i).Length = {1}", i, str1.Length);
        }

        char ch2 = str1[0];

        if (ch1 != ch2)
        {
            Console.WriteLine("U+{0:X4}: (char)i = 0x{1:X4}, char.ConvertFromUtf32(i)[0] = 0x{2:X4}", i, (int)ch1, (int)ch2);
        }
    }

    // Encoding.UTF32 is little-endian; this assumes BitConverter is too (true on x86/x64).
    byte[] bytes = BitConverter.GetBytes(i);
    string str2 = Encoding.UTF32.GetString(bytes);

    if (str2.Length != 1)
    {
        Console.WriteLine("U+{0:X4}: Encoding.UTF32.GetString(bytes).Length = {1}", i, str2.Length);
    }

    char ch3 = str2[0];

    if (ch1 != ch3)
    {
        Console.WriteLine("U+{0:X4}: (char)i = 0x{1:X4}, Encoding.UTF32.GetString(bytes)[0] = 0x{2:X4}", i, (int)ch1, (int)ch3);
    }
}

The only difference seems to be in the 0xD800 - 0xDFFF range, where char.ConvertFromUtf32() throws an exception, while Encoding.UTF32.GetString() returns 0xFFFD for the invalid characters.
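To look at that one difference in isolation, here is a minimal sketch for a single surrogate code point. The class and method names (SurrogateRangeDemo, ThrowsOnConvert, DecodeUtf32) are made up for illustration; only char.ConvertFromUtf32, BitConverter and Encoding.UTF32 are framework APIs. It assumes the default Encoding.UTF32 instance, which substitutes a replacement character instead of throwing.

```csharp
using System;
using System.Text;

static class SurrogateRangeDemo
{
    // True if char.ConvertFromUtf32 rejects the given code point.
    public static bool ThrowsOnConvert(int codePoint)
    {
        try
        {
            char.ConvertFromUtf32(codePoint);
            return false;
        }
        catch (ArgumentOutOfRangeException)
        {
            return true;
        }
    }

    // Decodes one UTF-32 code unit with the default (little-endian) Encoding.UTF32.
    public static string DecodeUtf32(int codePoint)
    {
        byte[] bytes = BitConverter.GetBytes(codePoint);
        // BitConverter follows the machine's byte order; normalise to little-endian.
        if (!BitConverter.IsLittleEndian) Array.Reverse(bytes);
        return Encoding.UTF32.GetString(bytes);
    }

    static void Main()
    {
        int surrogate = 0xD800; // first code point of the surrogate range

        Console.WriteLine("ConvertFromUtf32 throws: {0}", ThrowsOnConvert(surrogate));

        string decoded = DecodeUtf32(surrogate);
        Console.WriteLine("UTF32 decoder returns: U+{0:X4}", (int)decoded[0]);
    }
}
```

For an ordinary BMP code point such as 0x41, both paths should simply give back "A".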

In the reference source we can clearly see that there is no "special handling" of UTF-32 characters:

if (iChar >= 0x10000)
{
    *(chars++) = GetHighSurrogate(iChar);
    iChar = GetLowSurrogate(iChar);
}

// Add the rest of the surrogate or our normal character
*(chars++) = (char)iChar;

(I have omitted many lines of code that are not relevant to this discussion.)
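For reference, the excerpt above is plain UTF-16 arithmetic. The sketch below reproduces it with the standard surrogate formulas; HighSurrogate, LowSurrogate and ToUtf16 are hypothetical stand-ins written for this example, not the framework's actual GetHighSurrogate/GetLowSurrogate implementations, which are not shown in the excerpt.

```csharp
using System;

static class SurrogateMath
{
    // Standard UTF-16 formulas: subtract 0x10000, then split the remaining 20 bits in half.
    public static char HighSurrogate(int codePoint) =>
        (char)(0xD800 + ((codePoint - 0x10000) >> 10));

    public static char LowSurrogate(int codePoint) =>
        (char)(0xDC00 + ((codePoint - 0x10000) & 0x3FF));

    // Converts one code point to UTF-16 with no processing beyond range checks.
    public static string ToUtf16(int codePoint)
    {
        bool isSurrogate = codePoint >= 0xD800 && codePoint <= 0xDFFF;
        if (codePoint < 0 || codePoint > 0x10FFFF || isSurrogate)
            throw new ArgumentOutOfRangeException(nameof(codePoint));

        if (codePoint >= 0x10000)
            return new string(new[] { HighSurrogate(codePoint), LowSurrogate(codePoint) });

        // BMP code points map to a single UTF-16 code unit unchanged.
        return ((char)codePoint).ToString();
    }

    static void Main()
    {
        int cp = 0x1F600; // a code point outside the BMP
        string mine = ToUtf16(cp);

        // Prints: D83D DE00
        Console.WriteLine("{0:X4} {1:X4}", (int)mine[0], (int)mine[1]);

        // Compare against the framework conversion.
        Console.WriteLine(mine == char.ConvertFromUtf32(cp));
    }
}
```

Nothing in this arithmetic depends on what the character means, which is consistent with the finding that BMP code points pass through as a straight cast.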

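【Comments】:

Thanks for writing the code to check! The U+D800–U+DFFF range is reserved for surrogates, which are invalid as code points outside of UTF-16, so the exception/fallback character is to be expected. If that's the only difference, I think the MSDN documentation is wrong, and is possibly referring to some Unicode normalization that shouldn't be part of code point conversion.
@Douglas I even checked the reference source of UTF32Encoding, and there is no "special handling" there either.
It's a pity that MSDN has removed the ability to add comments to its documentation. Errors like this one should be pointed out.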