What exactly is the Unicode number?

【Question title】: What on earth is the Unicode number?
【Posted】: 2012-09-08 11:12:04
【Question description】:

In Python:

>>> "\xc4\xe3".decode("gbk").encode("utf-8")
'\xe4\xbd\xa0'
>>> "\xc4\xe3".decode("gbk")
u'\u4f60'

We can draw two conclusions:

1. \xc4\xe3 in GBK encoding = \xe4\xbd\xa0 in UTF-8
2. \xc4\xe3 in GBK encoding = \x4f\x60 in Unicode (or, put another way, in UCS-2)

In R:

> iconv("\xc4\xe3",from="gbk",to="utf-8",toRaw=TRUE)
[[1]]
[1] e4 bd a0
> iconv("\xc4\xe3",from="gbk",to="unicode",toRaw=TRUE)
[[1]]
[1] ff fe 60 4f

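Conclusion 1 holds: Python and R give the same result.
Conclusion 2 is the puzzle:
what exactly is \xc4\xe3 (in GBK encoding) in Unicode?
In Python it is u'\u4f60'; in R it is ff fe 60 4f.
Are the two equal? Which one is right? Are both right?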

【Question comments】:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
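Unicode on Wikipedia: "Unicode can be implemented by different character encodings. The most commonly used encodings are UTF-8, UTF-16 and the now-obsolete UCS-2...."
This Wikipedia article discusses the GBK encoding: en.wikipedia.org/wiki/GBK. This article covers Unicode in Python: docs.python.org/howto/unicode.html
Please read the article linked in @delnan's comment. Really.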

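【Solution 1】:

In Python, the \uxxxx notation refers to a Unicode code point, not to any encoding of that code point.

UCS-2, UTF-16 and UTF-8 are all encodings that represent those code points as bytes, suitable for storing in files, sending over a network, and so on.

R's representation of the \u4f60 code point includes the UTF-16 Byte Order Mark, or BOM. It indicates the byte order that was chosen, where 0xFFFE (the bytes ff fe) means little-endian. Python includes it too when you encode to UTF-16: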
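
To make the code point versus encoding distinction concrete, here is a minimal sketch. It assumes Python 3 (the thread itself uses Python 2, where the `u'...'` prefix and `str.decode` behave differently), and the variable names are purely illustrative. It takes the single code point U+4F60 and asks for its bytes under several encodings:

```python
# Python 3 sketch (assumption: the thread's own examples are Python 2).
# One abstract code point, several concrete byte encodings.
ch = "\u4f60"  # the character from the question, code point U+4F60

code_point = ord(ch)  # an integer, 0x4F60; not a byte sequence

gbk_bytes = ch.encode("gbk")            # bytes c4 e3
utf8_bytes = ch.encode("utf-8")         # bytes e4 bd a0
utf16le_bytes = ch.encode("utf-16-le")  # bytes 60 4f (low byte first)
utf16be_bytes = ch.encode("utf-16-be")  # bytes 4f 60 (high byte first)

print(hex(code_point))
for name, data in [
    ("gbk", gbk_bytes),
    ("utf-8", utf8_bytes),
    ("utf-16-le", utf16le_bytes),
    ("utf-16-be", utf16be_bytes),
]:
    print(name, data.hex())
```

The number 0x4F60 stays the same throughout; only the byte layout changes with the encoding.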

>>> u'\u4f60'.encode('utf16')
'\xff\xfe`O'

The big-endian equivalent is 0xFEFF (the bytes fe ff). You can explicitly encode to utf-16be or utf-16le in Python to avoid including a BOM, since you have then made an explicit choice:

>>> u'\u4f60'.encode('utf-16be')
'O`'
>>> u'\u4f60'.encode('utf-16le')
'`O'
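
Going the other way answers the original question directly. The sketch below, again assuming Python 3 and using illustrative variable names, decodes the four bytes that R printed (ff fe 60 4f) and checks that they yield the same code point Python reported. The fallback branch for input without a BOM is an assumption made for this example, not a rule:

```python
import codecs

# The bytes R printed for to="unicode" in the question.
r_bytes = bytes([0xFF, 0xFE, 0x60, 0x4F])

# The generic "utf-16" codec reads the BOM, picks the byte order,
# and drops the BOM from the decoded text.
decoded = r_bytes.decode("utf-16")

# The same thing by hand: inspect the BOM, then decode the remainder
# with an explicit byte order.
if r_bytes.startswith(codecs.BOM_UTF16_LE):
    manual = r_bytes[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")
elif r_bytes.startswith(codecs.BOM_UTF16_BE):
    manual = r_bytes[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")
else:
    # Assumption for this sketch only: treat BOM-less input as little-endian.
    manual = r_bytes.decode("utf-16-le")

print(decoded == "\u4f60", manual == "\u4f60")
```

Both results compare equal to "\u4f60", so the Python and R outputs describe the same character: R shows one particular encoding of it (UTF-16, little-endian, with a BOM), while Python shows the code point itself.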

You really should read the Joel Spolsky Unicode article and the Python Unicode HOWTO for a fuller understanding of the difference between Unicode and encodings.

【Discussion】: