Converting hex into UTF8 not working as expected in perl

【Question title】: Converting hex into UTF8 not working as expected in perl
【Posted】: 2020-04-04 04:11:06
【Question】:

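I'm trying to understand UTF8 in perl.

I have the following string: Alizéh. If I look up the hex for this string I get 416c697ac3a968 from https://onlineutf8tools.com/convert-utf8-to-hexadecimal (this matches the original source of the string).

So I thought that packing that hex and encoding it to utf8 should produce the unicode string. But it produces something very different.

Can anyone explain what I'm getting wrong?

Here is a simple test program to show my working.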

#!/usr/bin/perl

use strict;
use warnings;

use Text::Unaccent;
use Encode;

use utf8;
binmode STDOUT, ':encoding(UTF-8)';

print "First test that the utf8 string Alizéh prints as expected\n\n";

print "=========================================== Hex to utf8 test start\n";

my $hexRepresentationOfTheString = '416c697ac3a968';
my $packedHexIntoPlainString = pack("H*", $hexRepresentationOfTheString);
print "The hex of the string is $hexRepresentationOfTheString\n";
print "The string after packing prints as $packedHexIntoPlainString\n";
utf8::encode($packedHexIntoPlainString);
print "Utf8 encoding the string produces $packedHexIntoPlainString\n";

print "=========================================== Hex to utf8 test finish\n\n";

print "=========================================== utf8 from code test start\n";
my $utf8FromCode = "Alizéh";
print "Variable prints as $utf8FromCode\n";

my ($hex) = unpack("H*", $utf8FromCode);

print "Hex of this string is now $hex\n";

print "Decoding the utf8 string\n";
utf8::decode($utf8FromCode);

$hex = unpack ("H*", $utf8FromCode);
print "Hex string is now         $hex\n";

print "=========================================== utf8 from code test finish\n\n";

This prints:

First test that the utf8 string Alizéh prints as expected

=========================================== Hex to utf8 test start
The hex of the string is 416c697ac3a968
The string after packing prints as AlizÃ©h
Utf8 encoding the string produces AlizÃÂ©h
=========================================== Hex to utf8 test finish

=========================================== utf8 from code test start
Variable prints as Alizéh
Hex of this string is now 416c697ae968
Decoding the utf8 string
Hex string is now         416c697ae968
=========================================== utf8 from code test finish

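Any tips on how to take the hex value of a UTF8 string and turn it into a valid UTF8 scalar in perl?

There is some further weirdness, which I'll show in this extended version: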

#!/usr/bin/perl

use strict;
use warnings;

use Text::Unaccent;
use Encode;

use utf8;
binmode STDOUT, ':encoding(UTF-8)';

print "First test that the utf8 string Alizéh prints as expected\n\n";

print "=========================================== Hex to utf8 test start\n";

my $hexRepresentationOfTheString = '416c697ac3a968';
my $packedHexIntoPlainString = pack("H*", $hexRepresentationOfTheString);
print "The hex of the string is $hexRepresentationOfTheString\n";
print "The string after packing prints as $packedHexIntoPlainString\n";
utf8::encode($packedHexIntoPlainString);
print "Utf8 encoding the string produces $packedHexIntoPlainString\n";

print "=========================================== Hex to utf8 test finish\n\n";

print "=========================================== utf8 from code test start\n";
my $utf8FromCode = "Alizéh";
print "Variable prints as $utf8FromCode\n";

my ($hex) = unpack("H*", $utf8FromCode);

print "Hex of this string is now $hex\n";

print "Decoding the utf8 string\n";
utf8::decode($utf8FromCode);

$hex = unpack ("H*", $utf8FromCode);
print "Hex string is now         $hex\n";

print "=========================================== utf8 from code test finish\n\n";

print "=========================================== Unaccent test start\n";

my $plaintest = unac_string('utf8', "Alizéh");

print "Alizéh passed to the unaccent gives $plaintest\n";


my $cleanpackedHexIntoPlainString = pack("H*", $hexRepresentationOfTheString);
print "Packed version of the hex string prints as  $cleanpackedHexIntoPlainString\n";

my $packedtest = unac_string('utf8', $cleanpackedHexIntoPlainString);

print "Unaccenting the packed version gives $packedtest\n";

utf8::encode($cleanpackedHexIntoPlainString);
print "encoding the packed version it now prints as $cleanpackedHexIntoPlainString\n";

$packedtest = unac_string('utf8', $cleanpackedHexIntoPlainString);

print "Now unaccenting the packed version gives $packedtest\n";

print "=========================================== Unaccent test finish\n\n";

This prints:

First test that the utf8 string Alizéh prints as expected

=========================================== Hex to utf8 test start
The hex of the string is 416c697ac3a968
The string after packing prints as AlizÃ©h
Utf8 encoding the string produces AlizÃÂ©h
=========================================== Hex to utf8 test finish

=========================================== utf8 from code test start
Variable prints as Alizéh
Hex of this string is now 416c697ae968
Decoding the utf8 string
Hex string is now         416c697ae968
=========================================== utf8 from code test finish

=========================================== Unaccent test start
Alizéh passed to the unaccent gives Alizeh
Packed version of the hex string prints as  AlizÃ©h
Unaccenting the packed version gives Alizeh
encoding the packed version it now prints as AlizÃÂ©h
Now unaccenting the packed version gives AlizAÂ©h
=========================================== Unaccent test finish

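In this test the unaccent library seems to accept the packed version of the hex string. I'm not sure why. Can someone help me understand why that is the case?

【Comments】:

As a side note, Text::Unidecode works in the same sort of problem space and is in a more reliable state.

Tags: perl utf-8

【Solution 1】:

Unicode strings are first-class values in Perl; you don't need to jump through these hoops. You just need to recognize and keep track of when you have bytes and when you have characters. Perl won't make that distinction for you, and every byte string is also a valid character string. You are in fact double-encoding your strings. The result is still a valid string, because it consists of UTF-8 encoded bytes that represent (the characters corresponding to) your original UTF-8 encoded bytes.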
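
The double encoding described above can be made visible by looking at the bytes before and after utf8::encode. This is a minimal sketch that reuses the hex string from the question; the variable names are only illustrative.

```perl
use strict;
use warnings;

# The UTF-8 bytes of the string, built from the hex in the question.
my $bytes = pack 'H*', '416c697ac3a968';

# Perl sees each byte as one character in the range 0..255,
# so this string has 7 "characters", not 6.
my $length_before = length $bytes;

# utf8::encode() encodes those 7 characters as UTF-8 in place.
# The bytes C3 and A9 are above 0x7F, so each one becomes two bytes.
my $double = $bytes;
utf8::encode($double);

my $hex_after    = unpack 'H*', $double;
my $length_after = length $double;

print "length before: $length_before\n";   # 7
print "hex after:     $hex_after\n";       # 416c697ac383c2a968
print "length after:  $length_after\n";    # 9
```

The extra C3 83 C2 A9 sequence is what shows up as the garbled characters in the question's output.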

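use utf8; decodes your source code from UTF-8, so by declaring it, the literal strings that follow are already unicode strings and can be passed to any API that correctly accepts characters. To get the same result from a string of UTF-8 bytes (which is what you produce by packing the hex representation of the bytes), use decode from Encode (or my nicer wrapper).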

use strict;
use warnings;
use utf8;
use Encode 'decode';

my $str = 'Alizéh'; # already decoded
my $hex = '416c697ac3a968';
my $bytes = pack 'H*', $hex;
my $chars = decode 'UTF-8', $bytes;
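
As a quick check of the snippet above, the decoded string can be compared with the literal. This sketch writes the é as the escape \x{e9} so that it does not depend on how the file is saved; under use utf8 that is assumed to be the same string as the literal in the answer.

```perl
use strict;
use warnings;
use Encode 'decode';

my $str   = "Aliz\x{e9}h";                  # six characters, e-acute is U+00E9
my $bytes = pack 'H*', '416c697ac3a968';    # seven raw UTF-8 bytes
my $chars = decode 'UTF-8', $bytes;         # six characters again

my $same       = $chars eq $str ? 1 : 0;
my $byte_count = length $bytes;
my $char_count = length $chars;
my $codepoint  = sprintf 'U+%04X', ord(substr($chars, 4, 1));

print "same string: $same\n";               # 1
print "bytes: $byte_count, characters: $char_count\n";   # 7 and 6
print "fifth character: $codepoint\n";      # U+00E9
```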

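Unicode strings need to be encoded to UTF-8 for output to anything that expects bytes, such as STDOUT. An :encoding(UTF-8) layer can be applied to such handles to do this automatically, and likewise to decode automatically from an input handle. Exactly what should be applied depends entirely on where your characters are coming from and where they are going. See this answer for probably too much information on the available options.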

use Encode 'encode';
print encode 'UTF-8', "$chars\n";
binmode *STDOUT, ':encoding(UTF-8)'; # warning: global effect
print "$chars\n";
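
Putting both directions together, the round trip between a character string and the hex of its UTF-8 bytes might look like the sketch below. The helper names chars_to_hex and hex_to_chars are hypothetical and not part of Encode.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: characters -> UTF-8 bytes -> hex digits.
sub chars_to_hex {
    my ($chars) = @_;
    return unpack 'H*', encode('UTF-8', $chars);
}

# Hypothetical helper: hex digits -> UTF-8 bytes -> characters.
sub hex_to_chars {
    my ($hex) = @_;
    return decode('UTF-8', pack('H*', $hex));
}

my $chars = "Aliz\x{e9}h";            # the e-acute is the single character U+00E9

# Unpacking the character string directly skips the encoding step,
# which is why the question saw ...e968 instead of ...c3a968.
my $naive_hex = unpack 'H*', $chars;

my $hex  = chars_to_hex($chars);
my $back = hex_to_chars($hex);

print "without encoding: $naive_hex\n";   # 416c697ae968
print "with encoding:    $hex\n";         # 416c697ac3a968
print "round trip ok\n" if $back eq $chars;
```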

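【Comments】:

Perfect. I would just add that binmode *STDOUT, ':encoding(UTF-8)' is one of the things that use open ':std', ':encoding(UTF-8)'; does.
Re "my $str = 'Alizéh'; # already decoded": it is already decoded because of use utf8;, in case that isn't clear.
Thank you. The reason I was doing this was to deal with strings coming back from a database in the wrong encoding. It turns out they were put into the database as unicode, but when they were read out, the Sybase driver was re-encoding them as unicode. So the solution was to decode the strings twice. This answer helped me figure out what was going wrong. I had encode and decode the wrong way round, and your explanation is perfect, thanks!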
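
The "decode twice" fix from the last comment can be sketched as follows. The assumption here is that the data arrives as bytes that were UTF-8 encoded twice; the hex value below is only a stand-in for such data, not real driver output.

```perl
use strict;
use warnings;
use Encode 'decode';

# Stand-in for doubly encoded data: the UTF-8 bytes of the string,
# encoded to UTF-8 a second time (C3 A9 became C3 83 C2 A9).
my $from_db = pack 'H*', '416c697ac383c2a968';

# The first decode undoes the outer layer and yields the original UTF-8
# bytes, held as characters in the range 0..255.
my $once = decode 'UTF-8', $from_db;

# The second decode turns those bytes into the intended characters.
my $fixed = decode 'UTF-8', $once;

my $length_once  = length $once;
my $length_fixed = length $fixed;

print "after one decode:  $length_once characters\n";    # 7
print "after two decodes: $length_fixed characters\n";   # 6
```

Decoding twice only makes sense when the data is known to be doubly encoded; applied to correctly encoded input it would damage the string.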