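[Posted]: 2020-04-04 04:11:06
[Question]:
I am trying to understand UTF-8 in Perl.
I have the string Alizéh. Looking up the hex for this string at https://onlineutf8tools.com/convert-utf8-to-hexadecimal gives 416c697ac3a968 (which matches the original source of the string).
So I assumed that packing that hex and encoding it as UTF-8 would produce the Unicode string, but it produces something quite different.
Can anyone explain where I am going wrong?
Here is a simple test program to show my working.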
#!/usr/bin/perl
use strict;
use warnings;
use Text::Unaccent;
use Encode;
use utf8;
binmode STDOUT, ':encoding(UTF-8)';
print "First test that the utf8 string Alizéh prints as expected\n\n";
print "=========================================== Hex to utf8 test start\n";
my $hexRepresentationOfTheString = '416c697ac3a968';
my $packedHexIntoPlainString = pack("H*", $hexRepresentationOfTheString);
print "The hex of the string is $hexRepresentationOfTheString\n";
print "The string after packing prints as $packedHexIntoPlainString\n";
utf8::encode($packedHexIntoPlainString);
print "Utf8 encoding the string produces $packedHexIntoPlainString\n";
print "=========================================== Hex to utf8 test finish\n\n";
print "=========================================== utf8 from code test start\n";
my $utf8FromCode = "Alizéh";
print "Variable prints as $utf8FromCode\n";
my ($hex) = unpack("H*", $utf8FromCode);
print "Hex of this string is now $hex\n";
print "Decoding the utf8 string\n";
utf8::decode($utf8FromCode);
$hex = unpack ("H*", $utf8FromCode);
print "Hex string is now $hex\n";
print "=========================================== utf8 from code test finish\n\n";
This prints:
First test that the utf8 string Alizéh prints as expected
=========================================== Hex to utf8 test start
The hex of the string is 416c697ac3a968
The string after packing prints as Alizéh
Utf8 encoding the string produces Alizéh
=========================================== Hex to utf8 test finish
=========================================== utf8 from code test start
Variable prints as Alizéh
Hex of this string is now 416c697ae968
Decoding the utf8 string
Hex string is now 416c697ae968
=========================================== utf8 from code test finish
Any tips on how to take the hex value of a UTF-8 string in Perl and turn it into a valid UTF-8 scalar?
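One way to approach this conversion, offered as a minimal sketch and not a definitive answer: the hex describes UTF-8 encoded bytes, so after pack the string needs to be decoded (bytes to characters), not encoded. The variable names below are illustrative; decode and encode come from the core Encode module.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hex dump of the UTF-8 encoded bytes of "Alizéh" (taken from the question).
my $hex = '416c697ac3a968';

# pack("H*") yields a byte string: 7 bytes, with 0xC3 0xA9 standing for é.
my $bytes = pack('H*', $hex);

# Decoding interprets those bytes as UTF-8 and yields a character string:
# 6 characters, the fifth being U+00E9.
my $chars = decode('UTF-8', $bytes);

# Encoding goes the other way and should reproduce the original hex.
my $roundtrip = unpack('H*', encode('UTF-8', $chars));

# The output layer encodes characters back to UTF-8 when printing.
binmode STDOUT, ':encoding(UTF-8)';
printf "bytes: %d, characters: %d\n", length($bytes), length($chars);
print "decoded string: $chars\n";
print "round-trip hex: $roundtrip\n";
```

The design point is to decode once on input and let the :encoding(UTF-8) layer encode once on output, so the string is never encoded twice.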
There are some further oddities, which I illustrate in this extended version:
#!/usr/bin/perl
use strict;
use warnings;
use Text::Unaccent;
use Encode;
use utf8;
binmode STDOUT, ':encoding(UTF-8)';
print "First test that the utf8 string Alizéh prints as expected\n\n";
print "=========================================== Hex to utf8 test start\n";
my $hexRepresentationOfTheString = '416c697ac3a968';
my $packedHexIntoPlainString = pack("H*", $hexRepresentationOfTheString);
print "The hex of the string is $hexRepresentationOfTheString\n";
print "The string after packing prints as $packedHexIntoPlainString\n";
utf8::encode($packedHexIntoPlainString);
print "Utf8 encoding the string produces $packedHexIntoPlainString\n";
print "=========================================== Hex to utf8 test finish\n\n";
print "=========================================== utf8 from code test start\n";
my $utf8FromCode = "Alizéh";
print "Variable prints as $utf8FromCode\n";
my ($hex) = unpack("H*", $utf8FromCode);
print "Hex of this string is now $hex\n";
print "Decoding the utf8 string\n";
utf8::decode($utf8FromCode);
$hex = unpack ("H*", $utf8FromCode);
print "Hex string is now $hex\n";
print "=========================================== utf8 from code test finish\n\n";
print "=========================================== Unaccent test start\n";
my $plaintest = unac_string('utf8', "Alizéh");
print "Alizéh passed to the unaccent gives $plaintest\n";
my $cleanpackedHexIntoPlainString = pack("H*", $hexRepresentationOfTheString);
print "Packed version of the hex string prints as $cleanpackedHexIntoPlainString\n";
my $packedtest = unac_string('utf8', $cleanpackedHexIntoPlainString);
print "Unaccenting the packed version gives $packedtest\n";
utf8::encode($cleanpackedHexIntoPlainString);
print "encoding the packed version it now prints as $cleanpackedHexIntoPlainString\n";
$packedtest = unac_string('utf8', $cleanpackedHexIntoPlainString);
print "Now unaccenting the packed version gives $packedtest\n";
print "=========================================== Unaccent test finish\n\n";
This prints:
First test that the utf8 string Alizéh prints as expected
=========================================== Hex to utf8 test start
The hex of the string is 416c697ac3a968
The string after packing prints as Alizéh
Utf8 encoding the string produces Alizéh
=========================================== Hex to utf8 test finish
=========================================== utf8 from code test start
Variable prints as Alizéh
Hex of this string is now 416c697ae968
Decoding the utf8 string
Hex string is now 416c697ae968
=========================================== utf8 from code test finish
=========================================== Unaccent test start
Alizéh passed to the unaccent gives Alizeh
Packed version of the hex string prints as Alizéh
Unaccenting the packed version gives Alizeh
encoding the packed version it now prints as Alizéh
Now unaccenting the packed version gives AlizA©h
=========================================== Unaccent test finish
In this test, the unaccent library appears to accept the packed version of the hex string. I don't know why; can someone help me understand why that is?
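A possible explanation, hedged because it is inferred from the output above and not from the Text::Unaccent source: unac_string('utf8', ...) appears to want UTF-8 encoded bytes, which is exactly what pack("H*", ...) returns. Calling utf8::encode on those bytes encodes them a second time, which would account for the AlizA©h result. The sketch below uses only the built-in utf8:: functions to show the two directions; the variable names are illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# UTF-8 encoded bytes for "Alizéh", not yet decoded.
my $bytes = pack('H*', '416c697ac3a968');

# utf8::encode treats each byte as a character (0xC3 is U+00C3, 0xA9 is U+00A9)
# and encodes those, so the data ends up UTF-8 encoded twice.
my $double = $bytes;
utf8::encode($double);
my $double_hex = unpack('H*', $double);

# utf8::decode goes the other way: bytes in, characters out.
my $chars = $bytes;
utf8::decode($chars);
my $char_hex = unpack('H*', $chars);

print "double-encoded: $double_hex\n";   # 416c697ac383c2a968
print "decoded:        $char_hex\n";     # 416c697ae968
```

The decoded hex matches what the question reports for the literal "Alizéh" under use utf8, which suggests that literal was already a character string and that utf8::decode on it had nothing left to decode.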
[Comments]:
- As a side note, Text::Unidecode works in the same problem space and is in a more reliable state.