Why do I get an extra newline in the middle of a UTF-8 character with XML::Parser?

[Question title]: Why do I get an extra newline in the middle of a UTF-8 character with XML::Parser?
[Posted]: 2011-01-31 11:32:43
[Question description]:

I have run into a problem involving UTF-8, XML and Perl. Below is the smallest piece of code and data that reproduces it.

This is the XML file to be parsed:

<?xml version="1.0" encoding="utf-8"?>
<test>
  <words>בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת</words>
  <words>בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת</words>
  <words>בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת</words>

  [<words> .... </words> 148 times repeated]

  <words>בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת</words>
  <words>בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת</words>
</test>

It is parsed with this Perl script:

use warnings;
use strict;

use XML::Parser;
use Data::Dump;

my $in_words = 0;

my $xml_parser = XML::Parser->new(Style => 'Stream');

$xml_parser->setHandlers (
   Start   => \&start_element,
   End     => \&end_element,
   Char    => \&character_data,
   Default => \&default);

open OUT, '>out.txt' or die;
binmode (OUT, ":utf8");
open XML, 'xml_test.xml' or die;
$xml_parser->parse(*XML);
close XML;
close OUT;


sub start_element {
  my($parseinst, $element, %attributes) = @_;

  if ($element eq 'words') {
    $in_words = 1;
  }
  else {
    $in_words = 0;
  }
}

sub end_element {
  # The End handler receives only the parser instance and the element name.
  my($parseinst, $element) = @_;

  if ($element eq 'words') {
    $in_words = 0;
  }
}

sub default {
  # nothing to see here;
}

sub character_data {
  my($parseinst, $data) = @_;

  # Print each chunk of character data inside <words> on its own line.
  if ($in_words) {
    print OUT "$data\n";
  }
}

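When the script runs, it produces the file out.txt. The problem is on line 147 of that file: the 22nd character (which in UTF-8 consists of the bytes \xd6 \xb8) is split by a newline between d6 and b8. This should not happen.

I would like to know whether anyone else has this problem or can reproduce it, and why it happens. I am running the script on Windows: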

C:\temp>perl -v

This is perl, v5.10.0 built for MSWin32-x86-multi-thread
(with 5 registered patches, see perl -V for more detail)

Copyright 1987-2007, Larry Wall

Binary build 1003 [285500] provided by ActiveState http://www.ActiveState.com
Built May 13 2008 16:52:49

[Question discussion]:

Tags: xml perl utf-8

[Solution 1]:

What happens when you open the input file with an explicit UTF-8 encoding?

 open XML, '<:utf8', 'xml_test.xml' or die;

Never trust anything to get the encoding right by guessing. Whenever you can, specify the encoding explicitly yourself.
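To illustrate what "explicit" can look like, here is a minimal sketch that writes and reads a file with named encoding layers and then inspects the raw bytes. It does not involve XML::Parser at all; the temporary file and the two-character sample string are assumptions made for the example, and whether a decoding layer is appropriate on a handle passed to XML::Parser is a separate question.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical sample: HEBREW LETTER BET + HEBREW POINT QAMATS,
# which is the byte sequence d7 91 d6 b8 in UTF-8.
my $text = "\x{05D1}\x{05B8}";

# Scratch file for the demonstration (removed on exit).
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

# Write with an explicit encoding layer instead of relying on defaults.
open my $out, '>:encoding(UTF-8)', $path or die "Cannot write $path: $!";
print {$out} $text;
close $out or die "Cannot close $path: $!";

# Read the file back as raw bytes to see what was actually stored.
open my $raw, '<:raw', $path or die "Cannot read $path: $!";
my $bytes = do { local $/; <$raw> };
close $raw;

# Read it back again, this time decoding explicitly.
open my $in, '<:encoding(UTF-8)', $path or die "Cannot read $path: $!";
my $decoded = do { local $/; <$in> };
close $in;

printf "%d bytes, %d characters\n", length $bytes, length $decoded;
```

Comparing the byte count with the character count is a quick way to confirm that the layers did what you expected.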

Also, are you sure the input is correct? Does it pass validation with another tool, such as xmllint? I know XML::Parser should catch this sort of thing, but let's verify.
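Alongside an XML validator, a byte-level check can show whether every line of a file decodes as UTF-8. The sketch below is one way to do that with the core Encode module; the helper name `invalid_utf8_lines` and the temporary demo file are made up for illustration. It can be pointed at either the input file or out.txt.

```perl
use strict;
use warnings;
use Encode qw(decode);
use File::Temp qw(tempfile);

# Return the 1-based numbers of the lines whose bytes are not valid UTF-8.
# (Hypothetical helper, not part of any module.)
sub invalid_utf8_lines {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my @bad;
    while (my $line = <$fh>) {
        chomp $line;
        # FB_CROAK makes decode() die on malformed input. decode() may
        # modify its argument, so work on a copy.
        my $copy = $line;
        my $ok = eval { decode('UTF-8', $copy, Encode::FB_CROAK()); 1 };
        push @bad, $. unless $ok;
    }
    close $fh;
    return @bad;
}

# Demo data: line 1 is intact; lines 2 and 3 are the same character
# with a newline inserted between the bytes d6 and b8.
my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
binmode $tmp_fh, ':raw';
print {$tmp_fh} "\xd7\x91\xd6\xb8\n", "\xd7\x91\xd6\n", "\xb8\n";
close $tmp_fh or die "Cannot close $tmp_path: $!";

my @bad = invalid_utf8_lines($tmp_path);
print "Invalid UTF-8 on line(s): @bad\n";
```

If the input file comes back clean and only out.txt has broken lines, the damage is happening somewhere between reading and writing rather than in the data itself.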

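Also, can you put the offending input into a string and print it again without the problem? What happens when you remove just that part of the XML file? Does the same error pop up for another record?

[Discussion]: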
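Here is a rough sketch of the "put it in a string" test. The XML::Parser documentation notes that a single run of character data may produce several calls to the Char handler, so this version collects the text in a buffer and emits it only when the element ends. The function name `words_from_xml` and the sample data are assumptions for illustration, and I am not claiming that multiple Char calls explain a split in the middle of a character.

```perl
use strict;
use warnings;
use Encode qw(encode);
use XML::Parser;

# Collect the text of each <words> element in a buffer and emit it only
# when the element ends, so several Char calls still yield one line.
# (Hypothetical helper written for this example.)
sub words_from_xml {
    my ($xml_bytes) = @_;
    my (@lines, $buffer);

    my $parser = XML::Parser->new(
        Handlers => {
            Start => sub {
                my ($expat, $element) = @_;
                $buffer = '' if $element eq 'words';
            },
            Char => sub {
                my ($expat, $data) = @_;
                # Append only while inside a <words> element.
                $buffer .= $data if defined $buffer;
            },
            End => sub {
                my ($expat, $element) = @_;
                return unless $element eq 'words';
                push @lines, $buffer;
                undef $buffer;
            },
        },
    );
    $parser->parse($xml_bytes);
    return @lines;
}

# Hypothetical sample text; the second element contains an entity
# reference, which typically causes more than one Char call.
my $word = "\x{05D1}\x{05B8}\x{05E8}\x{05B8}\x{05D0}";
my $xml  = encode('UTF-8',
    qq{<?xml version="1.0" encoding="utf-8"?>\n}
  . qq{<test><words>$word</words><words>a &amp; b</words></test>});

my @lines = words_from_xml($xml);

binmode STDOUT, ':encoding(UTF-8)';
print "$_\n" for @lines;
```

If the string version prints cleanly while the file version does not, that points at the file-reading path rather than at the data.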

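When I open it with open XML, '<:utf8'... and remove binmode (OUT, ":utf8"), the file out.txt is written as I expect. However, the script then crashes with Out of memory! after writing the file.
I just checked the input, and the XML is well-formed.

[Solution 2]: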

I don't observe this:

C:\Temp> perl -v

This is perl, v5.10.1 built for MSWin32-x86-multi-thread
(with 2 registered patches, see perl -V for more detail)

Copyright 1987-2009, Larry Wall

Binary build 1006 [291086] provided by ActiveState http://www.ActiveState.com
Built Aug 24 2009 13:48:26

C:\Temp> perl -MXML::Parser -e "print $XML::Parser::VERSION"
2.36

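[Discussion]:

That's interesting. I have updated ActiveState Perl to v5.10.1, and now it works.
Yes, any software with a zero as its point release isn't really ready for production. :)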