Trying to parse XML in Perl, but long data string gets cut off

【Question title】: Trying to parse XML in Perl, but long data string gets cut off
【Posted】: 2011-09-09 04:38:30
【Question】:

I tried parsing an XML file with both XML::Simple and XML::Twig, with the same result. The other fields in the file work fine.

The file in question can be retrieved with:

curl -s "http://apps.nlm.nih.gov/medlineplus/services/mpconnect_service.cfm?mainSearchCriteria.v.cs=2.16.840.1.113883.6.103&mainSearchCriteria.v.c=130"

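Is this a problem with the parser or with the file? Both parsers produce the same output. The string contains HTML markup stored (escaped) inside the XML.

Input field (inside the XML tags named "summary"):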

<summary type="html">&lt;p&gt;Toxoplasmosis is a disease caused by the parasite &lt;em&gt;Toxoplasma gondii&lt;/em&gt;. More than 60 million people in the U.S. have the parasite.  Most of them don't get sick. But the parasite causes serious problems for some people. These include people with weak immune systems and babies whose mothers become infected for the first time during pregnancy. Problems can include damage to the brain, eyes and other organs.&lt;/p&gt;&#xd;^I&#xd;&lt;p&gt;You can get toxoplasmosis from &lt;/p&gt;&#xd;&lt;ul&gt;&#xd;&lt;li&gt;^IWaste from an infected cat&lt;/li&gt;&#xd;&lt;li&gt;^IEating contaminated meat that is raw or not well cooked &lt;/li&gt;&#xd;&lt;li&gt;^IUsing utensils or cutting boards after they've had contact with raw meat &lt;/li&gt;&#xd;&lt;li&gt;^IDrinking infected water &lt;/li&gt;&#xd;&lt;li&gt;^IReceiving an infected organ transplant or blood transfusion&lt;/li&gt;&#xd;&lt;/ul&gt;&#xd;&lt;p&gt;Most people with toxoplasmosis don't need treatment. There are drugs to treat it for pregnant women and people with weak immune systems. &lt;/p&gt;&#xd;&#xd;&lt;p class="NLMattribution"&gt;Centers for Disease Control and Prevention&lt;/p&gt;</summary>

Output after XML parsing:

<p>Toxoplasmosis is a disease caused by the parasite <em>Toxoplasma gondii</em>. More than 60 million people in the U.S. have the parasite.  Most of them don't get sick. But the parasite causes serious problems for some people. These include people with weak im<p class="NLMattribution">Centers for Disease Control and Prevention</p>to treat it for pregnant women and people with weak immune systems. </p>her organs.</p>

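Solution to the problem: the XML file contains carriage returns ("&#xd;"), which cause trouble for the parser. After downloading the XML file, I removed the carriage returns with the following line: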

sed -i 's/&#xd;//g' *.xml

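The parser now works as expected.

Update: the carriage returns do not affect the parser. They only affect the printed output, which looks truncated and jumbled. Removing them did solve my problem, however.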
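
As an alternative to editing the downloaded file with sed, the carriage returns could also be dropped from the text after parsing. The following is only a minimal sketch: `strip_cr` is a hypothetical helper name and `$parsed` is a made-up sample standing in for the value returned by something like `$summary->text`.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of XML::Twig or XML::Simple):
# delete every carriage return from an already-parsed string.
sub strip_cr {
    my ($text) = @_;
    $text =~ tr/\r//d;    # tr with /d deletes the listed characters
    return $text;
}

# Made-up sample standing in for the parsed summary text.
my $parsed = "first paragraph\r\t\rsecond paragraph\r";

print strip_cr($parsed), "\n";
```

This leaves the XML file untouched, so the original data stays available if the carriage returns turn out to matter elsewhere.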

【Question discussion】:

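If you know the solution, please close the question...
Actually, the carriage return characters do not cause a problem for the parser. I suspect they cause problems when you print the result, especially if you are on a Unix machine. If you write the result to a file, you should be able to see the whole text, including some ^M characters. Those characters are what make it look as though parts of the text are missing when you print it. Without seeing your code it is hard to tell, though.
Yes, that seems to be right, mirod. The printed output is wrong: some parts are removed and other parts end up in the middle of others. I have updated the post with this information.
You had me worried for a while! XML::Twig specifically buffers the entire content of text elements, so you don't have to worry about that (I assume XML::Simple does the same, by the way). This kind of problem is hard to figure out because it messes with the very tool you use to analyze it, in this case print.
Sorry about that, and thanks for your help anyway! I was searching the documentation of the different parsers for buffering, but I didn't find anything specific to Twig. Good to know it is not a parser problem!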
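
One way to check what print is hiding is to make the control characters visible before printing. This is only a rough sketch: `show_controls` is a hypothetical helper and `$text` is a made-up sample, not the actual parser output.

```perl
use strict;
use warnings;

# Hypothetical helper: replace carriage returns and tabs with
# visible escape sequences so the terminal cannot act on them.
sub show_controls {
    my ($text) = @_;
    $text =~ s/\r/\\r/g;    # carriage return becomes the two characters \r
    $text =~ s/\t/\\t/g;    # tab becomes the two characters \t
    return $text;
}

# Made-up sample: on a terminal, the raw string would have its
# beginning overwritten by whatever follows the carriage return.
my $text = "people with weak immune systems\rCenters for Disease Control";

print show_controls($text), "\n";
```

If the escaped output contains the complete text, the parser delivered everything and only the terminal display was misleading.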

Tags: xml perl parsing

【Solution 1】:

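When parsing the curl output through a pipe (using XML::Twig->new->parse( curl -s "http://..." |), I do get some strange results: the content appears truncated, and the result varies from call to call...

If I parse a file created from the curl result, or use XML::Twig's native parseurl method, things look better. The result is then constant, and it is what you want: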

#!/usr/bin/perl

use strict;
use warnings;

use XML::Twig;

my $twig    = XML::Twig->new->parseurl( "http://apps.nlm.nih.gov/medlineplus/services/mpconnect_service.cfm?mainSearchCriteria.v.cs=2.16.840.1.113883.6.103&mainSearchCriteria.v.c=130" );
my $summary = $twig->first_elt( 'summary');

print $summary->text, "\n";

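Honestly, I don't know why this happens. I will try to look into it further, but I suspect there is little I can do: if the problem shows up in both XML::Simple and XML::Twig, then it probably sits lower in the stack, in XML::Parser or expat and the way they interact with curl.

【Discussion】: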

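Thanks for your input! I tried both of your examples, parse(curl..) and parseurl(..), but the first one does not work and the second one also produces a truncated (but constant) result. I am now looking into whether it could be a buffering limit problem, see http://perl-xml.sourceforge.net/faq/#char_events. I am also working with a local XML file, downloaded via curl, in which the full text is intact.
Are you sure you are getting the whole content? When parsing the URL, try saving the complete XML to see whether the text is all there.
Yes, the output from the local XML via XML::Twig->new->parsefile and the output from parseurl are the same.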