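【Posted】: 2011-09-09 04:38:30
【Question】:
I tried parsing an XML file with both XML::Simple and XML::Twig and got the same result from each. The other fields in the file work fine.
The problematic file can be retrieved with: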
curl -s "http://apps.nlm.nih.gov/medlineplus/services/mpconnect_service.cfm?mainSearchCriteria.v.cs=2.16.840.1.113883.6.103&mainSearchCriteria.v.c=130"
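Is this a problem with the parser or with the file? Both parsers produce the same output. The affected string is HTML markup stored inside the XML.
Input field (inside the XML tag named "summary"):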
<summary type="html"><p>Toxoplasmosis is a disease caused by the parasite <em>Toxoplasma gondii</em>. More than 60 million people in the U.S. have the parasite. Most of them don't get sick. But the parasite causes serious problems for some people. These include people with weak immune systems and babies whose mothers become infected for the first time during pregnancy. Problems can include damage to the brain, eyes and other organs.</p>
^I
<p>You can get toxoplasmosis from </p>
<ul>
<li>^IWaste from an infected cat</li>
<li>^IEating contaminated meat that is raw or not well cooked </li>
<li>^IUsing utensils or cutting boards after they've had contact with raw meat </li>
<li>^IDrinking infected water </li>
<li>^IReceiving an infected organ transplant or blood transfusion</li>
</ul>
<p>Most people with toxoplasmosis don't need treatment. There are drugs to treat it for pregnant women and people with weak immune systems. </p>

<p class="NLMattribution">Centers for Disease Control and Prevention</p></summary>
Output after XML parsing:
<p>Toxoplasmosis is a disease caused by the parasite <em>Toxoplasma gondii</em>. More than 60 million people in the U.S. have the parasite. Most of them don't get sick. But the parasite causes serious problems for some people. These include people with weak im<p class="NLMattribution">Centers for Disease Control and Prevention</p>to treat it for pregnant women and people with weak immune systems. </p>her organs.</p>
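Solution to the problem: the XML file contains carriage return characters ("^M"), which I believed were causing problems for the parser. After downloading the XML file, I removed the carriage returns with the following line: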
sed -i 's/\r//g' *.xml
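The parser now works as expected.
Update: the carriage returns do not affect the parser. They only affect the printed output, which appeared truncated and jumbled. Removing them did solve my problem, however.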
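A minimal sketch of the point made in the update, assuming a POSIX shell with `printf` and `tr`. The sample string is made up for illustration and is not taken from the MedlinePlus feed. It shows that the text on both sides of a carriage return is still present in the data, and that stripping `\r` before printing keeps the terminal from moving the cursor back to the start of the line and overwriting earlier text.

```shell
# Hypothetical sample: two text fragments separated by a carriage return.
sample=$(printf 'first part\rSECOND')

# Strip carriage returns before printing, without modifying any file.
cleaned=$(printf '%s' "$sample" | tr -d '\r')

# Prints: first partSECOND
printf '%s\n' "$cleaned"
```

Printing `$sample` directly to a terminal would typically show `SECOND` written over the beginning of `first part`, which is the same kind of jumbled output as in the question.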
【Discussion】:
-
If you know the solution, please close the question...
-
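Actually, the carriage return (^M) characters do not cause problems for the parser. I suspect they cause problems when you print the result, especially if you are on a Unix machine. If you write the result to a file, you should be able to see the whole text, including some ^M characters; when printed to a terminal, those characters make it look as if parts of the text are missing. Without seeing your code, though, it is hard to tell.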
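One way to check this without being misled by the terminal is to make control characters visible. The following is a sketch only, assuming a `cat` that supports the `-v` option (GNU and BSD both do); the sample text is hypothetical.

```shell
# Hypothetical sample: a line containing an embedded carriage return.
visible=$(printf 'weak im\rCenters\n' | cat -v)

# cat -v renders the carriage return as ^M instead of letting the
# terminal act on it, so the full text stays readable.
# Prints: weak im^MCenters
printf '%s\n' "$visible"
```

Piping the parser's output through `cat -v` in the same way should reveal whether the "missing" text is really there.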
-
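Yes, that seems to be right, mirod. The printed output was wrong: some parts were missing and others appeared in the middle of other parts. I have updated the post with this information.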
-
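You had me worried for a while! XML::Twig specifically buffers the entire content of text elements so that you do not have to worry about it (I assume XML::Simple does the same, by the way). This kind of problem is hard to figure out because it messes up the very tool you use to analyze it, in this case print.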
-
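Sorry about that, and thanks for your help anyway! I was searching the documentation of the different parsers for information on buffering, but I did not find anything specific about Twig. Good to know it is not a parser problem!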