Beautiful Soup is getting XML data incomplete: answers

【Question title】: Beautiful Soup is getting XML data incomplete
【Posted】: 2016-06-08 08:57:40
【Question description】:

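I'm using Python 3.4 and Beautiful Soup 4 to pull some data from an RSS XML feed. Everything seems to work well, but sometimes it doesn't behave as expected: for at least one item in the list, it doesn't get all the data inside the <description> tag.
For example, this is the item that is giving me trouble: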

<item>
    <title>Google&#8217;s first DeepMind AI health project is missing something</title>
    <link>http://thenextweb.com/google/2016/02/25/googles-first-deepmind-ai-health-project-is-missing-something/</link>
    <comments>http://thenextweb.com/google/2016/02/25/googles-first-deepmind-ai-health-project-is-missing-something/#respond</comments>
    <pubDate>Thu, 25 Feb 2016 11:36:56 +0000</pubDate>
    <dc:creator><![CDATA[Kirsty Styles]]></dc:creator>
            <category><![CDATA[Google]]></category>
    <category><![CDATA[Insider]]></category>
    <category><![CDATA[Deepmind]]></category>
    <category><![CDATA[doctor]]></category>
    <category><![CDATA[healthcare]]></category>
    <category><![CDATA[NHS]]></category>
    <category><![CDATA[UK]]></category>

    <guid isPermaLink="false">http://thenextweb.com/?p=957096</guid>
    <description><![CDATA[<img width="520" height="245" src="http://cdn1.tnwcdn.com/wp-content/blogs.dir/1/files/2014/04/doctor-crop-520x245.jpg" alt="Doctors Seek Higher Fees From Health Insurers" title="Google&#039;s first DeepMind AI health project is missing something" data-id="750745" /><br />Having been down at Google’s DeepMind office earlier this week its man vs AI machine gaming competition preview, I was tipped off that a potentially-more-serious healthcare announcement would follow soon. That it has, but contrary to what the company’s remit might suggest, this project doesn’t actually contain any artificial intelligence at launch. “To date, no machine learning has been involved in these projects,” the company said. “While there is obvious potential in applying machine learning to these kinds of complex challenges, any decision to do so will led by clinicians.” DeepMind has announced an acquisition in the shape of an Imperial College London&#8230; <br><br><a href="http://thenextweb.com/google/2016/02/25/googles-first-deepmind-ai-health-project-is-missing-something/?utm_source=social&#038;utm_medium=feed&#038;utm_campaign=profeed">This story continues</a> at The Next Web]]></description>
    <wfw:commentRss>http://thenextweb.com/google/2016/02/25/googles-first-deepmind-ai-health-project-is-missing-something/feed/</wfw:commentRss>
    <slash:comments>0</slash:comments>
<enclosure url="http://cdn1.tnwcdn.com/wp-content/blogs.dir/1/files/2014/04/doctor-crop-520x245.jpg" type="image/jpeg" length="0" />
</item>

I'm using this code to parse the data:

from bs4 import BeautifulSoup
import urllib.request

req = urllib.request.urlopen('http://thenextweb.com/feed/')

xml = BeautifulSoup(req, 'xml')

for item in xml.findAll('item'):
    string = item.description.string
    #new_string = string.split('/>', 1)
    #print(new_string[0]+'/><p>')
    print(string)

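When I run the script everything works fine, except for that particular item. The commented-out lines in the code are meant to split off the img and add a <p> tag to organize the content.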
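A minimal sketch of the split that the commented-out lines describe, written defensively. The helper name split_image and the sample string are assumptions for illustration, not part of the original script. The None check is there because Beautiful Soup's .string attribute is None when a tag has more than one child, in which case calling .split on it would raise an error.

```python
def split_image(description):
    """Split a description into its leading '<img ... />' tag and the rest.

    Hypothetical helper, not part of the original script.
    """
    if description is None:  # Tag.string is None when a tag has several children
        return "", ""
    head, sep, tail = description.partition("/>")
    if not sep:  # no self-closing tag found, so there is no image part
        return "", description
    return head + sep, tail


# Shortened, made-up sample in the same shape as the feed's description text.
sample = '<img width="520" src="a.jpg" /><br />Having been down at the office'
img, rest = split_image(sample)
print(img + "<p>" + rest + "</p>")
```

str.partition splits only at the first occurrence of '/>', which matches the intent of split('/>', 1) in the commented-out code.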

The result I get from that item is:

’s DeepMind office earlier this week its man vs AI machine gaming competition preview, I was tipped off that a potentially-more-serious healthcare announcement would follow soon. That it has, but contrary to what the company’s remit might suggest, this project doesn’t actually contain any artificial intelligence at launch. “To date, no machine learning has been involved in these projects,” the company said. “While there is obvious potential in applying machine learning to these kinds of complex challenges, any decision to do so will led by clinicians.” DeepMind has announced an acquisition in the shape of an Imperial College London&#8230; <br><br><a href="http://thenextweb.com/google/2016/02/25/googles-first-deepmind-ai-health-project-is-missing-something/?utm_source=social&#038;utm_medium=feed&#038;utm_campaign=profeed">This story continues</a> at The Next Web

I don't know what is happening. If anyone can help me, or point me to a way to extract the exact <img> tag, I would be very grateful.
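One possible way to pull out the <img> tag once the full description text is available is to parse that text as HTML rather than cutting the string by hand. The sketch below uses the standard library's html.parser instead of Beautiful Soup so that it runs without extra packages. The class name ImgCollector is an assumption for illustration, and the sample string is a shortened copy of the description shown above.

```python
from html.parser import HTMLParser


class ImgCollector(HTMLParser):
    """Collect the attributes of every <img> tag seen (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <img ... /> also reach this method,
        # because the default handle_startendtag delegates to it.
        if tag == "img":
            self.images.append(dict(attrs))


# Shortened copy of the description content from the problem item.
sample = (
    '<img width="520" height="245" '
    'src="http://cdn1.tnwcdn.com/wp-content/blogs.dir/1/files/2014/04/doctor-crop-520x245.jpg" '
    'alt="Doctors Seek Higher Fees From Health Insurers" data-id="750745" />'
    '<br />Having been down at the office earlier this week'
)

collector = ImgCollector()
collector.feed(sample)
collector.close()
print(collector.images[0]["src"])
```

This only helps after the description has been read in full; it does not by itself explain why the text came back truncated.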

【Question comments】:

A quick guess based on this: first import html, then try string = html.unescape(item.description.string). You'd need to check the bs4 API, but you may instead need string = html.unescape(item.description.text).
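To show what html.unescape does to text of this kind, here is a small standalone sketch. The escaped sample string is made up for illustration. Note that html.unescape expects a str, so a None value (which .string can return) would need to be handled before calling it.

```python
import html

# Made-up sample: markup escaped once, plus a character reference escaped twice.
escaped = '&lt;img width="520" src="a.jpg" /&gt;&lt;br /&gt;DeepMind&amp;#8230;'

once = html.unescape(escaped)   # turns &lt; &gt; &amp; back into < > &
twice = html.unescape(once)     # turns the remaining &#8230; into an ellipsis

print(once)
print(twice)
```

A single call restores the markup but leaves the doubly escaped reference as &#8230;; a second call is what converts it to the ellipsis character.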

Tags: python xml parsing python-3.x beautifulsoup

【Solution 1】:

Why not search for the description tag inside your for loop, like this:

for item in xml.find_all('item'):
    s = item.find('description')  # the whole <description> tag, not just its .string
    print(s)
    >>> <description>&lt;img width="520" height="245" src="http://cdn1.tnwcdn.com/wp-content/blogs.dir/1/files/2016/02/shutterstock_366588536-520x245.jpg" alt="Fintech" title="5 British companies for FinTech Week" data-id="956789" /&gt;&lt;br /&gt;FinTech, financial technology, is about disrupting the stale financial sector with technology and innovation. Have you accepted the status quo of a bank-led dominance? The people in the flourishing FinTech field have rejected it. Last year, Eileen Burbidge, the UK government’s special envoy for FinTech stated: “London and the UK will lead the FinTech sector.” That’s not hard to believe. With a well-established financial sector, a cultivated tech scene and wide access to capital and talent, London is primed for FinTech. The industry generated over $9 billion in revenue last year. As the UK celebrates #FinTechWeek, we look at five British&amp;#8230; &lt;br&gt;&lt;br&gt;&lt;a href="http://thenextweb.com/insider/2016/02/25/5-british-companies-for-fintech-week/?utm_source=social&amp;#038;utm_medium=feed&amp;#038;utm_campaign=profeed"&gt;This story continues&lt;/a&gt; at The Next Web</description>
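The output above is the serialized tag, so the markup inside the description appears with its entities escaped. To cross-check what a feed item actually contains, independently of Beautiful Soup, the same kind of item can be read with the standard library's xml.etree.ElementTree. This is a swapped-in technique, not the answer's method, and the item below is a made-up, shortened sample.

```python
import xml.etree.ElementTree as ET

# Made-up, shortened item in the same shape as the feed entries above.
ITEM_XML = """<item>
    <title>Example item</title>
    <description><![CDATA[<img src="a.jpg" /><br />Google’s DeepMind office &#8230; <a href="http://example.com/?a=1&#038;b=2">more</a>]]></description>
</item>"""

item = ET.fromstring(ITEM_XML)

# findtext returns the element's text; CDATA content comes back verbatim,
# so character references inside it are not expanded by the XML parser.
description = item.findtext("description")
print(description)
```

If this returns the complete text for the problem item while the Beautiful Soup script does not, that points at the parsing step rather than at the feed itself.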

【Comments】: