How to recover an http link from a <link/> tag (answers)

【Question title】: How to recover http link from a <link/> tag
【Posted】: 2021-05-11 03:43:24
【Question description】:

I am trying to recover web links from an RSS page. I am using Python 3, requests, and BeautifulSoup4 on Windows 10. My code is as follows:

import requests
from bs4 import BeautifulSoup

rSS = "http://www.example.com/xml/rss/all.xml"
mYHeaders = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0'}
SourcePage = requests.get(rSS, headers = mYHeaders, timeout=(5,10))
SourceText = SourcePage.text
soup = BeautifulSoup(SourceText, 'html.parser')
Articles = soup.findAll('item')
for i in Articles:
    Title = i.title
    Link = i.link
    Pub = i.pubdate
    print('Title: ', Title)
    print('Link: ', Link)
    print('Pub: ', Pub)

This prints:

Title:  <title>There is some text here</title>
Link:  <link/>
Pub:  <pubdate>Sat, 06 Feb 2021 10:22:41 +0000</pubdate>

Each entry in Articles has the following form:

<item>
<link/>https://www.example.com/news/2021/2/6/blahblah
                <title>Some title text here</title>
<description><![CDATA[Some text here&#039; and here.]]></description>
<pubdate>Sat, 06 Feb 2021 11:58:23 +0000</pubdate>
<category>News</category>
<guid ispermalink="false">https://www.example.com/?t=1234567</guid>
</item>

The problem is with

<link/>

because it is not captured in its proper form, i.e.

<link>...</link>

When I open the same URL (rSS above) in a browser (Firefox), the link tag is displayed correctly:

<item>
<link>
https://www.example.com/blah/blah
</link>
<title>
Some title text here.
</title>
<description>
Some description here.
</description>
<pubDate>Sun, 07 Feb 2021 08:03:48 +0000</pubDate>
<category>News</category>
<guid isPermaLink="false">https://www.example.com/?t=123456</guid>
</item>

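I guess the problem is that I am using html.parser on an XML page. If I need an XML parser instead, could you tell me which one to use with Python 3? The code will run on a Raspberry Pi, but I am developing it on Windows 10.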

Thanks in advance for your solution!

【Question comments】:

Tags: python html hyperlink

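【Solution 1】:

html.parser treats <link> as an empty (void) HTML element, so <link>...</link> is turned into <link/> and the URL ends up as the text node immediately after the tag. You therefore need .next_sibling to get the link. The code would look like this: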

...
for i in Articles:
    Title = i.title
    Link = i.link.next_sibling
    Pub = i.pubdate
    print('Title: ', Title)
    print('Link: ', Link)
    print('Pub: ', Pub)
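
To try this without hitting the network, here is a minimal sketch that applies the same idea to an inline sample. The sample feed and the `extract_items` helper are made up for illustration and are not part of the original code. The `.strip()` is there because the text node after `<link/>` usually carries the trailing newline and indentation.

```python
from bs4 import BeautifulSoup

# Hypothetical sample feed, shaped like the items shown in the question.
SAMPLE = """<rss version="2.0"><channel>
<item>
<link>https://www.example.com/news/2021/2/6/blahblah</link>
<title>Some title text here</title>
<pubDate>Sat, 06 Feb 2021 11:58:23 +0000</pubDate>
</item>
</channel></rss>"""


def extract_items(rss_text):
    """Return title, link and pub date for each <item>, using html.parser."""
    soup = BeautifulSoup(rss_text, "html.parser")
    rows = []
    for item in soup.find_all("item"):
        # html.parser empties <link>, so the URL is the text node that follows it.
        link_node = item.link.next_sibling
        rows.append({
            "title": item.title.text.strip(),
            "link": str(link_node).strip() if link_node is not None else "",
            # html.parser lowercases tag names, hence pubdate rather than pubDate.
            "pub": item.pubdate.text.strip(),
        })
    return rows


for row in extract_items(SAMPLE):
    print("Title:", row["title"])
    print("Link:", row["link"])
    print("Pub:", row["pub"])
```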

Also, if you only want Title and Pub without the surrounding tags, use .text.
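
On the asker's guess about the parser: an XML parser does keep <link>...</link> intact, so no sibling lookup is needed. Below is a minimal sketch using the standard library's xml.etree.ElementTree, which needs no extra install on either Windows or a Raspberry Pi. It assumes the feed is well-formed XML; the sample feed and the `parse_items` name are illustrative only. (BeautifulSoup's own "xml" parser is another option, but it requires lxml to be installed.)

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed, shaped like what the browser shows in the question.
SAMPLE = """<rss version="2.0"><channel>
<item>
<link>
https://www.example.com/blah/blah
</link>
<title>Some title text here.</title>
<pubDate>Sun, 07 Feb 2021 08:03:48 +0000</pubDate>
</item>
</channel></rss>"""


def parse_items(rss_text):
    """Return title, link and pub date for each <item>, using an XML parser."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        items.append({
            # XML is case-sensitive, so the tag is pubDate here, not pubdate.
            "title": item.findtext("title", default="").strip(),
            "link": item.findtext("link", default="").strip(),
            "pub": item.findtext("pubDate", default="").strip(),
        })
    return items


for entry in parse_items(SAMPLE):
    print("Title:", entry["title"])
    print("Link:", entry["link"])
    print("Pub:", entry["pub"])
```

With a live feed, passing SourcePage.content (bytes) instead of SourcePage.text lets the XML parser honor the encoding declared in the feed.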

【Discussion】: