How to scrape non-HTML tags using BeautifulSoup (answers)

【Question title】: How to scrape non-HTML tags using BeautifulSoup
【Posted】: 2020-04-06 17:58:18
【Question】:

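I am trying to scrape data from a website that contains tags such as <a&#32;href="https://evisa.mfa.am/">; please take a look at this website.

Is there a way for BeautifulSoup to extract data from non-HTML tags like this?

Here is a snippet of the full HTML page from the link above: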

<br/>2.&#32;Airlines&#32;must&#32;provide&#32;advance&#32;passenger&#32;information&#32;of&#32;scheduled&#32;arrival&#32;of&#32;nationals&#32;of&#32;Antigua&#32;and&#32;Barbuda&#32;and&#32;resident&#32;diplomats.&#32;<br/><br/><b>ARGENTINA</b>&#32;-&#32;published&#32;02.04.2020&#32;<br/>Passengers&#32;are&#32;not&#32;allowed&#32;to&#32;enter&#32;Argentina&#32;until&#32;12&#32;April&#32;2020.<br/><br/><b>ARMENIA</b>&#32;-&#32;published&#32;22.03.2020&#32;<br/>1.&#32;Nationals&#32;of&#32;China&#32;(People's&#32;Rep.)&#32;with&#32;a&#32;normal&#32;passport&#32;are&#32;no&#32;longer&#32;visa&#32;exempt.&#32;<br/>2.&#32;Nationals&#32;of&#32;Iran&#32;can&#32;no&#32;longer&#32;obtain&#32;a&#32;visa&#32;on&#32;arrival.&#32;They&#32;must&#32;obtain&#32;a&#32;visa&#32;or&#32;an&#32;e-visa&#32;prior&#32;to&#32;their&#32;arrival&#32;in&#32;Armenia.&#32;The&#32;e-visa&#32;can&#32;be&#32;obtained&#32;at&#32;<a&#32;href="https://evisa.mfa.am/">https://evisa.mfa.am/</a>&#32;<br/>3.&#32;Passengers&#32;who&#32;have&#32;been&#32;in&#32;Austria,&#32;Belgium,&#32;China&#32;(People's&#32;Rep.),&#32;Denmark,&#32;France,&#32;Germany,&#32;Iran,&#32;Italy,&#32;Japan,&#32;Korea&#32;(Rep.),&#32;Netherlands,&#32;Norway,&#32;Spain,&#32;Sweden,&#32;Switzerland&#32;or&#32;United&#32;Kingdom&#32;in&#32;the&#32;past&#32;14&#32;days&#32;are&#32;not&#32;allowed&#32;to&#32;enter&#32;Armenia.<br/>-&#32;This&#32;does&#32;not&#32;apply&#32;to&#32;nationals&#32;or&#32;residents&#32;of&#32;Armenia.<br/>-&#32;This&#32;does&#32;not&#32;apply&#32;to&#32;spouses&#32;or&#32;children&#32;of&#32;nationals&#32;of&#32;Armenia.<br/>-&#32;This&#32;does&#32;not&#32;apply&#32;to&#32;employees&#32;of&#32;foreign&#32;diplomatic&#32;missions&#32;and&#32;consular&#32;institutions.<br/>-&#32;This&#32;does&#32;not&#32;apply&#32;to&#32;representations&#32;of&#32;official&#32;international&#32;missions&#32;or&#32;organizations.<br/>4.&#32;Nationals&#32;of&#32;Armenia&#32;who&#32;have&#32;been&#32;in&#32;Austria,&#32;Belgium,&#32;China&#32;(People's&#32;Rep.),&#32;Denmark,&#32;France,&#32;Germany,&#32;Iran,&#32;Italy,&#32;Japan,&#32;Korea&#32;(Rep.),&#32;Netherlands,&#32;Norway,&#32;Spain,&#32;Sweden,&#32;Switzerland&#32;or&#32;United&#32;Kingdom&#32;in&#32;the&#32;past&#32;14&#32;days&#32;must&#32;undergo&#32;14-days&#32;of&#32;quarantine&#32;or&#32;self-isolation&#32;regime.

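【Comments】:

Please provide an example of the input and the expected output!
@αԋɱҽԃαмєяιcαη I have added a small snippet of the HTML code that contains a non-HTML tag. If you would still prefer to see the whole page source, you can follow the website link in the question.
Check the answer below.

Tags: python beautifulsoup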

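【Solution 1】:

Those are called AMP chars (here, &#32; is the HTML character reference for a space); you can look here to learn what they are.

Don't use html.parser. Just use a real parser such as lxml or html5lib.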

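from bs4 import BeautifulSoup
import requests

# Fetch the page; r.content holds the raw bytes of the response body.
r = requests.get(
    "https://www.iatatravelcentre.com/international-travel-document-news/1580226297.htm")
# Parse with html5lib instead of html.parser.
soup = BeautifulSoup(r.content, 'html5lib')

print(soup.prettify())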

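【Discussion】:

So do I have to install the html5lib parser library? If so, how?
I changed my accepted answer, because @Philip's approach also worked but returns everything as text, whereas @αԋɱҽԃ αмєяιcαη's approach produces a soup object that can be used later to parse other tags. Thanks to both of you, though :)
A quick question: how do you know whether a page should be parsed with the regular html.parser or with html5lib?
@AmanSingh That depends on the structure of the site; html5lib is known to escape AMP chars. Check.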
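
A small sketch related to the two questions above (installing html5lib and choosing a parser). html5lib and lxml are separate packages, normally installed with `pip install html5lib` or `pip install lxml`, while html.parser ships with Python. The helper name `available_parsers` is hypothetical and not part of BeautifulSoup; it only reports which parser names can be used in the current environment and does not decide which one suits a given page.

```python
from importlib.util import find_spec
from typing import List

# Parser names accepted as BeautifulSoup's second argument, mapped to the
# module that must be importable for each one to work.
OPTIONAL_PARSERS = {"html5lib": "html5lib", "lxml": "lxml"}


def available_parsers() -> List[str]:
    """Return the parser names usable here, with the built-in one last."""
    found = [name for name, module in OPTIONAL_PARSERS.items()
             if find_spec(module) is not None]
    # html.parser is part of the standard library, so it is always available.
    found.append("html.parser")
    return found


print(available_parsers())
```

Any name in the returned list can be passed as the second argument to BeautifulSoup, which makes it easy to parse the same page with each parser and compare the results.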

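【Solution 2】:

If you fetch the page with requests and remove the broken part of the tags, you can then pass the result to BeautifulSoup.

Below I replace &#32;, since it is just the HTML representation of a space.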

import requests
from bs4 import BeautifulSoup

url = 'https://www.iatatravelcentre.com/international-travel-document-news/1580226297.htm'

response = requests.get(url)
# Replace every &#32; with a plain space before parsing.
content = response.text.replace('&#32;', ' ')

soup = BeautifulSoup(content, 'html.parser')

Now you can work with it as you normally would with BeautifulSoup.
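
The replace call above targets only &#32;. If a page contained other character references as well, one possible generalization (a sketch, not part of the original answer) is to decode all of them with the standard library's `html.unescape` before parsing. The function name `decode_references` and the second sample string are made up for illustration.

```python
from html import unescape


def decode_references(markup):
    """Decode every HTML character reference (numeric or named) in markup."""
    return unescape(markup)


# A fragment taken from the snippet in the question.
sample = '<a&#32;href="https://evisa.mfa.am/">https://evisa.mfa.am/</a>&#32;<br/>'
print(decode_references(sample))
# <a href="https://evisa.mfa.am/">https://evisa.mfa.am/</a> <br/>

# Hexadecimal and named references are decoded as well (made-up input).
print(decode_references('a&#x20;b&amp;c'))
# a b&c
```

The decoded string can then be passed to BeautifulSoup with html.parser in the same way as `content` above. One caveat: `unescape` also turns references such as `&lt;` into `<`, so text that was escaped on purpose may become markup after decoding.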

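【Discussion】:

Thanks a lot. I had never come across HTML tags like this before. Learned something new, thanks again.
That answer is completely invalid. What if there are multiple escaped characters? What would you do? Just replace every single one?
@αԋɱҽԃαмєяιcαη. This is not an invalid answer at all, since it actually does what the OP asked for. When I looked through the site's HTML, I noticed this pattern and tested against it. If there had been other problems, I would not have proposed my solution. Are there other ways to do this? Of course, and yours looks great too.
@Philip Well, the OP only gave you one example of the data he is dealing with. He is not going to come back with every AMP char and ask you how to escape it. You would not replace each character by hand, least of all when scraping multiple pages where the AMP chars differ from page to page. So in that case you would technically have to handle all of them automatically. Also, the part saying it "looks like the site has added to avoid scraping" sounds misdirected.
@Philip Here is a side note on why HTML is used in AMP.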

【Solution 3】:

You should analyze your HTML code before posting a question.

Now try this to get your URL:

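from bs4 import BeautifulSoup

# test.html is a locally saved copy of the page.
with open("test.html", "r") as f:
    page = f.read()

soup = BeautifulSoup(page, 'html.parser')
# Search for the malformed tag using its literal text as the tag name.
url = soup.find_all("a&#32;href=\"https:")
print(url)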
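
As an alternative sketch for pulling the URL out directly, the raw text can be searched with a regular expression instead of BeautifulSoup. This is a swapped-in technique, not the answer's method. The names `LINK_PATTERN` and `extract_hrefs` are hypothetical, and the pattern assumes that href is the first attribute of the tag and that its value is double-quoted, as in the snippet from the question.

```python
import re

# Matches <a ... href="..."> where the gap between "a" and "href" is either
# ordinary whitespace or the encoded space &#32; seen in the question.
LINK_PATTERN = re.compile(r'<a(?:&#32;|\s)+href="([^"]*)"', re.IGNORECASE)


def extract_hrefs(raw_html):
    """Return the href values of all matching <a> tags in raw_html."""
    return LINK_PATTERN.findall(raw_html)


# A fragment taken from the snippet in the question.
sample = ('The&#32;e-visa&#32;can&#32;be&#32;obtained&#32;at&#32;'
          '<a&#32;href="https://evisa.mfa.am/">https://evisa.mfa.am/</a>&#32;<br/>')
print(extract_hrefs(sample))  # ['https://evisa.mfa.am/']
```

This returns the URL without any follow-up cleaning, but it is tied to this one tag shape and does not replace a real parser for general HTML.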

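【Discussion】:

There is nothing new in your approach that I have not already tried. This does not help.
test.html is your page; I downloaded it. I have parsed out all the tags; you will have to apply functions such as split() and strip() to clean up your URL.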