Remove embedded tweets from HTML when using Newspaper3k

【Question title】: Remove embedded tweets from HTML when using Newspaper3k
【Posted】: 2020-07-17 11:07:44
【Question description】:

I am using Newspaper3k to extract the text of online news articles.

from newspaper import Article

urlw = 'https://www.nzherald.co.nz/nz/news/article.cfm?c_id=1&objectid=12307959'
article = Article(urlw)
article.download()
article.parse()
string1 = article.text

However, the extracted text contains several embedded tweets that I do not need for my analysis. I tried to identify them as follows.

import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.nzherald.co.nz/nz/news/article.cfm?c_id=1&objectid=12307959')
soup = BeautifulSoup(r.content, "html.parser")
article_soup = [s.get_text() for s in soup.find_all('p', {'dir': 'ltr'})]

However, I cannot work out how to remove them from string1.

【Question discussion】:

Tags: python-3.x string replace

【Solution 1】:

Use Beautiful Soup to remove the unwanted HTML tags: find each tag and call extract() on it. After that, use the soup object to find the article content.

import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.nzherald.co.nz/nz/news/article.cfm?c_id=1&objectid=12307959')
r.raise_for_status() # check for 4xx + 5xx status code
soup = BeautifulSoup(r.text, "html.parser")

for tweet in soup.find_all('div', class_='element-oembed'):
    tweet.extract() # remove each div with class 'element-oembed'

articleTag = soup.find(id='article-content')
print(articleTag.text.strip())

Output:

'Traffic is backed up for about 9km after an incident near Spaghetti Junction.  The incident happened about 12pm at the Southern Motorway link to the Northwestern Motorway, westbound.  Drivers were asked to avoid the area and consider using an alternative route.     The New Zealand Transport Agency said at 1.25pm the road had reopened but traffic remained heavy between Penrose and the State Highway 1 link - a journey of about 9km.   Advertisement   Advertise with NZME.     "Consider delaying your journey if possible, or be prepared for delays."  Police have cordoned off a section of footpath on Alex Evans Road, above the motorway, in relation to the incident.'

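【Discussion】:

Thanks for your answer. But I only want the text of the HTML page, without the tweets.
The first part removes the tweets. After that I added an update that finds the content by id and then gets its text (the tweets should be gone from it).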
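
If the aim is to keep working with the string1 that Newspaper3k returned, one possible approach is to cut the tweet texts collected in the question (the article_soup list) out of that string. The following is only a minimal sketch: remove_snippets and normalize_ws are hypothetical helper names, and the sample values are made-up stand-ins for article.text and the tweet paragraphs, not real output. It assumes each tweet text appears in string1 word for word apart from whitespace differences, which may not hold for every page.

```python
import re


def normalize_ws(text):
    """Collapse every run of whitespace into a single space."""
    return re.sub(r"\s+", " ", text).strip()


def remove_snippets(text, snippets):
    """Return text with each snippet removed, comparing on normalized whitespace."""
    cleaned = normalize_ws(text)
    # Remove longer snippets first so a short one cannot break up a longer match.
    for snippet in sorted({normalize_ws(s) for s in snippets}, key=len, reverse=True):
        if snippet:  # skip empty strings, which would otherwise match everywhere
            cleaned = cleaned.replace(snippet, " ")
    return normalize_ws(cleaned)


# Hypothetical stand-ins for article.text (string1) and the tweet
# paragraphs (article_soup) from the question; not real article output.
string1 = (
    "Traffic is backed up for about 9km.\n\n"
    "Avoid the area\nif you can.\n\n"
    "The road has reopened."
)
article_soup = ["Avoid the area if you can."]

print(remove_snippets(string1, article_soup))
```

Note that this sketch flattens paragraph breaks into single spaces, and it only removes the text it is given; anything else the embed contributes to string1, such as an author or date line, would need to be added to the list as well.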