【Posted】: 2020-07-17 11:07:44
【Question】:
I am using Newspaper3k to extract the text of online news articles.
from newspaper import Article
urlw = 'https://www.nzherald.co.nz/nz/news/article.cfm?c_id=1&objectid=12307959'
article = Article(urlw)
article.download()
article.parse()
string1 = article.text
However, the extracted text contains several embedded tweets that I do not need for my analysis. I tried to identify them as follows.
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.nzherald.co.nz/nz/news/article.cfm?c_id=1&objectid=12307959')
soup = BeautifulSoup(r.content, "html.parser")
article_soup = [s.get_text() for s in soup.find_all('p', {'dir': 'ltr'})]
However, I cannot figure out how to remove them from string1.
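One possible way to do the removal, shown here only as a minimal sketch: treat each tweet text collected in article_soup as a snippet and delete it from string1. The helper name remove_snippets is hypothetical (it is not part of Newspaper3k or BeautifulSoup), and the sketch assumes the tweet wording appears in article.text unchanged apart from whitespace, which may not hold for every page.

```python
import re


def remove_snippets(text, snippets):
    """Return text with every occurrence of each snippet removed.

    Hypothetical helper: assumes each snippet appears in text with the
    same words, though possibly with different spacing or line breaks.
    """
    # Handle longer snippets first so a short snippet that is a substring
    # of a longer one does not break the longer match.
    for snippet in sorted(set(snippets), key=len, reverse=True):
        words = snippet.split()
        if not words:
            continue  # skip empty or whitespace-only snippets
        # Escape each word and allow any run of whitespace between words.
        pattern = r"\s+".join(re.escape(word) for word in words)
        text = re.sub(pattern, "", text)
    # Collapse the runs of blank lines left behind by the removals.
    return re.sub(r"\n{3,}", "\n\n", text).strip()
```

With the variables from the question, the call would be string1 = remove_snippets(string1, article_soup). Matching on words with flexible whitespace is a design choice made because two parsers rarely agree on spacing; a plain str.replace would only work when the strings match character for character.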
【Comments】:
Tags: python-3.x string replace