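【Posted】: 2021-11-12 03:18:21
【Question】:
Following this example, "How to extract html links with a matching word from a website using python",
I wrote a web-scraping script that looks for keywords in the current and cached versions of a local newspaper.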
from bs4 import BeautifulSoup
import requests
urls = ["https://www.marinij.com/", 'https://web.archive.org/web/20210811185035/https://www.marinij.com/',
'https://web.archive.org/web/20210506004633/https://www.marinij.com/','https://web.archive.org/web/20210211022431/https://www.marinij.com/',
'https://web.archive.org/web/20201111174202/https://www.marinij.com/','https://web.archive.org/web/20200811204359/https://www.marinij.com/',
'https://web.archive.org/web/20200511165943/https://www.marinij.com/','https://web.archive.org/web/20200209014056/https://www.marinij.com/',
'https://web.archive.org/web/20191111061843/https://www.marinij.com/']
dates = ['today','aug2021','may2021','feb2021','nov2020','aug2020','may2020','feb2020','nov2019']
for i, (url,date) in enumerate(zip(urls,dates)):
    r = requests.get(url)
    soup = BeautifulSoup(r.content)
    covid_links = lambda tag: (getattr(tag, 'name', None) == 'a' and
                               'href' in tag.attrs and
                               ('corona' or 'covid') in tag.get_text().lower())
    results = soup.find_all(covid_links)
    num_art = str((len(results)))
    if not results:
        results = ["The term COVID did not appear in the headlines this quarter!\n"]
    textfile = open("marin_covid_" + date + ".txt", "w")
    for idx, element in enumerate(results):
        element = str(element)
        # print(element)
        if idx == 0:
            textfile.write(date + "\n" + "Number of articles = " + num_art + "\n" + "\n" + element + "\n")
        else:
            textfile.write(element + "\n" + "\n")
    textfile.close()
files = ['marin_covid_today.txt', 'marin_covid_aug2021.txt', 'marin_covid_may2021.txt', 'marin_covid_feb2021.txt', 'marin_covid_nov2020.txt',
'marin_covid_aug2020.txt', 'marin_covid_may2020.txt', 'marin_covid_feb2020.txt']
with open("COVID_articles_in_MIJ.txt", "w") as outfile:
    for filename in files:
        print(filename)
        with open(filename) as infile:
            contents = infile.read()
            outfile.write(contents)
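This works well with a single keyword, but when I try to use "or" to look for any of several keywords, it only searches for the first word. This can be reproduced by swapping the 2 keywords in the example, "covid" and "corona".
I know the problem is in this lambda function, but I can't figure out how to fix it.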
covid_links = lambda tag: (getattr(tag, 'name', None) == 'a' and
'href' in tag.attrs and
('corona' or 'covid') in tag.get_text().lower())
If you have the prerequisites installed, this code should be fully executable. Thanks for any help.
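The symptom can be reproduced without any network access. This is a minimal sketch that assumes two made-up headlines (illustrative strings, not taken from the newspaper). Because `or` returns its first truthy operand, the parenthesized expression collapses to a single string before `in` is ever applied.

```python
# Hypothetical headlines, used only to illustrate the behavior.
headlines = [
    "Coronavirus cases rise in Marin",
    "County expands COVID testing sites",
]

# `or` returns its first truthy operand, so this is just 'corona'.
needle = ('corona' or 'covid')
print(needle)

# Only the first headline matches, because 'covid' is never searched for.
matches = [h for h in headlines if needle in h.lower()]
print(matches)
```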
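【Comments】:
- The expression ('corona' or 'covid') evaluates to 'corona', so that is all that gets searched for. There is nothing you can put on the left-hand side of the in operator to search for multiple values; you have to write it as (('corona' in X) or ('covid' in X)).
- You don't seem to understand the order of operations in Python. ('corona' or 'covid') evaluates to 'corona', so it checks whether 'corona' is in tag.get_text().lower(). Use tag.attrs and ('corona' in tag.get_text().lower() or 'covid' in tag.get_text().lower()) instead.
- That was actually helpful, although you could have been less rude ¯\_(ツ)_/¯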
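As the comments point out, each keyword needs its own `in` test. Below is a minimal sketch of that fix. The helper `mentions_any` and the `KEYWORDS` tuple are hypothetical names, not from the original post. `covid_links` is rewritten as a plain function with the same three conditions as the lambda in the question.

```python
KEYWORDS = ("corona", "covid")  # assumed list, built from the question's two terms


def mentions_any(text, keywords=KEYWORDS):
    """Return True if any keyword occurs in text, ignoring case."""
    lowered = text.lower()
    return any(word in lowered for word in keywords)


def covid_links(tag):
    """Match <a> tags that have an href and mention at least one keyword."""
    return (getattr(tag, "name", None) == "a"
            and "href" in tag.attrs
            and mentions_any(tag.get_text()))
```

Since find_all accepts a function as a filter, soup.find_all(covid_links) should work as in the question. Adding more terms to KEYWORDS requires no change to the filter itself.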
Tags: python web-scraping lambda beautifulsoup python-requests