【Question title】: Use OR in Lambda function - Web Scraping Python
【Posted】: 2021-11-12 03:18:21
【Question】:

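Working from this example: How to extract html links with a matching word from a website using python

I wrote a web-scraping script that looks for keywords in the latest edition and in archived versions of my local newspaper.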

from bs4 import BeautifulSoup
import requests

urls = ["https://www.marinij.com/", 'https://web.archive.org/web/20210811185035/https://www.marinij.com/',
        'https://web.archive.org/web/20210506004633/https://www.marinij.com/','https://web.archive.org/web/20210211022431/https://www.marinij.com/',
        'https://web.archive.org/web/20201111174202/https://www.marinij.com/','https://web.archive.org/web/20200811204359/https://www.marinij.com/',
        'https://web.archive.org/web/20200511165943/https://www.marinij.com/','https://web.archive.org/web/20200209014056/https://www.marinij.com/',
        'https://web.archive.org/web/20191111061843/https://www.marinij.com/']

dates = ['today','aug2021','may2021','feb2021','nov2020','aug2020','may2020','feb2020','nov2019']

for i, (url,date) in enumerate(zip(urls,dates)):
    r = requests.get(url)
    # Name the parser explicitly so bs4 does not have to guess one
    soup = BeautifulSoup(r.content, 'html.parser')

    covid_links = lambda tag: (getattr(tag, 'name', None) == 'a' and
                            'href' in tag.attrs and
                            ('corona' or 'covid') in tag.get_text().lower())
    
    results = soup.find_all(covid_links)

    num_art = str((len(results)))
    if not results:
        results = ["The term COVID did not appear in the headlines this quarter!\n"]

    textfile = open("marin_covid_" + date + ".txt", "w")
    for idx, element in enumerate(results):
        element = str(element)
        # print(element)
        if idx == 0:
            textfile.write(date + "\n" + "Number of articles = " + num_art + "\n" + "\n" + element + "\n")

        else:
            textfile.write(element + "\n" + "\n")
    textfile.close()

files = ['marin_covid_today.txt', 'marin_covid_aug2021.txt', 'marin_covid_may2021.txt', 'marin_covid_feb2021.txt', 'marin_covid_nov2020.txt',
        'marin_covid_aug2020.txt', 'marin_covid_may2020.txt', 'marin_covid_feb2020.txt']

with open("COVID_articles_in_MIJ.txt", "w") as outfile:
    for filename in files:
        print(filename)
        with open(filename) as infile:
            contents = infile.read()
            outfile.write(contents)

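It works well with a single keyword, but when I try to use "or" to match any of several keywords, it only searches for the first word. You can reproduce this by swapping the two keywords in the example, "covid" and "corona".

I know the problem is in this lambda function, but I can't figure out how to fix it.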

    covid_links = lambda tag: (getattr(tag, 'name', None) == 'a' and
                            'href' in tag.attrs and
                            ('corona' or 'covid') in tag.get_text().lower())

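If you have the prerequisites installed, this code should be fully executable. Any help is appreciated.

【Comments】:

  • The expression ('corona' or 'covid') evaluates to 'corona', so that is all that gets searched for. Nothing on the left-hand side of the in operator can make it search for multiple values; you have to write it as (('corona' in X) or ('covid' in X))
  • You don't seem to understand the order of operations in Python. ('corona' or 'covid') evaluates to 'corona', so it checks whether 'corona' is in tag.get_text().lower(). Write tag.attrs and ('corona' in tag.get_text().lower() or 'covid' in tag.get_text().lower()) instead
  • That was actually helpful, though you could have been less rude ¯_(ツ)_/¯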
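
The evaluation order described in the comments can be checked in isolation, without fetching any pages. This is a minimal sketch; the `headline` string is made up for illustration and does not come from the scraped site.

```python
# `or` returns its first truthy operand, so the parenthesised
# expression collapses to one string before `in` ever runs.
keywords = ('corona' or 'covid')
print(keywords)  # corona

headline = "county reports new covid cases"  # hypothetical text

# Buggy form: only 'corona' is searched for.
buggy = ('corona' or 'covid') in headline

# Fixed form: each keyword gets its own `in` test.
fixed = ('corona' in headline) or ('covid' in headline)

print(buggy, fixed)  # False True
```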

Tags: python web-scraping lambda beautifulsoup python-requests


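【Solution 1】:

As pointed out in the comments, the "in" operator has to appear on both sides of the "or" operator, so that tag.get_text().lower() is tested against each keyword, "corona" and "covid", separately. The corrected lambda function is: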

covid_links = lambda tag: (getattr(tag, 'name', None) == 'a' and
                            'href' in tag.attrs and
                            ('covid' in tag.get_text().lower() or 'corona' in tag.get_text().lower()))
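
If the keyword list grows, repeating "in" for every term gets unwieldy. One possible generalization, which is not part of the original answer, is to build the predicate with `any()`. In the sketch below, `make_link_filter` and `FakeTag` are hypothetical names; `FakeTag` is only a stand-in exposing the attributes the predicate reads, so the example runs without downloading a page.

```python
def make_link_filter(keywords):
    """Return a predicate matching <a href=...> tags whose text has any keyword."""
    lowered = [k.lower() for k in keywords]

    def matches(tag):
        # Same structural checks as the lambda: an <a> tag with an href.
        if getattr(tag, 'name', None) != 'a' or 'href' not in tag.attrs:
            return False
        text = tag.get_text().lower()  # computed once per tag
        return any(k in text for k in lowered)

    return matches


class FakeTag:
    """Hypothetical stand-in for a parsed tag (name, attrs, get_text)."""
    def __init__(self, name, attrs, text):
        self.name = name
        self.attrs = attrs
        self._text = text

    def get_text(self):
        return self._text


covid_links = make_link_filter(['covid', 'corona'])
print(covid_links(FakeTag('a', {'href': '/x'}, 'Coronavirus update')))    # True
print(covid_links(FakeTag('a', {'href': '/y'}, 'Local sports roundup')))  # False
```

With BeautifulSoup the predicate would be passed the same way as the lambda, i.e. `soup.find_all(covid_links)`.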
