Beautifulsoup: Unable to extract href with several conditions

【Question title】: Beautifulsoup: Unable to extract href with several conditions
【Posted】: 2021-06-11 12:19:49
【Question】:

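I am trying to extract every link from an SEC website page (for example this one) with BeautifulSoup, using the code from this Github. The problem is that I do not want to extract every 8-K, only the ones whose "Description" column matches item "2.02". So I edited the "Download.py" file and narrowed it down to the following: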

    while continuation_tag:
        r = requests_get(browse_url, params=requests_params)
        if continuation_tag == 'first pass':
            logger.debug("EDGAR search URL: " + r.url)
            logger.info('-' * 100)
        data = r.text
        soup = BeautifulSoup(data, "html.parser")
        for link in soup.find_all('a', {'id': 'documentsbutton'}):   
            URL = sec_website + link['href']
            linkList.append(URL)
        continuation_tag = soup.find('input', {'value': 'Next ' + str(count)}) # a button labelled 'Next 100' for example
        if continuation_tag:
            continuation_string = continuation_tag['onclick']
            browse_url = sec_website + re.findall(r'cgi-bin.*count=\d*', continuation_string)[0]
            requests_params = None
    return linkList

I tried adding another loop to match my regular expression, but it does not work:

for link in soup.find_all('a', {'id': 'documentsbutton'}):
    for link in soup.find_all(string=re.compile("items 2.02")):
        URL = sec_website + link['href']
        linkList.append(URL)

Any help would be greatly appreciated, thanks!

Tags: python regex web-scraping beautifulsoup

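【Solution 1】:

First find the tr that wraps both the a tag and the td containing the text items 2.02. Then take the URL from that tr, but only if the td actually contains the text items 2.02: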

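for row in soup.find_all("tr"):
    # Cell of this row that holds the description text (class "small").
    td = row.find('td', {'class': 'small'})
    if td and 'items 2.02' in td.text:
        # Link with id "documentsbutton" in the same row; skip rows without one.
        a = row.find('a', {'id': 'documentsbutton'})
        if a:
            linkList.append(sec_website + a['href'])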
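
To make the row-based idea easy to try without hitting the SEC site, here is a minimal sketch that runs against a small hand-written table. The sample HTML, the helper name `links_for_item`, and its parameters are assumptions made for illustration and are not part of the original script. The sketch also swaps the plain substring test for an escaped regular expression, so that the `.` in `2.02` only matches a literal dot.

```python
import re
from bs4 import BeautifulSoup

# Hand-written stand-in for an EDGAR results table (assumed structure).
SAMPLE_HTML = """
<table class="tableFile2">
  <tr><th>Filings</th><th>Format</th><th>Description</th></tr>
  <tr>
    <td>8-K</td>
    <td><a href="/Archives/a-index.htm" id="documentsbutton">Documents</a></td>
    <td class="small">Current report, items 2.02 and 9.01</td>
  </tr>
  <tr>
    <td>8-K</td>
    <td><a href="/Archives/b-index.htm" id="documentsbutton">Documents</a></td>
    <td class="small">Current report, items 5.02 and 9.01</td>
  </tr>
  <tr>
    <td>8-K</td>
    <td><a href="/Archives/c-index.htm" id="documentsbutton">Documents</a></td>
    <td class="small">Current report, item 2.02</td>
  </tr>
</table>
"""


def links_for_item(html, item="2.02", base="https://www.sec.gov"):
    """Return the document links of rows whose description mentions `item`."""
    soup = BeautifulSoup(html, "html.parser")
    # re.escape makes "." literal; the lookarounds reject longer numbers
    # such as "12.02" or "2.021".
    pattern = re.compile(r"(?<![\d.])" + re.escape(item) + r"(?!\d)")
    links = []
    for row in soup.find_all("tr"):
        cell = row.find("td", class_="small")
        anchor = row.find("a", id="documentsbutton")
        # Both lookups return None for rows that lack the element.
        if cell and anchor and pattern.search(cell.get_text()):
            links.append(base + anchor["href"])
    return links


print(links_for_item(SAMPLE_HTML))
```

Working row by row keeps the description and the link tied together, which is the part the nested loops in the question lose: each loop there scans the whole page independently.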

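【Solution 2】:

You can write something more concise with CSS pseudo-classes. The selector below looks for td elements, under a parent with class tableFile2, that have an adjacent sibling td (i.e. the next column) which is both the third column of the table (nth-of-type) and contains 2.02; from those tds it then filters down to the child a tags with the id documentsbutton.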

import requests
from bs4 import BeautifulSoup as bs  # version 4.7.1 +

base = 'https://www.sec.gov'
r = requests.get('https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=0000320193&type=8-K&dateb=&owner=exclude&start=0&count=40')
soup = bs(r.content, 'lxml')  # or html.parser
# Newer soupsieve releases deprecate :contains() in favor of :-soup-contains().
links = [base + i['href'] for i in soup.select('.tableFile2 td:has(+ td:nth-of-type(3):contains("2.02")) #documentsbutton')]
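
The selector can also be checked offline. The following is a minimal sketch, assuming a hand-written table that places the link in the second column and the description in the third; the sample HTML and the helper name `select_links` are made up for illustration. It uses `:-soup-contains()`, the spelling that soupsieve 2.1 and later prefer over `:contains()`, so it assumes a reasonably recent soupsieve is installed.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for an EDGAR results table (assumed structure).
SAMPLE_HTML = """
<table class="tableFile2">
  <tr><th>Filings</th><th>Format</th><th>Description</th></tr>
  <tr>
    <td>8-K</td>
    <td><a href="/Archives/a-index.htm" id="documentsbutton">Documents</a></td>
    <td class="small">Current report, items 2.02 and 9.01</td>
  </tr>
  <tr>
    <td>8-K</td>
    <td><a href="/Archives/b-index.htm" id="documentsbutton">Documents</a></td>
    <td class="small">Current report, items 5.02 and 9.01</td>
  </tr>
</table>
"""

# A td whose next sibling is the third td of its row and contains "2.02",
# then the element with id "documentsbutton" inside that td.
SELECTOR = (
    '.tableFile2 td:has(+ td:nth-of-type(3):-soup-contains("2.02")) '
    '#documentsbutton'
)


def select_links(html, base="https://www.sec.gov"):
    """Return absolute links for rows whose third column mentions 2.02."""
    soup = BeautifulSoup(html, "html.parser")
    return [base + a["href"] for a in soup.select(SELECTOR)]


print(select_links(SAMPLE_HTML))
```

Note that `:-soup-contains()` is a plain substring match, so a description mentioning a longer number such as 12.02 would also be selected; if that matters, the row-by-row approach with an explicit regular expression gives tighter control.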
