【Question title】: BeautifulSoup: Unable to extract href with several conditions
【Posted】: 2021-06-11 12:19:49
【Question】:

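I am trying to use the code from this GitHub repository to extract links from the SEC website (for example, this one) with BeautifulSoup. The problem is that I don't want every 8-K filing, only those whose "Description" column mentions item "2.02". So I edited the "Download.py" file and located the following loop: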

    while continuation_tag:
        r = requests_get(browse_url, params=requests_params)
        if continuation_tag == 'first pass':
            logger.debug("EDGAR search URL: " + r.url)
            logger.info('-' * 100)
        data = r.text
        soup = BeautifulSoup(data, "html.parser")
        for link in soup.find_all('a', {'id': 'documentsbutton'}):   
            URL = sec_website + link['href']
            linkList.append(URL)
        continuation_tag = soup.find('input', {'value': 'Next ' + str(count)}) # a button labelled 'Next 100' for example
        if continuation_tag:
            continuation_string = continuation_tag['onclick']
            browse_url = sec_website + re.findall(r'cgi-bin.*count=\d*', continuation_string)[0]  # raw string so \d is not treated as a string escape
            requests_params = None
    return linkList

I tried adding another loop to match my regular expression, but it does not work:

for link in soup.find_all('a', {'id': 'documentsbutton'}):
    for link in soup.find_all(string=re.compile("items 2.02")):
        URL = sec_website + link['href']
        linkList.append(URL)

Any help would be appreciated, thanks!

【Comments】:

    Tags: python regex web-scraping beautifulsoup


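    【Solution 1】:

    First find the tr that wraps both the a tag and the td containing the text items 2.02. Then, only if that td actually contains the text items 2.02, take the URL from the a tag in the same tr.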

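    for row in soup.find_all("tr"):
        # The description cell carries class "small"; header rows have none.
        td = row.find('td', {'class': 'small'})
        # Look up the anchor separately so a row without one is skipped
        # instead of raising a TypeError on None['href'].
        a = row.find('a', {'id': 'documentsbutton'})
        if td and a and 'items 2.02' in td.text:
            URL = sec_website + a['href']
            linkList.append(URL)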
    

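    【Comments】:

      【Solution 2】:

      You can write this more concisely with CSS pseudo-classes. The selector below looks for td elements inside an element with class tableFile2 that have an adjacent sibling td (i.e. the next column) which is both the third column of the table (nth-of-type) and contains 2.02; from those tds it then selects the descendant a tags with id documentsbutton.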

      import requests
      from bs4 import BeautifulSoup as bs # version 4.7.1 +

      base = 'https://www.sec.gov'
      r = requests.get('https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=0000320193&type=8-K&dateb=&owner=exclude&start=0&count=40')
      soup = bs(r.content, 'lxml') # or html.parser
      # td:has(+ td...) keeps the cell whose next sibling is the third column containing "2.02".
      # Newer soupsieve releases deprecate :contains in favour of :-soup-contains.
      links = [base + i['href'] for i in soup.select('.tableFile2 td:has(+ td:nth-of-type(3):contains("2.02")) #documentsbutton')]
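
To see what the selector matches without calling the SEC site, here is a small offline sketch. It assumes soupsieve 2.1 or later, where the pseudo-class is spelled `:-soup-contains`; the `SAMPLE_HTML` markup and the archive paths are invented for illustration.

```python
from bs4 import BeautifulSoup

BASE = "https://www.sec.gov"

# Hypothetical markup imitating an EDGAR results table (not real filings).
SAMPLE_HTML = """
<table class="tableFile2">
  <tr><th>Filings</th><th>Format</th><th>Description</th></tr>
  <tr>
    <td>8-K</td>
    <td><a href="/Archives/a-index.htm" id="documentsbutton">Documents</a></td>
    <td class="small">Current report, items 2.02 and 9.01</td>
  </tr>
  <tr>
    <td>8-K</td>
    <td><a href="/Archives/b-index.htm" id="documentsbutton">Documents</a></td>
    <td class="small">Current report, item 5.02</td>
  </tr>
</table>
"""

# Cell whose immediately following sibling is the third td and mentions 2.02,
# then the documents link inside that cell.
SELECTOR = (
    '.tableFile2 td:has(+ td:nth-of-type(3):-soup-contains("2.02")) '
    "#documentsbutton"
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
links = [BASE + a["href"] for a in soup.select(SELECTOR)]
print(links)
```

The `+` inside `:has()` is what ties the link cell to the description cell of the same row, so only the first sample row is selected.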
      

      【Comments】:
