【Question title】: BeautifulSoup: Get the HTML Code of Modal Footer
【Posted】: 2021-09-30 16:48:31
【Question】:

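I am new to web scraping in Python, and I am trying to scrape all the .htm document links from the SEC EDGAR full-text search. I can see the links in the modal footer, but BeautifulSoup does not parse the elements whose href holds the link.

Is there a simple way to get the document links?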

import requests
from bs4 import BeautifulSoup

url = 'https://www.sec.gov/edgar/search/#/q=ex10&category=custom&forms=10-K%252C10-Q%252C8-K'
source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text, 'html.parser')

for a in soup.find_all(id="open-file"):
    print(a)

【Comments】:

    Tags: python html-parsing edgar


    【Solution 1】:

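    This data is loaded dynamically by JavaScript, so it is not in the HTML that requests.get() returns. There is plenty of information available about scraping this kind of page; in this case, the following should help: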

    import requests
    import json
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:93.0) Gecko/20100101 Firefox/93.0',
        'Accept': 'application/json, text/javascript, */*; q=0.01',
    }
    
    data = '{"q":"ex10","category":"custom","forms":["10-K","10-Q","8-K"],"startdt":"2020-10-08","enddt":"2021-10-08"}'
    # obviously, you need to change "startdt" and "enddt" as necessary
    response = requests.post('https://efts.sec.gov/LATEST/search-index', headers=headers, data=data)
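The payload above appears to mirror the parameters in the fragment (the part after `#`) of the original search URL. The fragment is never sent to the server, which is why the plain GET request returned a page without the links. As a rough sketch, the payload could be derived from the URL like this; `payload_from_search_url` is a hypothetical helper, not part of any library, and the mapping from fragment to payload is an assumption based only on the two values shown in this thread.

```python
import json
from urllib.parse import parse_qs, unquote, urlsplit


def payload_from_search_url(url, startdt=None, enddt=None):
    """Hypothetical helper: turn the search URL's fragment into a payload dict."""
    # Everything after '#' is the fragment; it stays on the client side.
    fragment = urlsplit(url).fragment.lstrip('/')
    # parse_qs returns a list per key; keep the first value of each.
    payload = {key: values[0] for key, values in parse_qs(fragment).items()}
    if 'forms' in payload:
        # parse_qs decodes once ('%252C' -> '%2C'); decode again to get ','.
        payload['forms'] = unquote(payload['forms']).split(',')
    if startdt:
        payload['startdt'] = startdt
    if enddt:
        payload['enddt'] = enddt
    return payload


url = 'https://www.sec.gov/edgar/search/#/q=ex10&category=custom&forms=10-K%252C10-Q%252C8-K'
payload = payload_from_search_url(url, startdt='2020-10-08', enddt='2021-10-08')
print(json.dumps(payload))
```

The resulting string from `json.dumps(payload)` could then be passed as `data=` in the POST request in place of the hand-written one.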
    

    The response is in JSON format. Your URLs are hidden in it:

    data = json.loads(response.text)
    hits = data['hits']['hits']
    for hit in hits:
        cik = hit['_source']['ciks'][0]
        # '_id' is split on ':' into the filing number and the file name
        file_data = hit['_id'].split(":")
        # the archive folder name is the filing number without dashes
        filing = file_data[0].replace('-','')
        file_name = file_data[1]
        url = f'https://www.sec.gov/Archives/edgar/data/{cik}/{filing}/{file_name}'
        print(url)
    

    Output:

    https://www.sec.gov/Archives/edgar/data/0001372183/000158069520000415/ex10-5.htm
    https://www.sec.gov/Archives/edgar/data/0001372183/000138713120009670/ex10-5.htm
    https://www.sec.gov/Archives/edgar/data/0001540615/000154061520000006/ex10.htm
    https://www.sec.gov/Archives/edgar/data/0001552189/000165495421004948/ex10-1.htm
    

    and so on.
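For reuse, the URL-building step can be pulled into a small function and checked offline. This is only a sketch: `document_url` is a hypothetical name, and `sample_hit` is a hand-made dictionary reconstructed from the first output line above (keeping only the two fields the loop reads), not a real API response.

```python
def document_url(hit):
    """Hypothetical helper: build the archive URL for one search hit."""
    cik = hit['_source']['ciks'][0]
    # Split only on the first ':' so a file name containing ':' stays intact.
    filing, file_name = hit['_id'].split(':', 1)
    folder = filing.replace('-', '')
    return f'https://www.sec.gov/Archives/edgar/data/{cik}/{folder}/{file_name}'


# Hand-made sample reconstructed from the first output line; not real API data.
sample_hit = {
    '_id': '0001580695-20-000415:ex10-5.htm',
    '_source': {'ciks': ['0001372183']},
}
print(document_url(sample_hit))
```

With a real response, the same function could be applied to every entry of `data['hits']['hits']`.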

    【Comments】:
