【Question title】: Beautiful Soup web scraping with CSS selectors
【Posted】: 2018-12-18 16:20:32
【Question】:

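I am trying to extract 11 fields from a standard SEC filing on EDGAR (SEC.gov) and return them in a simple dictionary. When I run the code below, 7 of the fields work fine, but 4 of them (named "Director", "Officer", "Person", and "Ticker" in the code) return an empty list, even though the page shows actual text in those fields, and I can't figure out why. I got the CSS selectors for these fields from DevTools in Chrome, by inspecting the page in the Elements tab. One thing to note is that the CSS selectors for these 4 fields are longer than the ones that work (that is, the "tree" describing the location on the page is longer than for the other fields), so I suspect I am getting the syntax wrong when pointing at these 4 fields.

As a side note, I am new to Python, and early on while working on this I learned that with Beautiful Soup, CSS selector references must use "nth-of-type" rather than "nth-child", so I have already made that change in my code.

I can't see why these 4 fields don't return the data shown on the form while the other 7 work fine. Any help or guidance would be greatly appreciated!

Note: I am using Python 3.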

import bs4, requests, pprint

def getFormData(form4url):
    res = requests.get(form4url)
    res.raise_for_status()

    soup = bs4.BeautifulSoup(res.text, 'html.parser')

    # scrape the data from each field of the SEC Form 4 document. Each field is identified by its
    # CSS selector from the web page's html (viewed using DevTools -> Elements tab in Chrome)
    person = soup.select('body > table:nth-of-type(2) > tbody > tr:nth-of-type(1) > td:nth-of-type(1) > table:nth-of-type(2) > tbody > tr > td > a')
    ticker = soup.select('body > table:nth-of-type(2) > tbody > tr:nth-of-type(1) > td:nth-of-type(2) > span.FormData')
    director = soup.select('body > table:nth-of-type(2) > tbody > tr:nth-of-type(1) > td:nth-of-type(3) > table > tbody > tr:nth-of-type(1) > td:nth-of-type(1) > span')
    officer = soup.select('body > table:nth-of-type(2) > tbody > tr:nth-of-type(1) > td:nth-of-type(3) > table > tbody > tr:nth-of-type(2) > td:nth-of-type(1)')
    security = soup.select('body > table:nth-of-type(3) > tbody > tr:nth-of-type(1) > td:nth-of-type(1) > span')
    date = soup.select('body > table:nth-of-type(3) > tbody > tr:nth-of-type(1) > td:nth-of-type(2) > span')
    tCode = soup.select('body > table:nth-of-type(3) > tbody > tr:nth-of-type(1) > td:nth-of-type(4)')
    qtyTrans = soup.select('body > table:nth-of-type(3) > tbody > tr:nth-of-type(1) > td:nth-of-type(6) > span.FormData')
    transType = soup.select('body > table:nth-of-type(3) > tbody > tr:nth-of-type(1) > td:nth-of-type(7) > span')
    price = soup.select('body > table:nth-of-type(3) > tbody > tr:nth-of-type(1) > td:nth-of-type(8) > span.FormData')
    qtyAfter = soup.select('body > table:nth-of-type(3) > tbody > tr:nth-of-type(1) > td:nth-of-type(9) > span')

    return {'Person': person, 'Ticker': ticker, 'Director': director, 'Officer': officer,
            'Security': security, 'Date': date, 'Trans Code': tCode, 'Quantity': qtyTrans,
            'Trans Type': transType, 'Price': price, 'Qty After': qtyAfter}

# this is the website to scrape
userLink = 'https://www.sec.gov/Archives/edgar/data/1539638/000120919118040737/xslF345X03/doc4.xml'
dataDict = getFormData(userLink)

# following just cleans up values in dict by removing html from scraped fields (lists of
# tags), leaving only the visible text; fields with no match keep their empty list
for key, value in dataDict.items():
    if len(value) > 0:
        dataDict[key] = value[0].text.strip()

pprint.pprint(dataDict)

【Comments】:

  • Any suggestions for a more appropriate forum where I could post this question?

Tags: python html css web-scraping beautifulsoup


【Answer 1】:

The correct CSS selectors for Person, Ticker, Director, and Officer are:

person: "table:nth-of-type(2) > tr > td > table"
ticker: "table:nth-of-type(2) > tr > td:nth-of-type(2) > span:nth-of-type(2)"
director: "table:nth-of-type(2) > tr > td:nth-of-type(3) > table > tr > td"
officer: "table:nth-of-type(2) > tr > td:nth-of-type(3) > table > tr:nth-of-type(2) > td"
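
Note that these selectors drop the `tbody` steps that appear in the DevTools paths. A likely explanation, assuming the HTML served for that table contains no literal `<tbody>`, is that Chrome inserts `<tbody>` into the DOM it displays, while Python's `html.parser` keeps the markup as written. The sketch below illustrates this on a made-up fragment (the HTML and variable names are illustrative, not taken from the SEC page):

```python
from bs4 import BeautifulSoup

# Made-up fragment: the first table has no literal <tbody>, the second one does.
HTML = """
<html><body>
<table><tr><td><span>ACME</span></td></tr></table>
<table><tbody><tr><td><span>Common Stock</span></td></tr></tbody></table>
</body></html>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Path as DevTools would show it: needs a <tbody> that is not in the source.
with_tbody = soup.select("table:nth-of-type(1) > tbody > tr > td > span")

# Same path without the browser-inserted <tbody>.
without_tbody = soup.select("table:nth-of-type(1) > tr > td > span")

# The second table really contains <tbody>, so the DevTools-style path matches.
literal_tbody = soup.select("table:nth-of-type(2) > tbody > tr > td > span")

print(len(with_tbody), len(without_tbody), len(literal_tbody))
```

If this is what happens on the real page, it would also be consistent with the question's other 7 selectors working: those point at a table whose source may include `<tbody>` explicitly. Checking the raw response text (`res.text`) rather than the Elements tab is one way to confirm.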

Here is a demo using Node.js and x-ray with the sample link you provided: https://codesandbox.io/s/j489wlyzmw

The demo does not return a value for Officer because Officer is not set.
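
On the Python side, a field that is not set can be handled explicitly rather than left as an empty list. The following is a minimal sketch, assuming Beautiful Soup's `select_one` is available in your version; `text_or_none` is a hypothetical helper name and the HTML is a made-up fragment, not the SEC form:

```python
from bs4 import BeautifulSoup

# Made-up fragment: the first cell has text, the second cell is empty.
HTML = "<table><tr><td><span>X</span></td><td></td></tr></table>"


def text_or_none(soup, selector):
    """Return the stripped text of the first match, or None if absent or blank."""
    node = soup.select_one(selector)
    if node is None:
        return None
    text = node.get_text(strip=True)
    return text or None


soup = BeautifulSoup(HTML, "html.parser")
filled = text_or_none(soup, "td:nth-of-type(1)")
blank = text_or_none(soup, "td:nth-of-type(2)")
missing = text_or_none(soup, "td:nth-of-type(3)")
print(filled, blank, missing)
```

Returning `None` for both "no such element" and "element with no text" keeps every dictionary value either a string or `None`, which may be easier to work with than a mix of strings and empty lists.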

【Comments】:
