Beautiful Soup web scraping with CSS selectors: answers

【Question title】: Beautiful Soup web scraping with CSS selectors
【Posted】: 2018-12-18 16:20:32
【Question description】:

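I'm trying to extract 11 fields from a standard SEC filing on EDGAR (SEC.gov) and return them in a simple dictionary. When I run the code below, 7 of the fields work fine, but 4 of them (named "Director", "Officer", "Person", and "Ticker" in the code) return an empty list, even though those fields show actual text on the page, and I can't figure out how to fix it. I got the CSS selectors for these fields using DevTools in Chrome, from the Elements tab of the page I'm trying to scrape. One thing to note is that the CSS selectors for these 4 fields are longer than the ones that work (i.e., the "tree" describing the location on the page is deeper than for the other fields), so I suspect I'm getting the syntax wrong when pointing at these 4 fields.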

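As a side note, I'm new to Python, and early on while working on this I learned that with Beautiful Soup, CSS selector references must use "nth-of-type" rather than "nth-child", so I have already made those changes to my code.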

I don't know why these 4 fields won't return the data shown on the form while the other 7 work fine. Any help or guidance would be greatly appreciated!

Note: I'm using Python 3.

import bs4, requests, pprint

def getFormData(form4url):
    res = requests.get(form4url)
    res.raise_for_status()

    soup = bs4.BeautifulSoup(res.text, 'html.parser')

    # scrape the data from each field of the SEC Form 4 document. Each field is identified by its
    # CSS selector from the web page's html (viewed using DevTools -> Elements tab in Chrome)
    person = soup.select('body > table:nth-of-type(2) > tbody > tr:nth-of-type(1) > td:nth-of-type(1) > table:nth-of-type(2) > tbody > tr > td > a')
    ticker = soup.select('body > table:nth-of-type(2) > tbody > tr:nth-of-type(1) > td:nth-of-type(2) > span.FormData')
    director = soup.select('body > table:nth-of-type(2) > tbody > tr:nth-of-type(1) > td:nth-of-type(3) > table > tbody > tr:nth-of-type(1) > td:nth-of-type(1) > span')
    officer = soup.select('body > table:nth-of-type(2) > tbody > tr:nth-of-type(1) > td:nth-of-type(3) > table > tbody > tr:nth-of-type(2) > td:nth-of-type(1)')
    security = soup.select('body > table:nth-of-type(3) > tbody > tr:nth-of-type(1) > td:nth-of-type(1) > span')
    date = soup.select('body > table:nth-of-type(3) > tbody > tr:nth-of-type(1) > td:nth-of-type(2) > span')
    tCode = soup.select('body > table:nth-of-type(3) > tbody > tr:nth-of-type(1) > td:nth-of-type(4)')
    qtyTrans = soup.select('body > table:nth-of-type(3) > tbody > tr:nth-of-type(1) > td:nth-of-type(6) > span.FormData')
    transType = soup.select('body > table:nth-of-type(3) > tbody > tr:nth-of-type(1) > td:nth-of-type(7) > span')
    price = soup.select('body > table:nth-of-type(3) > tbody > tr:nth-of-type(1) > td:nth-of-type(8) > span.FormData')
    qtyAfter = soup.select('body > table:nth-of-type(3) > tbody > tr:nth-of-type(1) > td:nth-of-type(9) > span')

    return {'Person':person,'Ticker':ticker,'Director':director,'Officer':officer, \
            'Security':security,'Date':date, 'Trans Code':tCode, 'Quantity':qtyTrans, \
            'Trans Type':transType,'Price':price,'Qty After':qtyAfter}

# this is the website to scrape
userLink = 'https://www.sec.gov/Archives/edgar/data/1539638/000120919118040737/xslF345X03/doc4.xml'
dataDict = getFormData(userLink)

# The following cleans up the values in the dict: each non-empty value is a list of
# matched tags, which is replaced by the visible text of its first match
for key, value in dataDict.items():
    if len(value) > 0:
        dataDict[key] = value[0].text.strip()

pprint.pprint(dataDict)

【Question comments】:

Any suggestions for a more appropriate forum where I could post this question?

Tags: python html css web-scraping beautifulsoup

【Solution 1】:

The correct CSS selectors for Person, Ticker, Director, and Officer are:

person: "table:nth-of-type(2) > tr > td > table"
ticker: "table:nth-of-type(2) > tr > td:nth-of-type(2) > span:nth-of-type(2)"
director: "table:nth-of-type(2) > tr > td:nth-of-type(3) > table > tr > td"
officer: "table:nth-of-type(2) > tr > td:nth-of-type(3) > table > tr:nth-of-type(2) > td"

Here is a demo using Node.js and x-ray with the sample link you provided: https://codesandbox.io/s/j489wlyzmw

The demo does not return a value for Officer because Officer is not set on that form.
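
The selectors above drop the "tbody" steps that appear in the question's selectors. One likely explanation, offered here as an assumption rather than something confirmed against the SEC page, is that Chrome adds tbody elements to its DOM even when the raw HTML has none, while Python's 'html.parser' keeps the markup as written. The sketch below illustrates this in Python on a small made-up HTML snippet (SAMPLE_HTML and first_text are hypothetical names, and the markup is not the real filing):

```python
import bs4

# Hypothetical miniature of a nested-table layout (NOT the real SEC markup).
# The raw HTML contains no <tbody> tags, although a browser would add them
# to its DOM and show them in the DevTools Elements tab.
SAMPLE_HTML = """
<html><body>
<table><tr><td>header</td></tr></table>
<table>
  <tr>
    <td><table><tr><td><a href="#">Jane Doe</a></td></tr></table></td>
    <td><span>Issuer Name</span> [ <span class="FormData">XYZ</span> ]</td>
  </tr>
</table>
</body></html>
"""

soup = bs4.BeautifulSoup(SAMPLE_HTML, "html.parser")

# Selector copied DevTools-style, including the browser-inserted tbody step.
with_tbody = soup.select(
    "body > table:nth-of-type(2) > tbody > tr > td:nth-of-type(2) > span.FormData"
)

# The same path with the tbody step removed.
without_tbody = soup.select(
    "body > table:nth-of-type(2) > tr > td:nth-of-type(2) > span.FormData"
)

# A selector in the style of the answer's "person" selector.
person = soup.select("table:nth-of-type(2) > tr > td > table")


def first_text(matches):
    """Return the stripped text of the first match, or None if nothing matched."""
    return matches[0].get_text(strip=True) if matches else None


print(with_tbody)                 # no match: html.parser did not create a tbody
print(first_text(without_tbody))  # text of the span.FormData element
print(first_text(person))         # text inside the nested table
```

If a selector copied from DevTools returns an empty list, checking res.text for a literal "<tbody" is a quick way to see whether that step exists in the HTML the script actually receives.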

【Comments】: