【Posted】: 2018-12-18 16:20:32
【Question】:
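I am trying to extract 11 fields from a standard SEC filing on EDGAR (SEC.gov) and return them in a simple dictionary. When I run the code below, 7 of the fields work fine, but 4 of them (named "Director", "Officer", "Person", and "Ticker" in the code) return an empty list, even though the page shows actual text in those fields, and I can't figure out how to fix it. I got the CSS selectors for these fields using DevTools in Chrome, from the Elements tab of the page I am trying to scrape. One thing to note is that the CSS selectors for these 4 fields are longer than the ones that work (i.e., the "tree" describing the location on the page is deeper than for the other fields), so I suspect I am doing something wrong syntax-wise in how I point to these 4 fields.
As a side note, I am new to Python. Early on while working on this problem I learned that with Beautiful Soup, CSS selector references must use "nth-of-type" rather than "nth-child", so I have already made those changes in my code.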
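To make the "nth-of-type" point concrete, here is a minimal sketch on a made-up HTML fragment (an assumption for illustration, not the SEC page). `nth-of-type` counts only siblings with the same tag name, whereas `nth-child` counts every sibling element; as far as I understand, Beautiful Soup releases before 4.7 accepted only `nth-of-type` in `select()`.

```python
import bs4

# Hypothetical fragment for illustration only; not taken from the SEC filing.
html = "<div><p>intro</p><span>first</span><span>second</span></div>"
soup = bs4.BeautifulSoup(html, "html.parser")

# nth-of-type(1) picks the first <span> among its siblings, even though that
# <span> is the second child of the <div> overall.
first_span = soup.select("div > span:nth-of-type(1)")
second_span = soup.select("div > span:nth-of-type(2)")

# select() always returns a list, so index into it before reading .text.
first_text = first_span[0].text
second_text = second_span[0].text
print(first_text, second_text)
```

Because `select()` returns a list, a selector that matches nothing gives `[]` rather than raising an error, which is consistent with the empty lists described above.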
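I don't understand why these 4 fields do not return the data shown on the form while the other 7 fields work fine. Any help or guidance would be greatly appreciated!
Note: I am using Python 3.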
import bs4, requests, pprint

def getFormData(form4url):
    res = requests.get(form4url)
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text, 'html.parser')
    # scrape the data from each field of the SEC Form 4 document. Each field is identified by its
    # CSS selector from the web page's html (viewed using DevTools -> Elements tab in Chrome)
    person = soup.select('body > table:nth-of-type(2) > tbody > tr:nth-of-type(1) > td:nth-of-type(1) > table:nth-of-type(2) > tbody > tr > td > a')
    ticker = soup.select('body > table:nth-of-type(2) > tbody > tr:nth-of-type(1) > td:nth-of-type(2) > span.FormData')
    director = soup.select('body > table:nth-of-type(2) > tbody > tr:nth-of-type(1) > td:nth-of-type(3) > table > tbody > tr:nth-of-type(1) > td:nth-of-type(1) > span')
    officer = soup.select('body > table:nth-of-type(2) > tbody > tr:nth-of-type(1) > td:nth-of-type(3) > table > tbody > tr:nth-of-type(2) > td:nth-of-type(1)')
    security = soup.select('body > table:nth-of-type(3) > tbody > tr:nth-of-type(1) > td:nth-of-type(1) > span')
    date = soup.select('body > table:nth-of-type(3) > tbody > tr:nth-of-type(1) > td:nth-of-type(2) > span')
    tCode = soup.select('body > table:nth-of-type(3) > tbody > tr:nth-of-type(1) > td:nth-of-type(4)')
    qtyTrans = soup.select('body > table:nth-of-type(3) > tbody > tr:nth-of-type(1) > td:nth-of-type(6) > span.FormData')
    transType = soup.select('body > table:nth-of-type(3) > tbody > tr:nth-of-type(1) > td:nth-of-type(7) > span')
    price = soup.select('body > table:nth-of-type(3) > tbody > tr:nth-of-type(1) > td:nth-of-type(8) > span.FormData')
    qtyAfter = soup.select('body > table:nth-of-type(3) > tbody > tr:nth-of-type(1) > td:nth-of-type(9) > span')
    return {'Person':person,'Ticker':ticker,'Director':director,'Officer':officer, \
            'Security':security,'Date':date, 'Trans Code':tCode, 'Quantity':qtyTrans, \
            'Trans Type':transType,'Price':price,'Qty After':qtyAfter}

# this is the website to scrape
userLink = 'https://www.sec.gov/Archives/edgar/data/1539638/000120919118040737/xslF345X03/doc4.xml'
dataDict = getFormData(userLink)

# following just cleans up values in dict by removing html from scraped fields (lists of
# strings), leaving only the visible text
for key,value in dataDict.items():
    if len(value) > 0:
        dataDict[key] = dataDict[key][0].text.strip()
pprint.pprint(dataDict)
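One way to narrow down where a long selector stops matching is to test progressively longer prefixes of it. The following is a minimal sketch under stated assumptions: `SAMPLE_HTML` is a made-up fragment (not the real SEC page) and `first_failing_prefix` is a hypothetical helper name. The sample deliberately omits `<tbody>` in its nested table, because a browser's Elements tab shows the parsed DOM, which can contain elements that are absent from the raw HTML that `requests` downloads. I have not confirmed that this is what happens on the SEC page.

```python
import bs4

# Hypothetical HTML for illustration; the inner table deliberately has no <tbody>.
SAMPLE_HTML = """
<html><body>
<table><tbody><tr><td>
  <table><tr><td><a href="#">Jane Doe</a></td></tr></table>
</td></tr></tbody></table>
</body></html>
"""

def first_failing_prefix(soup, selector):
    """Return the shortest prefix of a '>'-chained selector that matches nothing.

    Returns None if the full selector matches. The split on '>' is naive and
    assumes the selector uses only child combinators.
    """
    steps = [step.strip() for step in selector.split(">")]
    for i in range(1, len(steps) + 1):
        prefix = " > ".join(steps[:i])
        if not soup.select(prefix):
            return prefix
    return None

soup = bs4.BeautifulSoup(SAMPLE_HTML, "html.parser")

# Selector written the way DevTools might report it, with <tbody> at both levels.
selector = "body > table > tbody > tr > td > table > tbody > tr > td > a"
failing = first_failing_prefix(soup, selector)
print(failing)
```

Running the same helper against the downloaded Form 4 page with each of the four failing selectors would show at which step the raw HTML diverges from what DevTools displays.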
【Discussion】:
- Are there any suggestions for a more appropriate forum where I could post this question?
Tags: python html css web-scraping beautifulsoup