【Posted】: 2021-06-11 12:19:49
【Question】:
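I am trying to use the code from this GitHub repository to extract, with BeautifulSoup, every link from an SEC website page (for example, this one). The problem is that I don't want to extract every 8-K filing, only those whose "Description" column mentions item "2.02". So I edited the "Download.py" file and identified the following as the relevant part: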
while continuation_tag:
    r = requests_get(browse_url, params=requests_params)
    if continuation_tag == 'first pass':
        logger.debug("EDGAR search URL: " + r.url)
        logger.info('-' * 100)
    data = r.text
    soup = BeautifulSoup(data, "html.parser")
    # Collect the "Documents" button link of every filing row on this page
    for link in soup.find_all('a', {'id': 'documentsbutton'}):
        URL = sec_website + link['href']
        linkList.append(URL)
    continuation_tag = soup.find('input', {'value': 'Next ' + str(count)})  # a button labelled 'Next 100' for example
    if continuation_tag:
        continuation_string = continuation_tag['onclick']
        browse_url = sec_website + re.findall(r'cgi-bin.*count=\d*', continuation_string)[0]
        requests_params = None
return linkList
I tried adding another loop to match my regular expression, but it does not work:
for link in soup.find_all('a', {'id': 'documentsbutton'}):
    for link in soup.find_all(string=re.compile("items 2.02")):
        URL = sec_website + link['href']
        linkList.append(URL)
Any help would be greatly appreciated, thanks!
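For context on why the attempt above fails: `soup.find_all(string=...)` returns text nodes, not `<a>` tags, so `link['href']` has nothing to look up, and the inner loop also overwrites the outer `link` variable. One possible way around this, as a minimal sketch only, is to keep the original loop over the `documentsbutton` links and check the text of the table row each link sits in. The sketch assumes that the description text and the button share the same `<tr>`, which has not been verified against the live EDGAR page. The names `links_for_item`, `ITEM_RE`, `SEC_WEBSITE` and `SAMPLE` are hypothetical and are not part of the repository.

```python
import re
from bs4 import BeautifulSoup

SEC_WEBSITE = "https://www.sec.gov"  # assumed value of sec_website in Download.py

# "item" or "items", followed later on the same line by a standalone 2.02.
# The dot is escaped so that it only matches a literal "."
ITEM_RE = re.compile(r"\bitems?\s[^\n]*?\b2\.02\b", re.IGNORECASE)


def links_for_item(html, base=SEC_WEBSITE):
    """Return document links only for rows whose text mentions item 2.02."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    for link in soup.find_all("a", {"id": "documentsbutton"}):
        row = link.find_parent("tr")  # assumption: description is in the same row
        if row is None:
            continue
        if ITEM_RE.search(row.get_text(" ", strip=True)):
            urls.append(base + link["href"])
    return urls


# Hand-written sample markup (not real EDGAR output), used only for illustration.
SAMPLE = """
<table>
<tr><td>8-K</td>
    <td><a id="documentsbutton" href="/Archives/a-index.htm">Documents</a></td>
    <td>Current report, items 2.02 and 9.01<br>Acc-no: 0000000000-21-000001</td></tr>
<tr><td>8-K</td>
    <td><a id="documentsbutton" href="/Archives/b-index.htm">Documents</a></td>
    <td>Current report, items 5.02 and 9.01<br>Acc-no: 0000000000-21-000002</td></tr>
</table>
"""

print(links_for_item(SAMPLE))
```

Inside the `while` loop of "Download.py", the same row check could presumably replace the body of the existing `for link in ...` loop, leaving the pagination logic untouched.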
【Comments】:
Tags: python regex web-scraping beautifulsoup