【Posted】: 2021-03-13 14:48:44
【Question】:
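The following code works well for scraping fields from the web page, but I want to scrape one more piece of information from the page (the actual study completion date).
I added it to the end of the list named "subset", expecting the code to find that field and scrape it the same way it does the other fields. However, this field is not being scraped.
How can I get it?
(For reference, the URL is https://clinicaltrials.gov/ct2/show/study/NCT02170532)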
import bs4
from collections import defaultdict
from bs4 import BeautifulSoup
import requests

def clinicalTrialsGov(nctid):
    data = defaultdict(list)
    soup = BeautifulSoup(requests.get("https://clinicaltrials.gov/ct2/show/" + nctid + "?displayxml=true").text, "xml")
    subset = ['intervention_type', 'study_type', 'allocation', 'intervention_model', 'primary_purpose', 'masking', 'enrollment', 'official_title', 'condition', 'minimum_age', 'maximum_age', 'gender', 'healthy_volunteers', 'phase', 'primary_outcome', 'secondary_outcome', 'number_of_arms', 'actual_study_completion_date']
    for tag in soup.find_all(subset):
        data['ct{}'.format(tag.name.capitalize())].append(tag.get_text(strip=True))
    for key in data:
        print('{}: {}'.format(key, ', '.join(data[key])))

clinicalTrialsGov('NCT02170532')
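A possible way to narrow this down: find_all with a list of names only returns elements whose tag name exactly matches an entry in the list, so a name that does not occur in the XML produces no output and no error. The sketch below uses only the standard library (xml.etree.ElementTree, swapped in here instead of BeautifulSoup so it runs without a network call) to list the element names a document actually contains and to report which requested names are absent. SAMPLE_XML, element_names, and missing_names are hypothetical and made up for illustration; in particular, the element name completion_date in the sample is an assumption, not a confirmed detail of the real response.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document. The element names and values here are
# assumptions for illustration, not the real ClinicalTrials.gov response.
SAMPLE_XML = """<clinical_study>
  <official_title>Example title</official_title>
  <completion_date type="Actual">January 2000</completion_date>
  <primary_completion_date type="Actual">January 2000</primary_completion_date>
</clinical_study>"""


def element_names(xml_text):
    """Return the sorted, de-duplicated element names found in xml_text."""
    root = ET.fromstring(xml_text)
    return sorted({element.tag for element in root.iter()})


def missing_names(xml_text, wanted):
    """Return the entries of wanted that match no element name in xml_text."""
    present = set(element_names(xml_text))
    return [name for name in wanted if name not in present]


# Names that exist in the sample document.
print(element_names(SAMPLE_XML))

# Requested names that can never match, because no element has that name.
print(missing_names(SAMPLE_XML, ["official_title", "actual_study_completion_date"]))
```

Running the same two helpers on the text returned by requests.get(...) for the real URL would show whether 'actual_study_completion_date' exists as an element name at all, and which similarly named element (if any) should go into "subset" instead.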
【Discussion】:
Tags: python web-scraping beautifulsoup