Scrape specific field: answers

【Question title】: Scrape specific field
【Posted】: 2021-03-13 14:48:44
【Question description】:

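The code below works well for scraping fields from the web page, but I want to scrape one more piece of information from the page (Actual Study Completion Date).

I added it to the end of the list called "subset", expecting it to find that field and scrape it the same way as the other fields. However, this field is not being scraped.

How can I get it?

(For ease of reference, the URL is https://clinicaltrials.gov/ct2/show/study/NCT02170532)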

import bs4
from collections import defaultdict
from bs4 import BeautifulSoup
import requests

def clinicalTrialsGov(nctid):
    data = defaultdict(list)
    soup = BeautifulSoup(requests.get("https://clinicaltrials.gov/ct2/show/" + nctid + "?displayxml=true").text, "xml")
    subset = ['intervention_type', 'study_type', 'allocation', 'intervention_model', 'primary_purpose', 'masking', 'enrollment', 'official_title', 'condition', 'minimum_age', 'maximum_age', 'gender', 'healthy_volunteers', 'phase', 'primary_outcome', 'secondary_outcome', 'number_of_arms','actual_study_completion_date']

    for tag in soup.find_all(subset):
        data['ct{}'.format(tag.name.capitalize())].append(tag.get_text(strip=True))

    for key in data:
        print('{}: {}'.format(key, ', '.join(data[key])))

clinicalTrialsGov('NCT02170532')

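【Question discussion】:

Tags: python web-scraping beautifulsoup

【Solution 1】:

I'm not sure where that would be added. It seems it has to come from a different URL. You can select the td that has a child with a data-term attribute whose value is "Study Completion Date", then use the adjacent sibling combinator (+) to move to the associated date td.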

from collections import defaultdict
from bs4 import BeautifulSoup as bs
import requests

def clinicalTrialsGov(nctid):
    # Reuse one session for both requests to the same host.
    with requests.Session() as s:
        data = defaultdict(list)
        # First request: the XML view of the record, parsed with the "xml" parser.
        soup = bs(s.get("https://clinicaltrials.gov/ct2/show/" + nctid + "?displayxml=true").text, "xml")
        subset = ['intervention_type', 'study_type', 'allocation', 'intervention_model', 'primary_purpose', 'masking', 'enrollment', 'official_title', 'condition', 'minimum_age', 'maximum_age', 'gender', 'healthy_volunteers', 'phase', 'primary_outcome', 'secondary_outcome', 'number_of_arms','primary_completion_date']

        # find_all accepts a list of tag names; collect the text of every match.
        for tag in soup.find_all(subset):
            data['ct{}'.format(tag.name.capitalize())].append(tag.get_text(strip=True))

        # Print the XML fields (the dates added below are returned, not printed here).
        for key in data:
            print('{}: {}'.format(key, ', '.join(data[key])))

        # Second request: the HTML study page, where each date sits in a table cell.
        soup = bs(s.get(f'https://clinicaltrials.gov/ct2/show/study/{nctid}').text, 'lxml')
        # Find the td containing an element with the given data-term, then take the next td.
        data['actual_study_completion_date'] = soup.select_one('td:has([data-term="Study Completion Date"]) + td').text
        data['Study Start Date'] = soup.select_one('td:has([data-term="Study Start Date"]) + td').text
        data['Actual Primary Completion Date'] = soup.select_one('td:has([data-term="Primary Completion Date"]) + td').text
    return data
    
clinicalTrialsGov('NCT02170532')
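
To see the selector from this answer in isolation, here is a minimal sketch that runs against a small inline HTML snippet instead of the live page. The markup, the placeholder dates, and the helper name `date_for_term` are assumptions made up for illustration; the real page may be structured differently. The helper also returns None when nothing matches, whereas calling `.text` directly on a missing match would raise an AttributeError.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup with placeholder dates (not real trial data).
SAMPLE_HTML = """
<table>
  <tr>
    <td><span data-term="Study Start Date">Study Start Date</span></td>
    <td>January 2000</td>
  </tr>
  <tr>
    <td><span data-term="Study Completion Date">Actual Study Completion Date</span></td>
    <td>February 2000</td>
  </tr>
</table>
"""


def date_for_term(soup, term):
    """Return the text of the td that follows the td labelled `term`, or None."""
    # td:has([data-term="..."]) picks the label cell; "+ td" moves to the next cell.
    cell = soup.select_one(f'td:has([data-term="{term}"]) + td')
    return cell.get_text(strip=True) if cell is not None else None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
completion = date_for_term(soup, "Study Completion Date")
missing = date_for_term(soup, "No Such Term")
print(completion)
print(missing)
```

Using the built-in "html.parser" here only avoids the extra lxml dependency; the selector itself is the same one used in the answer.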

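【Discussion】:

Wow, you're a wizard. I also saw you comment on a previous post about clinical trials, so I'm glad to see you :D Would the same approach work for "Study Start Date" and "Actual Primary Completion Date"? If so, could you please include it in your answer? That would complete this scrape.
An absolute legend for clinical trials. Thank you very much, and thank you for still helping the community, since your last comment was on a similar post more than a year ago :)
Oof, let me have a look!
Fixed. Silly me. I overwrote the wrong one while testing. Sorry.
Ah. I got it too. When I did my version, I used lowercase for some reason :D. Thanks very much, boss!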
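
On the lowercase remark: if it refers to the data-term value in the selector, the sketch below shows why that would fail. Attribute values such as data-term are matched case-sensitively, so a lowercase value does not match unless the case-insensitive `i` flag is added to the attribute selector. The inline markup and the text "placeholder" are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Hypothetical one-row table; "placeholder" stands in for a real date.
html = (
    '<table><tr>'
    '<td><span data-term="Study Completion Date">label</span></td>'
    '<td>placeholder</td>'
    '</tr></table>'
)
soup = BeautifulSoup(html, "html.parser")

# Exact case matches the attribute value.
exact = soup.select_one('td:has([data-term="Study Completion Date"]) + td')
# Lowercase value: no match, so select_one returns None.
lower = soup.select_one('td:has([data-term="study completion date"]) + td')
# The trailing "i" flag makes the attribute comparison case-insensitive.
lower_i = soup.select_one('td:has([data-term="study completion date" i]) + td')

print(exact is not None, lower is None, lower_i is not None)
```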