Webscrape using BeautifulSoup to Dataframe - Answers

【Question title】: Webscrape using BeautifulSoup to Dataframe
【Posted】: 2020-08-21 19:39:19
【Question description】:

This is the HTML code:

    <div class="wp-block-atomic-blocks-ab-accordion ab-block-accordion ab-font-size-18"><details><summary class="ab-accordion-title"><strong>American Samoa</strong></summary><div class="ab-accordion-text">
    <ul><li><strong><a href="https://www.americansamoa.gov/covid-19-advisories" target="_blank" rel="noreferrer noopener" aria-label="American Samoa Department of Health Travel Advisory (opens in a new tab)">American Samoa Department of Health Travel Advisory</a></strong></li><li>March 2, 2020—Governor&nbsp;Moliga&nbsp;<a rel="noreferrer noopener" href="https://www.rnz.co.nz/international/pacific-news/410783/american-samoa-establishes-govt-taskforce-to-plan-for-coronavirus" target="_blank">appointed</a>&nbsp;a government taskforce&nbsp;to provide a plan for preparation and response to the covid-19 coronavirus.&nbsp;</li></ul>
    
    <ul><li>March 25, 2020 – The Governor <a href="https://6fe16cc8-c42f-411f-9950-4abb1763c703.filesusr.com/ugd/4bfff9_2d3c78a841824b8aafe05032f853585b.pdf">issued</a> an Executive Order 001 recognizing the Declared Public Health Emergency and State of Emergency, and imminent threat to public health. The order requires the immediate and comprehensive enforcement by the Commissioner of Public Safety, Director of Health, Attorney General, and other agency leaders.
    <ul>
    <li>Business are also required to provide necessary supplies to the public and are prohibited from price gouging.</li>
    </ul>
    </li></ul>
    </div></details></div>

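I want to extract the state, the date, and the text, and add them to a dataframe with these three columns:

State: American Samoa
Date: 2020-03-25
Text: The Governor issued an Executive Order 001 recognizing the Declared Public Health Emergency and State of Emergency, and imminent threat to public health

My code so far: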

    soup = bs4.BeautifulSoup(data)
    for tag in soup.find_all("summary"):
        print("{0}: {1}".format(tag.name, tag.text))
        for tag1 in soup.find_all("li"):
            #print(type(tag1))
            ln = tag1.text
            dt = (ln.split(' – ')[0])
            dt = (dt.split('—')[0])
            #txt = ln.split(' – ')[1]
            print(dt)

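Help needed:

How do I get the text only up to the first "."? I don't need the whole text.
How do I add each result to the dataframe as a new row while looping? (I have attached only part of the web page's source code.)

Thanks for your help!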

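【Discussion】:

Can you check the accuracy of your HTML? For example, where do <details> and <div class="ab-accordion-text"> close?
It wasn't closed anywhere; I have updated the HTML code.
OK. If there are multiple items, do they appear as another <summary> under the same <details>, or as another set of <details>?
Yes, I'm reading a huge web page; another section starts -
California

covid19.ca.gov" target="_blank">California Coronavirus Resource page.
Can you post the URL of the page you are scraping?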

    from bs4 import BeautifulSoup
    import requests
    import re
    import pandas as pd

    HEADERS = {
        'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0',
    }

    # You need to specify User Agent headers or else you get a 403
    data = requests.get("https://www.nga.org/coronavirus-state-actions-all/", headers=HEADERS).text
    soup = BeautifulSoup(data, 'lxml')

    rows_list = []
    for detail in soup.find_all("details"):
        state = detail.find('summary')
        # Only the first <ul> of each section is read here
        ul = detail.find('ul')
        for li in ul.find_all('li', recursive=False):
            # Three types of hyphen are used on this webpage
            split = re.split('(?:-|–|—)', li.text, maxsplit=1)
            if len(split) == 2:
                rows_list.append([state.text, split[0], split[1]])
            else:
                print("Error", li.text)

    df = pd.DataFrame(rows_list)
    # max_colwidth=None means "no limit"; the older -1 value is deprecated since pandas 1.0
    with pd.option_context('display.max_rows', None, 'display.max_columns', None, 'display.max_colwidth', None):
        print(df)
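The question also asked for the date in the form 2020-03-25 and for the text only up to the first period, which the script above does not do. One possible way to post-process each `li.text` with the standard library is sketched below. The helper name `parse_entry` and the "Month D, YYYY" date format are assumptions for illustration, based on the HTML sample in the question, and `%B` relies on an English locale.

```python
import re
from datetime import datetime

# The page mixes hyphen, en dash and em dash as the date/text separator.
DASHES = re.compile(r"\s*[-\u2013\u2014]\s*")


def parse_entry(li_text):
    """Split one <li> text into (iso_date, first_sentence), or return None."""
    # The source HTML uses &nbsp;, which arrives as U+00A0; normalize it first.
    text = li_text.replace("\u00a0", " ").strip()
    parts = DASHES.split(text, maxsplit=1)
    if len(parts) != 2:
        return None  # no separator, e.g. a plain link item
    raw_date, body = parts
    try:
        date = datetime.strptime(raw_date.strip(), "%B %d, %Y").date().isoformat()
    except ValueError:
        return None  # the part before the dash was not a "Month D, YYYY" date
    # Keep everything up to and including the first period, if there is one.
    head, sep, _ = body.partition(".")
    return date, (head + sep).strip()


sample = ("March 25, 2020 \u2013 The Governor issued an Executive Order 001 "
          "recognizing the Declared Public Health Emergency and State of Emergency, "
          "and imminent threat to public health. The order requires the immediate "
          "and comprehensive enforcement.")
print(parse_entry(sample))
```

Returning `None` for items that do not start with a date keeps the caller's loop simple: skip the item or log it, as the answer does with its "Error" branch.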

    from bs4 import BeautifulSoup
    import requests
    import re
    import pandas as pd

    HEADERS = {
        'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0',
    }

    # You need to specify User Agent headers or else you get a 403
    data = requests.get("https://www.nga.org/coronavirus-state-actions-all/", headers=HEADERS).text
    soup = BeautifulSoup(data, 'html.parser')

    # Matches a leading date such as "March 25, 2020" or "Mar 2 2020"
    p = re.compile(r'(\s*(Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)\s*(\d{1,2}),*\s*(\d{4}))', re.IGNORECASE)

    rows_list = []
    for detail in soup.find_all("details"):
        state = detail.find('summary')
        for li in detail.find_all('li'):
            m = re.match(p, li.text)
            if m:
                # Columns: state, matched date, remaining text with the date removed
                rows_list.append([state.text, m.group(0), m.string.replace(m.group(0), '')])
            else:
                print("Error", li.text)

    df = pd.DataFrame(rows_list)
    df.to_csv('out.csv')
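Both scripts build the dataframe without column names, and the regex version leaves the separator dash at the start of the text column. A minimal sketch of one way to tidy this up follows. The column names and the two sample rows (shortened from the HTML in the question) are assumptions standing in for the real `rows_list`.

```python
import pandas as pd

# Hypothetical rows in the [state, date, text] shape produced above.
rows_list = [
    ["American Samoa", "March 2, 2020",
     "\u2014Governor Moliga appointed a government taskforce."],
    ["American Samoa", "March 25, 2020",
     " \u2013 The Governor issued an Executive Order 001."],
]

# Name the three columns the question asked for.
df = pd.DataFrame(rows_list, columns=["State", "Date", "Text"])

# Parse "Month D, YYYY" strings into real datetimes (displayed as 2020-03-25).
df["Date"] = pd.to_datetime(df["Date"].str.strip(), format="%B %d, %Y")

# Drop leading spaces, non-breaking spaces and any of the three dash characters.
df["Text"] = df["Text"].str.lstrip(" \u00a0-\u2013\u2014")

print(df)
```

Collecting rows in a plain list and constructing the dataframe once at the end, as both answers do, is usually cheaper than appending to a dataframe inside the loop.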