【Question title】: Webscrape using BeautifulSoup to DataFrame
【Posted】: 2020-08-21 19:39:19
【Question】:

This is the HTML code:

    <div class="wp-block-atomic-blocks-ab-accordion ab-block-accordion ab-font-size-18"><details><summary class="ab-accordion-title"><strong>American Samoa</strong></summary><div class="ab-accordion-text">
    <ul><li><strong><a href="https://www.americansamoa.gov/covid-19-advisories" target="_blank" rel="noreferrer noopener" aria-label="American Samoa Department of Health Travel Advisory (opens in a new tab)">American Samoa Department of Health Travel Advisory</a></strong></li><li>March 2, 2020—Governor&nbsp;Moliga&nbsp;<a rel="noreferrer noopener" href="https://www.rnz.co.nz/international/pacific-news/410783/american-samoa-establishes-govt-taskforce-to-plan-for-coronavirus" target="_blank">appointed</a>&nbsp;a government taskforce&nbsp;to provide a plan for preparation and response to the covid-19 coronavirus.&nbsp;</li></ul>
    
    <ul><li>March 25, 2020 – The Governor <a href="https://6fe16cc8-c42f-411f-9950-4abb1763c703.filesusr.com/ugd/4bfff9_2d3c78a841824b8aafe05032f853585b.pdf">issued</a> an Executive Order 001 recognizing the Declared Public Health Emergency and State of Emergency, and imminent threat to public health. The order requires the immediate and comprehensive enforcement by the Commissioner of Public Safety, Director of Health, Attorney General, and other agency leaders.
    <ul>
    <li>Business are also required to provide necessary supplies to the public and are prohibited from price gouging.</li>
    </ul>
    </li></ul>
    </div></details></div>

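I want to extract the state, the date, and the text, and add them to a DataFrame with these three columns:

State: American Samoa
Date: 2020-03-25
Text: The Governor issued an Executive Order 001 recognizing the Declared Public Health Emergency and State of Emergency, and imminent threat to public health

My code so far: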

import bs4

# `data` holds the page's HTML source as a string
soup = bs4.BeautifulSoup(data, "html.parser")
for tag in soup.find_all("summary"):
    print("{0}: {1}".format(tag.name, tag.text))
    for tag1 in soup.find_all("li"):
        #print(type(tag1))
        ln = tag1.text
        dt = (ln.split(' – ')[0])
        dt = (dt.split('—')[0])
        #txt = ln.split(' – ')[1]
        print(dt)

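I need help with:

  1. How do I get the text only up to the first "."? I don't need the whole text.
  2. How do I add each result to the DataFrame as a new row while looping? (I have attached only part of the page's source code.)

Thanks for your help!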

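【Question comments】:

  • Can you check the accuracy of your HTML? For example, where do <details> and <div class="ab-accordion-text"> close?
  • They were not closed anywhere; I have updated the HTML code.
  • OK. If there are multiple items, does each one appear as another <summary> under the same <details>, or as another <details> block?
  • Yes, it is a huge web page that I am reading. Another section starts with California, followed by a link to covid19.ca.gov labeled "California Coronavirus Resource Page".
  • Can you post the URL of the page you are scraping?

Tags: html python-3.x dataframe web-scraping beautifulsoup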


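【Solution 1】:

To start with, I've added the code below. Unfortunately, the web page is not uniform in its use of HTML lists: some ul elements contain nested uls and others do not. This code is not perfect, only a starting point. For example, American Samoa has a mess of nested ul elements and therefore appears only once in the df.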

from bs4 import BeautifulSoup
import requests
import re
import pandas as pd

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0',
}
# You need to specify User Agent headers or else you get a 403
data = requests.get("https://www.nga.org/coronavirus-state-actions-all/", headers=HEADERS).text
soup = BeautifulSoup(data, 'lxml')
rows_list = []
for detail in soup.find_all("details"):
    state = detail.find('summary')
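    ul = detail.find('ul')
    if ul is None:
        continue  # skip sections that contain no list
    # recursive=False visits only the top-level items, not nested lists
    for li in ul.find_all('li', recursive=False):
        # Three types of dash are used on this webpage: hyphen, en dash, em dash
        split = re.split(r'[-–—]', li.text, maxsplit=1)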
        if len(split) == 2:
            rows_list.append([state.text, split[0], split[1]])
        else:
            print("Error", li.text)
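df = pd.DataFrame(rows_list, columns=['State', 'Date', 'Text'])
# max_colwidth=None shows full cell contents (-1 is rejected by recent pandas versions)
with pd.option_context('display.max_rows', None, 'display.max_columns', None, 'display.max_colwidth', None):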
    print(df)

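It creates and prints a DataFrame with 547 rows, and prints an error message for each piece of text that could not be split. You will have to work out exactly which data you need and how to adapt the code for your purposes.

If you do not have "lxml" installed, you can use "html.parser" instead.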
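One possible adaptation toward the State / Date / Text layout asked for in the question is sketched below. It uses only the standard library, and the helper name `parse_item` is hypothetical, not part of the answer's code. It assumes each item starts with an English month name, and it treats the first period followed by whitespace as the end of the first sentence, so abbreviations such as "U.S. " would cut the text short.

```python
import re
from datetime import date

# Month lookup keyed by the first three letters, lower-cased
MONTHS = {name: number for number, name in enumerate(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"], start=1)}

# Month name (full or abbreviated), day, optional comma, four-digit year
DATE_RE = re.compile(r"\s*([A-Za-z]{3,9})\.?\s+(\d{1,2}),?\s+(\d{4})")

# Characters that may sit between date and text:
# space, non-breaking space, hyphen, en dash, em dash
SEPARATORS = " \u00a0-\u2013\u2014"


def parse_item(text):
    """Return (iso_date, first_sentence), or None if no leading date is found."""
    m = DATE_RE.match(text)
    if not m:
        return None
    month, day, year = m.groups()
    month_number = MONTHS.get(month[:3].lower())
    if month_number is None:
        return None
    try:
        parsed = date(int(year), month_number, int(day))
    except ValueError:  # e.g. "Feb 31, 2020"
        return None
    rest = text[m.end():].lstrip(SEPARATORS)
    # Keep everything up to and including the first period followed by whitespace
    sentence = re.split(r"(?<=\.)\s", rest, maxsplit=1)[0].strip()
    return parsed.isoformat(), sentence


print(parse_item("March 25, 2020 – The Governor issued an order. It requires enforcement."))
# ('2020-03-25', 'The Governor issued an order.')
```

Inside the loop, a row could then be appended as `[state.text, *parsed]` whenever `parsed = parse_item(li.text)` is not `None`.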

Update: Another approach is to use a regular expression to match any string that starts with a date:

from bs4 import BeautifulSoup
import requests
import re
import pandas as pd

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0',
}
# You need to specify User Agent headers or else you get a 403
data = requests.get("https://www.nga.org/coronavirus-state-actions-all/", headers=HEADERS).text
soup = BeautifulSoup(data, 'html.parser')
rows_list = []
# Compile once, outside the loops: month name (full or abbreviated), day, optional comma, year
p = re.compile(r'(\s*(Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)\s*(\d{1,2}),*\s*(\d{4}))', re.IGNORECASE)
for detail in soup.find_all("details"):
    state = detail.find('summary')
    for li in detail.find_all('li'):
        m = p.match(li.text)
        if m:
            # Slice off the leading date; str.replace would also delete later repeats of it
            rows_list.append([state.text, m.group(0), m.string[m.end():]])
        else:
            print("Error", li.text)
df = pd.DataFrame(rows_list, columns=['State', 'Date', 'Text'])
df.to_csv('out.csv')

This yields many more records: 4,785. Again, this is a starting point; some data is still missed, but far less. It writes the data to a CSV file, out.csv.
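Because `find_all('li')` also visits nested items, the text of a parent `li` includes the text of every nested list inside it. The sketch below shows one way to keep only an item's own text. The helper name `own_text` and the shortened sample HTML are made up for illustration; this is not what the code above does.

```python
from bs4 import BeautifulSoup, Tag


def own_text(li):
    """Return the text of an <li>, skipping any nested <ul>/<ol> inside it."""
    parts = []
    for child in li.children:
        if isinstance(child, Tag):
            if child.name in ("ul", "ol"):
                continue  # nested list: its items are visited separately
            parts.append(child.get_text())
        else:
            parts.append(str(child))  # plain text node
    # Collapse runs of whitespace, including newlines, into single spaces
    return " ".join("".join(parts).split())


html = """<ul><li>March 25, 2020 – The Governor <a href="#">issued</a> an order.
<ul><li>Businesses must provide supplies.</li></ul>
</li></ul>"""

soup = BeautifulSoup(html, "html.parser")
outer = soup.find("li")
print(own_text(outer))
# March 25, 2020 – The Governor issued an order.
```

Nested items that do not start with a date would still be reported as errors by the loop above. Whether to attach them to the parent's row or drop them depends on which data you need.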

【Discussion】: