【Posted】: 2020-08-21 19:39:19
【Question】:
Here is the HTML:
<div class="wp-block-atomic-blocks-ab-accordion ab-block-accordion ab-font-size-18"><details><summary class="ab-accordion-title"><strong>American Samoa</strong></summary><div class="ab-accordion-text">
<ul><li><strong><a href="https://www.americansamoa.gov/covid-19-advisories" target="_blank" rel="noreferrer noopener" aria-label="American Samoa Department of Health Travel Advisory (opens in a new tab)">American Samoa Department of Health Travel Advisory</a></strong></li><li>March 2, 2020—Governor Moliga <a rel="noreferrer noopener" href="https://www.rnz.co.nz/international/pacific-news/410783/american-samoa-establishes-govt-taskforce-to-plan-for-coronavirus" target="_blank">appointed</a> a government taskforce to provide a plan for preparation and response to the covid-19 coronavirus. </li></ul>
<ul><li>March 25, 2020 – The Governor <a href="https://6fe16cc8-c42f-411f-9950-4abb1763c703.filesusr.com/ugd/4bfff9_2d3c78a841824b8aafe05032f853585b.pdf">issued</a> an Executive Order 001 recognizing the Declared Public Health Emergency and State of Emergency, and imminent threat to public health. The order requires the immediate and comprehensive enforcement by the Commissioner of Public Safety, Director of Health, Attorney General, and other agency leaders.
<ul>
<li>Business are also required to provide necessary supplies to the public and are prohibited from price gouging.</li>
</ul>
</li></ul>
</div></details></div>
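I want to extract the state, date, and text, and add them to a dataframe with these three columns:
State: American Samoa
Date: 2020-03-25
Text: The Governor issued an Executive Order 001 recognizing the Declared Public Health Emergency and State of Emergency, and imminent threat to public health
My code so far: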
import bs4

# `data` holds the page's HTML source as a string
soup = bs4.BeautifulSoup(data, "html.parser")
for tag in soup.find_all("summary"):
    print("{0}: {1}".format(tag.name, tag.text))
for tag1 in soup.find_all("li"):
    # print(type(tag1))
    ln = tag1.text
    dt = ln.split(' – ')[0]
    dt = dt.split('—')[0]
    # txt = ln.split(' – ')[1]
    print(dt)
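Help needed:
- How do I get the text only up to the first "."? I don't need the whole text.
- How do I add each result to the dataframe as a new row while looping? (I have attached only part of the web page's source.)
Thanks for your help!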
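For the first question, one possible approach is to split each list item's text into a leading date and the first sentence of what follows. The sketch below is only an illustration: `ITEM_RE` and `split_item` are hypothetical names, the date format is assumed to always look like "March 25, 2020" (parsed under an English locale), and "first sentence" is approximated as everything before the first period followed by a space, which will cut too early on abbreviations such as "U.S. ".

```python
import re
from datetime import datetime

# Assumption: an item starts with a date like "March 25, 2020", followed by
# a hyphen, en dash (\u2013), or em dash (\u2014), then the description.
ITEM_RE = re.compile(
    r"^\s*([A-Z][a-z]+ \d{1,2}, \d{4})\s*[-\u2013\u2014]\s*(.*)$", re.DOTALL
)


def split_item(text):
    """Return (iso_date, first_sentence), or None if there is no leading date."""
    match = ITEM_RE.match(text)
    if match is None:
        return None
    # %B expects a full English month name under the default C/English locale.
    iso_date = datetime.strptime(match.group(1), "%B %d, %Y").date().isoformat()
    # Collapse newlines and repeated spaces left over from the HTML layout.
    body = " ".join(match.group(2).split())
    # Naive sentence boundary: the first period that is followed by a space.
    first_sentence = body.split(". ", 1)[0].rstrip(".")
    return iso_date, first_sentence
```

Items without a leading date, such as the travel advisory link, return `None`, so the caller can skip them.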
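【Comments】:
-
Can you check the accuracy of your HTML? For example, where are <details> and <div class="ab-accordion-text"> closed?
-
They were not closed anywhere; I have updated the HTML code.
-
Good. If there are multiple items, do they appear as another <summary> under the same <details>, or as a separate set of <details>?
-
Yes, I am reading one huge web page. The next section starts with California, and its first link is the California Coronavirus Resource Page (covid19.ca.gov).
-
Can you post the URL of the page you are trying to scrape?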
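For the second question, a common pattern is to collect one dict per row in a plain list while looping and to build the dataframe once at the end, rather than growing it inside the loop. The sketch below assumes each state sits in its own `<details>` block with the state name in `<summary>`, which the thread does not confirm explicitly. The `html` sample is a shortened, hypothetical stand-in for the real page (the California item is a placeholder), and nested sub-items are skipped on the assumption that they belong to their parent entry.

```python
import bs4
import pandas as pd

# Hypothetical, abbreviated sample; the real page is assumed to repeat this
# <details>/<summary> pattern once per state.
html = """
<details><summary><strong>American Samoa</strong></summary><div class="ab-accordion-text">
<ul><li>March 2, 2020—Governor Moliga appointed a government taskforce.</li></ul>
<ul><li>March 25, 2020 – The Governor issued an Executive Order 001.
<ul><li>Business are also required to provide necessary supplies.</li></ul>
</li></ul>
</div></details>
<details><summary><strong>California</strong></summary><div class="ab-accordion-text">
<ul><li>Placeholder item for a second state.</li></ul>
</div></details>
"""

soup = bs4.BeautifulSoup(html, "html.parser")
rows = []
for details in soup.find_all("details"):
    state = details.summary.get_text(strip=True)
    for li in details.find_all("li"):
        if li.find_parent("li") is not None:
            continue  # nested sub-item: skipped, it belongs to its parent entry
        parts = []
        for child in li.children:
            if isinstance(child, bs4.Tag):
                if child.name != "ul":  # leave nested lists out of the parent text
                    parts.append(child.get_text())
            else:
                parts.append(str(child))
        text = " ".join("".join(parts).split())  # normalize whitespace
        rows.append({"state": state, "text": text})

# Build the dataframe once, after the loop has collected every row.
df = pd.DataFrame(rows, columns=["state", "text"])
print(df)
```

Each row here keeps the full item text; splitting it into a date column and a first-sentence column can be done per row before appending to `rows`.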
Tags: html python-3.x dataframe web-scraping beautifulsoup