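【Posted】: 2021-11-06 03:14:07
【Question】:
I want to download the data from https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2004/index.html
So far I can get the links from the p tags, which are the links for each month. The challenge is that under each of those links there are 31 files (one per day). I tried several approaches found on Stack Overflow to get the h2 headings, including: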
import requests
from bs4 import BeautifulSoup

url = "https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2000/index.html"

# Fetch the page once and parse it with the built-in HTML parser
req = requests.get(url)
soup = BeautifulSoup(req.text, "html.parser")

# Attempt 1: the h2 headings (one per month)
headings = soup.find_all('h2')
print(headings)

# Attempt 2: every link on the page
print("The href links are :")
for link in soup.find_all('a'):
    print(link.get('href'))

# Attempt 3: only links that have an href and visible text
links_with_text = [a['href'] for a in soup.find_all('a', href=True) if a.text]
print(links_with_text)
This is the output (only the last one is pasted):
['https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2000/index.html',
'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2000/index.html#December',
'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2000/index.html#November',
'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2000/index.html#October',
'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2000/index.html#September',
'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2000/index.html#August',
'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2000/index.html#July',
'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2000/index.html#June',
'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2000/index.html#May',
'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2000/index.html#April',
'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2000/index.html#March',
'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2000/index.html#February',
'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2000/index.html#January']
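My problem is that these are only the anchors for the h2 headings, and I need the links under each h2 heading, which are stored in the a tags. The program above does return all the links, including the ones I need, but it would be great to get them in an organized way, or in any other way that makes it easier to store the data directly from the HTML site. I would appreciate any help. Thank you!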
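One possible way to get the links organized by month is to start at each h2 and walk forward through the document, collecting a tags until the next h2 begins. The sketch below assumes the daily links follow their month's h2 in document order, which has not been checked against the live NRC page; links_by_heading is a hypothetical helper name, not something from the question.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def links_by_heading(html, base_url):
    """Map each <h2> heading's text to the absolute URLs of the links after it.

    Assumption: the links for a section appear after its <h2> and before
    the next <h2> in document order.
    """
    soup = BeautifulSoup(html, "html.parser")
    grouped = {}
    for h2 in soup.find_all("h2"):
        links = []
        # find_all_next returns matching tags in document order after h2.
        for el in h2.find_all_next(["h2", "a"]):
            if el.name == "h2":
                break  # the next section starts here
            href = el.get("href")
            if href:
                # Resolve relative hrefs against the page URL.
                links.append(urljoin(base_url, href))
        grouped[h2.get_text(strip=True)] = links
    return grouped
```

Calling links_by_heading(req.text, url) on the fetched page would then return a dict keyed by heading text, which can be looped over month by month.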
【Comments】:
-
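Is there a particular reason you want to get the file names? The tables with the actual data seem to live at URLs of this pattern: nrc.gov/reading-rm/doc-collections/event-status/reactor-status/…. You could easily build these URLs yourself, since the only thing that changes in the file name is the year/month/day.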
-
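@drnugent Yes, that is the reason. There is a similar file for every day, which would make the file of URLs itself too large. Is there any way I can download them automatically, since we also need the data for several other years?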
-
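Yes, you can write a loop over all the dates from the start of the dataset to its end, use each year/month/day to generate the file name in the given format, and then run requests.get() on that URL.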
-
@drnugent That helps! Thanks
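The date loop suggested in the comments could look roughly like the sketch below. The real URL pattern is truncated in the comment above, so TEMPLATE is a made-up placeholder on example.com and must be replaced with the actual file name format; daily_urls and download_all are hypothetical names used only for illustration.

```python
from datetime import date, timedelta
from pathlib import Path

# Placeholder pattern, NOT the real NRC file name format.
TEMPLATE = "https://example.com/reactor-status/{year}/{year}{month:02d}{day:02d}.html"


def daily_urls(start, end, template):
    """Yield (day, url) for every calendar day from start to end, inclusive."""
    day = start
    while day <= end:
        yield day, template.format(year=day.year, month=day.month, day=day.day)
        day += timedelta(days=1)


def download_all(start, end, template, out_dir="."):
    """Fetch each daily page and save it as YYYY-MM-DD.html in out_dir."""
    # Third-party package, imported here so daily_urls works without it.
    import requests

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for day, url in daily_urls(start, end, template):
        resp = requests.get(url, timeout=30)
        if resp.ok:  # skip days for which no file exists
            (out / f"{day.isoformat()}.html").write_text(resp.text, encoding="utf-8")
```

For example, download_all(date(2004, 1, 1), date(2004, 12, 31), TEMPLATE, "data_2004") would attempt one request per day of 2004. Using datetime for the loop means month lengths and leap years are handled without hard-coding 31 days.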
Tags: python html web-scraping beautifulsoup