Beginner - Web scraping - Download data

【Question title】: Beginner - Web scraping - Download data
【Posted】: 2021-11-06 03:14:07
【Question description】:

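I want to download this data: https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2004/index.html

So far I have been able to get the links from the p tags; these are the links for each month. The challenge is that under each month there are 31 files (one per day). I tried several approaches from Stack Overflow to get the h2 headings and the links: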

import requests
from bs4 import BeautifulSoup

url = "https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2000/index.html"

# Download the index page once and parse it
req = requests.get(url)
soup = BeautifulSoup(req.text, "html.parser")

# Attempt 1: the h2 headings (the month names)
headings = soup.find_all("h2")
print(headings)

# Attempt 2: every href on the page
print("The href links are :")
for link in soup.find_all("a"):
    print(link.get("href"))

# Attempt 3: only the links that have visible text
links_with_text = [a["href"] for a in soup.find_all("a", href=True) if a.text]
print(links_with_text)

This is the output (only the last output is pasted):

['https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2000/index.html',
 'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2000/index.html#December',
 'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2000/index.html#November',
 'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2000/index.html#October',
 'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2000/index.html#September',
 'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2000/index.html#August',
 'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2000/index.html#July',
 'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2000/index.html#June',
 'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2000/index.html#May',
 'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2000/index.html#April',
 'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2000/index.html#March',
 'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2000/index.html#February',
 'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2000/index.html#January'

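My problem is that these are the h2 headings, and I need the links under each h2 heading, which are stored in the a tags. The program above does give me all the links, including the ones I need, but it would be great if I could get them in an organized way, or in any other way that makes it easier to store the data directly from the HTML site. I would appreciate any help. Thank you!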

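【Question discussion】:

Is there a particular reason you want to get the file names? The tables with the actual data seem to live at URLs of this pattern: nrc.gov/reading-rm/doc-collections/event-status/reactor-status/…. You can easily build these URLs by hand, since the only thing that changes in the file name is the year/month/day.
@drnugent Yes, that is the reason. There are similar files for every day, so the files under each URL add up to too much. Is there a way I can download them automatically? We need the data for several other years as well.
Yes, you can write a loop over all dates from the start of the dataset to its end, generate the file name in the given format from each year/month/day, and then run requests.get() on that URL.
@drnugent That helps! Thanks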
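
The loop suggested in the comments might look roughly like the sketch below. It is not the commenter's own code. It assumes the file-name pattern `YYYYMMDDps.html` seen in the links listed in the solution further down, and it assumes a page may be missing for some days, in which case that day is skipped. The helper names `daily_urls` and `download_year` and the output directory are hypothetical, and the standard-library `urllib.request` is used here in place of `requests.get()`.

```python
import urllib.error
import urllib.request
from datetime import date, timedelta
from pathlib import Path

BASE = "https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status"


def daily_urls(year):
    """Build one URL per calendar day of `year` (assumed pattern: YYYYMMDDps.html)."""
    day = date(year, 1, 1)
    urls = []
    while day.year == year:
        urls.append(f"{BASE}/{year}/{day:%Y%m%d}ps.html")
        day += timedelta(days=1)
    return urls


def download_year(year, out_dir="reactor_status"):
    """Fetch every daily page of `year` into out_dir/<year>/ (hypothetical layout)."""
    target = Path(out_dir) / str(year)
    target.mkdir(parents=True, exist_ok=True)
    for url in daily_urls(year):
        file_name = url.rsplit("/", 1)[-1]
        try:
            with urllib.request.urlopen(url) as response:
                (target / file_name).write_bytes(response.read())
        except urllib.error.URLError as err:
            # HTTPError (e.g. 404 for a day without a report) is a URLError subclass
            print(f"skipped {url}: {err}")
```

Calling `download_year(2004)` would then try every day of 2004 in turn; `daily_urls` can be checked on its own without touching the network.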

Tags: python html web-scraping beautifulsoup

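【Solution 1】:

As you can see, the h2 tags hold no data themselves. So find all the h2 tags and loop over them; for each one, use the find_next method to get the p tag that follows it.

Next we need all the a tags inside that p tag, so we use the find_all method. I did this in a single line of code, and it returns the list of links.

Then we iterate over that list and extract only the href part. There is one catch: the href is not correct. It contains something like 20001231ps.html, but we need 20041231ps.html, which is why I replace part of the string and append the result to the base URL.

I used dict1, with the month as the key and the list of links as the value, so the links are easy to extract.

Code: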

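# `soup` is the BeautifulSoup object built from the index page, as in the question
months = soup.find_all("h2")
dict1 = {}
main_url = "https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2004"
for month in months:
    # Daily links sit in the <p> right after each month heading; swap the year in each href
    dict1[month.text] = [main_url + "/" + link["href"].replace("2000", "2004") for link in month.find_next("p").find_all("a")]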

Output:

{'December': ['https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2004/20041231ps.html',
  'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2004/20041230ps.html',
  'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2004/20041229ps.html',
  'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2004/20041228ps.html',
  'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2004/20041227ps.html',
  'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2004/20041226ps.html',
  'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2004/20041225ps.html',
  'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2004/20041224ps.html',
.....
]}
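
The h2 → find_next("p") → find_all("a") traversal can also be tried offline on a small hand-written fragment. The sketch below is only an illustration: `SAMPLE_HTML` is a hypothetical fragment that mimics the assumed page layout, `links_by_month` is a made-up helper name, and `urljoin` is swapped in for the string concatenation used above to resolve relative hrefs (it does not perform the "2000" to "2004" replacement).

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical fragment: each month heading is followed by a <p> of daily links
SAMPLE_HTML = """
<h2>December</h2>
<p><a href="20041231ps.html">31</a> <a href="20041230ps.html">30</a></p>
<h2>November</h2>
<p><a href="20041130ps.html">30</a></p>
"""

PAGE_URL = "https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2004/index.html"


def links_by_month(html, page_url):
    """Map each h2 heading text to the absolute links in the <p> that follows it."""
    soup = BeautifulSoup(html, "html.parser")
    result = {}
    for heading in soup.find_all("h2"):
        paragraph = heading.find_next("p")
        if paragraph is None:
            # Heading with no following paragraph: nothing to collect
            continue
        result[heading.get_text(strip=True)] = [
            urljoin(page_url, a["href"]) for a in paragraph.find_all("a", href=True)
        ]
    return result


links = links_by_month(SAMPLE_HTML, PAGE_URL)
for month, urls in links.items():
    print(month, len(urls))
```

Resolving each href against the page URL avoids hard-coding the directory, so the same helper should work for another year's index page as long as it keeps the same heading-then-paragraph layout.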

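【Discussion】:

This is what I was looking for, thank you! It was very helpful! @bhavya-parikh
If possible, you can upvote or accept this as the answer!!
I am trying, but it won't let me; I am not that active here. Once I have the points I will definitely upvote it myself. Thanks!
It will return all the months along with the daily links; I only gave you a sample of the output.
Yes, it worked! Much appreciated!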