Use URLs scraped from a webpage with BeautifulSoup: answers

【Question title】: Use URLs scraped from a webpage with BeautifulSoup
【Posted】: 2018-09-02 20:31:28
【Question】:

As the title says, I have scraped the webpage I am interested in and saved the URLs in a variable.

import requests
from bs4 import BeautifulSoup

for pagenumber in range(1, 2):
    url = 'https://www.congress.gov/search?q=%7B%22source%22%3A%22legislation%22%2C%22congress%22%3A%22112%22%7D&page={}'.format(pagenumber)
    res = requests.get(url, headers = {'User-agent': 'Chrome'})

soup = BeautifulSoup(res.text, 'html.parser')
lists = soup.find_all("li", {"class" : "expanded"})

for bill in lists:
    block = bill.find("span", {"class":"result-item"})
    link_cosponsors = block.find_all("a")[1]['href'] # I am interested in the second URL

The last line is the one that gives me the list of URLs. Now I am struggling to visit each URL and scrape new information from each one.

for url in link_cosponsors:

    soup_cosponsor = BeautifulSoup(requests.get(url).text, 'html.parser')
    table = soup.find('table', {'class':'item_table'})

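I think the problem lies in how link_cosponsors is created: the first element of the list is not the full "https://etc" but only "h", because I get the error "Invalid URL 'h': No schema supplied. Perhaps you meant http://h?". I have tried appending the links to a list, but that did not work either.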

【Question comments】:

Tags: python url beautifulsoup

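【Solution 1】:

The problem is that you reassign link_cosponsors on every iteration of the for loop. As a result, the variable holds only the last link you found, as a single string.

What happens next is that your for url in link_cosponsors loop iterates over that string character by character. It is essentially doing this: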

for letter in 'http://the.link.you.want/foo/bar':
    print(letter)
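
To make the difference concrete, here is a minimal sketch that needs no network access. The two example.com addresses are made-up placeholders, not links from the actual page; the point is only to contrast reassigning a variable with appending to a list.

```python
# Hypothetical hrefs standing in for the links found on the search page.
hrefs = ["https://example.com/bill/1", "https://example.com/bill/2"]

# Reassigning: after the loop the variable holds only the last href, a str.
for href in hrefs:
    link_cosponsors = href

# Iterating over a str yields single characters, so the first "URL" is 'h'.
first_items = [url for url in link_cosponsors][:5]
print(first_items)  # ['h', 't', 't', 'p', 's']

# Appending: the list keeps one complete URL per result.
collected = []
for href in hrefs:
    collected.append(href)
print(collected)  # ['https://example.com/bill/1', 'https://example.com/bill/2']
```

Passing 'h' to requests.get is what produces the "No schema supplied" error quoted in the question.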

Solution: replace the last 3 lines of the first snippet with:

link_cosponsors = []
for bill in lists:
    block = bill.find("span", {"class":"result-item"})
    link_cosponsors.append(block.find_all("a")[1]['href'])
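
Putting the pieces together, the whole flow might look like the sketch below. It rests on several assumptions: the function names collect_cosponsor_links and find_tables are invented for illustration, the sample markup is made up to mimic the structure described in the question, and urljoin is added only as a precaution in case some href values are relative. The HTTP getter is passed in as a parameter (for example requests.get) so the sketch can be tried without touching the network. Note also that the second loop in the question calls soup.find, which searches the search-results page again; the sketch searches the newly parsed page instead.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def collect_cosponsor_links(html, base_url):
    """Return the second link of every result item as an absolute URL."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for bill in soup.find_all("li", {"class": "expanded"}):
        block = bill.find("span", {"class": "result-item"})
        if block is None:
            continue  # skip entries that lack the expected span
        anchors = block.find_all("a")
        if len(anchors) < 2:
            continue  # skip entries without a second link
        links.append(urljoin(base_url, anchors[1]["href"]))
    return links


def find_tables(links, get):
    """Fetch each URL with `get` (e.g. requests.get) and return its item_table."""
    tables = []
    for url in links:
        page = BeautifulSoup(get(url).text, "html.parser")
        # Search the page that was just fetched, not the search-results soup.
        tables.append(page.find("table", {"class": "item_table"}))
    return tables


# Made-up markup; the paths are placeholders, not real congress.gov URLs.
sample_html = """
<ol>
  <li class="expanded"><span class="result-item">
    <a href="/bill/1">Bill 1</a><a href="/bill/1/cosponsors">Cosponsors</a>
  </span></li>
  <li class="expanded"><span class="result-item">
    <a href="/bill/2">Bill 2</a><a href="https://www.congress.gov/bill/2/cosponsors">Cosponsors</a>
  </span></li>
</ol>
"""

links = collect_cosponsor_links(sample_html, "https://www.congress.gov/")
print(links)
```

With real pages, something like find_tables(links, requests.get) would perform the second round of scraping; an entry is None when a page has no table with class item_table.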

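【Comments】:

I have already tried that (and just tried again), but what happens is that (say there are 100 URLs on the page) it first appends the first URL, then the first and second URLs, then the first, second and third, and so on. So the first URL ends up repeated 100 times. I don't know why it doesn't work.
Hmm, it works here: the result is 100 different URLs.
Are you sure? I still get 100 items, the first of length 1, the second of length 2 ... the 100th of length 100, which means the first URL is stored 100 times, the second 99 times, and so on (you can easily check by printing len(link_cosponsors) or printing link_cosponsors). Maybe you could share a screenshot of your output?
@GildaRomano Could you share your whole script? You can use a service such as pastebin.com for that and share the link here.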
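
The thread does not show the script that produced the growing output, so the cause stays unknown. One possible explanation, offered purely as an assumption, is that the list was printed (or copied) inside the loop, which shows it at every intermediate size even though the final list holds each URL once. A small sketch with placeholder URLs:

```python
# Placeholder URLs; the real ones would come from the scraped page.
urls = ["https://example.com/1", "https://example.com/2", "https://example.com/3"]

collected = []
snapshots = []
for url in urls:
    collected.append(url)
    # A copy of the list at this point: what print(collected) placed
    # inside the loop would display on each iteration.
    snapshots.append(list(collected))

print([len(s) for s in snapshots])         # [1, 2, 3]
print(len(collected), len(set(collected)))  # 3 3
```

Checking len(link_cosponsors) and len(set(link_cosponsors)) once, after the loop has finished, would tell the two situations apart.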