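How to scrape multiple URLs effectively in BS4

【Question title】: How to scrape multiple URLs effectively in BS4
【Posted】: 2018-04-03 20:05:09
【Question】:

I'm trying to find an efficient way to scrape multiple pages with BS4. I can scrape the first page easily and get all the data I need, but unfortunately not all of the data is on it. There are 2 more pages to scrape, and rather than hard-coding and changing the URL for the second and third pages, I'd like to know whether there is a more elegant way to do this in Python with BS4. The only part of the URL that needs to change is page=1, to the corresponding page number (1, 2, 3).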

import csv 
import requests
from bs4 import BeautifulSoup


url = "https://www.congress.gov/members?q={%22congress%22:%22115%22}&pageSize=250&page=1"

headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, "html.parser")

names = soup.find_all()

items = soup.find_all("li","expanded")
for item in items:
    print(item.text)
    print(item.find("a"))
    with open('web.csv', 'a') as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow([item.find("a").encode('utf-8')])

【Comments】:

Tags: python python-3.x web-scraping beautifulsoup

【Solution 1】:

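This is one of the general difficulties of web scraping. BS4 cannot generate the logic for you to crawl URLs elegantly, nor can it predict where on a site the data you need will live. Every website is different and follows its own rules on the back end.

The best you can do is study the site itself, try to identify patterns, and extract the URLs dynamically from what is on the page. How elegant that logic turns out is up to you, and it depends heavily on the site you are scraping.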
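
To make "extract the URLs dynamically" a bit more concrete, here is a minimal sketch that looks for a "next page" link in the HTML and resolves it against the current URL. The `a.next` selector, the sample markup, and the `find_next_url` helper are all assumptions for illustration; the real site's pagination markup has to be inspected to find the right selector.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def find_next_url(html, current_url):
    """Return the absolute URL of the 'next page' link, or None if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    # "a.next" is an assumed selector, not taken from the real site.
    link = soup.select_one("a.next")
    if link is None or not link.get("href"):
        return None
    # Relative hrefs are resolved against the page they were found on.
    return urljoin(current_url, link["href"])


# Hypothetical pagination markup, used only to exercise the helper.
sample_html = '<div class="pagination"><a class="next" href="/members?page=2">Next</a></div>'
print(find_next_url(sample_html, "https://example.com/members?page=1"))
```

A scraper built this way keeps requesting whatever `find_next_url` returns and stops when it gets `None`, so no page count has to be hard-coded.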

【Comments】:

【Solution 2】:

Iterate over the page numbers. itertools.count comes in handy:

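import itertools

base_url = "https://www.congress.gov/members?q={%22congress%22:%22115%22}&pageSize=250&page="

# count() never stops on its own, so the loop body has to break
# explicitly, e.g. once a page yields no new results.
for index in itertools.count(start=1):
    url = base_url + str(index)

    # the rest of your code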
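
Since `itertools.count` is infinite, the loop needs its own exit. The following is only a sketch of one way to do that: `collect_pages`, `fetch_items`, and `max_pages` are hypothetical names, and the fake fetcher stands in for the real requests/BS4 code. It stops on an empty page, on a page identical to the previous one (some sites keep returning data for out-of-range page numbers), or at a safety cap.

```python
import itertools


def collect_pages(fetch_items, max_pages=50):
    """Call fetch_items(page) for page = 1, 2, ... until the results run out."""
    collected = []
    previous = None
    for page in itertools.count(start=1):
        if page > max_pages:
            break  # safety cap so a misbehaving site cannot loop forever
        items = fetch_items(page)
        # Stop on an empty page or on a page that repeats the previous one.
        if not items or items == previous:
            break
        collected.extend(items)
        previous = items
    return collected


# Stand-in for a real scraper: three pages of data, then the last page repeats.
fake_site = {1: ["a", "b"], 2: ["c", "d"], 3: ["e"]}


def fake_fetch(page):
    return fake_site.get(page, fake_site[3])


print(collect_pages(fake_fetch))  # ['a', 'b', 'c', 'd', 'e']
```

In real use, `fetch_items` would wrap the `requests.get` and `BeautifulSoup` calls and return the list of extracted values for that page.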

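【Comments】:

【Solution 3】:

There are several ways to do this. In this case, the better approach is to use the last page number as the upper bound of the range. The site splits the results across three pages, so the highest page number is 3. However, if you request https://www.congress.gov/members?q=%7B%22congress%22%3A%22115%22%7D&pageSize=250&page=5, you will see that the site still returns data even though the results were exhausted on page 3. So you should define the last page number explicitly here (plus 1, because range excludes its end value).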

import requests
from bs4 import BeautifulSoup

my_url = "https://www.congress.gov/members?q=%7B%22congress%22%3A%22115%22%7D&pageSize=250&page={}"
for link in [my_url.format(page) for page in range(1,4)]:
    res = requests.get(link,headers={"User-Agent":"Mozilla/5.0"})
    soup = BeautifulSoup(res.text, "lxml")
    for item in soup.select(".expanded"):
        name = item.select_one(".result-heading a").text
        print(name)
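
As a variation on the hand-written format string, the page URLs can also be built with the standard library so that the percent-encoding is not typed by hand. This is just a sketch; `page_urls` and its parameters are names made up for this example.

```python
import json
from urllib.parse import urlencode

BASE_URL = "https://www.congress.gov/members"


def page_urls(last_page, congress="115", page_size=250):
    """Build one search URL per page, from page 1 through last_page inclusive."""
    # Compact JSON, so the q parameter encodes to %7B%22congress%22%3A%22115%22%7D.
    query = json.dumps({"congress": congress}, separators=(",", ":"))
    urls = []
    for page in range(1, last_page + 1):  # range() excludes its end value
        params = {"q": query, "pageSize": page_size, "page": page}
        urls.append(BASE_URL + "?" + urlencode(params))
    return urls


for url in page_urls(3):
    print(url)
```

Each resulting URL can then be passed to `requests.get` exactly as in the loop above.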

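【Comments】:

【Solution 4】:

If you know the index of the last page, just iterate as suggested in the answers above. If the last page index is unknown, use a while loop with logic that decides whether to go on to the next page.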

import csv 
import requests
from bs4 import BeautifulSoup


url = "https://www.congress.gov/members?q={%22congress%22:%22115%22}&pageSize=250&page="
headers = {'User-Agent': 'Mozilla/5.0'}
pageId = 0

while True:
    pageId = pageId + 1
    print ("Processing page " + str(pageId))
    response = requests.get(url+str(pageId), headers=headers)
    soup = BeautifulSoup(response.text, "html.parser")

    # Stop as soon as a page contains no result items.
    items = soup.find_all("li", "expanded")
    if not items:
        break
    for item in items:
        print(item.text)
        print(item.find("a"))
        with open('web.csv', 'a') as csv_file:
            writer = csv.writer(csv_file)
            writer.writerow([item.find("a").encode('utf-8')])
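
The loop above reopens web.csv for every single item. A possible alternative, sketched here with placeholder data, is to open the file once and write all rows through one writer. `write_names`, the `name` header, and the sample values are assumptions for illustration, not part of the original answer.

```python
import csv
import os
import tempfile


def write_names(path, pages):
    """Write every name from every page to a CSV file and return the row count."""
    count = 0
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(["name"])  # header row
        for names in pages:
            for name in names:
                writer.writerow([name])
                count += 1
    return count


# Placeholder data standing in for the text scraped from each page.
scraped_pages = [["Name A", "Name B"], ["Name C"]]
out_path = os.path.join(tempfile.mkdtemp(), "web.csv")
print(write_names(out_path, scraped_pages), "rows written")
```

Writing plain strings (for example the link text) rather than encoded bytes also avoids `b'...'` values ending up in the file under Python 3.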

【Comments】: