How to Scrape Multiple URLs with Selenium and BeautifulSoup4 in Python

【Question title】: How To Scrape Multiple URLs With Selenium and BeautifulSoup4 In Python
【Posted】: 2020-05-16 20:55:45
【Question description】:

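I've been looking into several different solutions for scraping multiple URLs with Selenium, BS4, and UserAgent. So far I've been able to scrape a single URL and extract exactly what I want. I only run into trouble when more than one URL is involved.

Currently, the code below scrapes the first page. If you change the urls argument to just url, uncomment the url variable, get rid of for url in urls, and remove the indentation of the for content in sel_soup loop, you'll see what I mean.

I'd like to create a loop that scrapes, to start, just 2 web pages. Once I've confirmed that it can loop through those 2 pages, I can append the other URLs I have to the list.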

import requests
from bs4 import BeautifulSoup
import re
import csv
from fake_useragent import UserAgent
from selenium import webdriver

urls = ["https://www.guitarcenter.com/Used/Bass.gc#pageName=used-page&N=18171+1076&Nao=0&recsPerPage=90&postalCode=02494&radius=100&profileCountryCode=US&profileCurrencyCode=USD","https://www.guitarcenter.com/Used/Bass.gc#pageName=used-page&N=18171+1076&Nao=90&recsPerPage=90&postalCode=02494&radius=100&profileCountryCode=US&profileCurrencyCode=USD","https://www.guitarcenter.com/Used/Bass.gc#pageName=used-page&N=18171+1076&Nao=180&recsPerPage=90&postalCode=02494&radius=100&profileCountryCode=US&profileCurrencyCode=USD"]
#url = "https://www.guitarcenter.com/Used/Bass.gc#pageName=used-page&N=18171+1076&Nao=180&recsPerPage=90&postalCode=02494&radius=100&profileCountryCode=US&profileCurrencyCode=USD"

user_agent = UserAgent()

for url in urls:

    web_r = requests.get(urls)
    web_soup = BeautifulSoup(web_r.text,"html.parser")

    #print(web_soup.findAll("li", class_="product-container")) #finding all of the grid items on the url above - price, photo, image, details and all
    #print(len(web_soup.findAll("li", class_="product-container"))) #printing out the length of the

    driver = webdriver.Firefox()
    driver.get(urls)
    html = driver.execute_script("return document.documentElement.outerHTML") #whats inside of this is a javascript call to get the outer html content of the page
    sel_soup = BeautifulSoup(html, "html.parser")


    for content in sel_soup.findAll("li", class_="product-container"):
        #print(content)

        bass_name = content.find("div", class_="productTitle").text.strip() #pulls the bass guitar name
        print(bass_name)

        prices_new = []
        for i in content.find("span", class_="productPrice").text.split("$"):
            prices_new.append(i.strip())
        bp = prices_new[1]
        print(bp)

【Question comments】:

Tags: python selenium web-scraping beautifulsoup

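【Solution 1】:

In your for loop, each iteration over urls gives you a single url, which is the one you are trying to scrape.

However, in the calls to requests.get and driver.get inside the loop, you are passing urls, which is not a single URL string but the entire list. Try changing urls to url inside the loop block.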
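
The fix can be sketched roughly as follows. This is a minimal sketch, not the asker's final code: scrape_pages, make_driver, and parse are hypothetical names introduced for illustration. It assumes make_driver returns a Selenium WebDriver (for example webdriver.Firefox) and that parse takes an HTML string and returns a list of scraped items. As in the question, a new browser is opened for each URL. The sketch additionally quits it when done.

```python
def scrape_pages(urls, make_driver, parse):
    """Visit each URL in `urls` and collect whatever `parse` extracts.

    All three parameter names are hypothetical:
      urls        -- list of URL strings
      make_driver -- callable returning a WebDriver-like object
                     (assumed to offer get(), page_source and quit())
      parse       -- callable taking an HTML string, returning a list
    """
    results = []
    for url in urls:
        driver = make_driver()
        try:
            # Pass the loop variable `url` (one string),
            # not `urls` (the whole list).
            driver.get(url)
            results.extend(parse(driver.page_source))
        finally:
            # Close the browser even if parsing raises.
            driver.quit()
    return results
```

With Selenium installed, this could be called as scrape_pages(urls, webdriver.Firefox, parse), where parse wraps the BeautifulSoup loop from the question. The same one-word change (urls to url) applies to the requests.get call.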

【Comments】:

That did it! Thank you so much!