【Question title】: How To Scrape Multiple URLs With Selenium and BeautifulSoup4 In Python
【Posted】: 2020-05-16 20:55:45
【Question】:

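I have been looking into a few different solutions for scraping multiple URLs with Selenium, BS4 and UserAgent. So far I have been able to scrape a single URL and extract exactly what I want. The trouble starts only when more than one URL is involved.

At the moment, the code below is meant to scrape the first page. If you change the urls argument to just url, uncomment the url variable, get rid of the for url in urls line and remove the indentation of the for content in sel_soup loop, you will see what I mean.

I would like to create a loop that scrapes just 2 web pages to start with. Once I have confirmed that it can loop through those 2 pages, I can append the other URLs I have to the list.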

import requests
from bs4 import BeautifulSoup
import re
import csv
from fake_useragent import UserAgent
from selenium import webdriver

urls = ["https://www.guitarcenter.com/Used/Bass.gc#pageName=used-page&N=18171+1076&Nao=0&recsPerPage=90&postalCode=02494&radius=100&profileCountryCode=US&profileCurrencyCode=USD","https://www.guitarcenter.com/Used/Bass.gc#pageName=used-page&N=18171+1076&Nao=90&recsPerPage=90&postalCode=02494&radius=100&profileCountryCode=US&profileCurrencyCode=USD","https://www.guitarcenter.com/Used/Bass.gc#pageName=used-page&N=18171+1076&Nao=180&recsPerPage=90&postalCode=02494&radius=100&profileCountryCode=US&profileCurrencyCode=USD"]
#url = "https://www.guitarcenter.com/Used/Bass.gc#pageName=used-page&N=18171+1076&Nao=180&recsPerPage=90&postalCode=02494&radius=100&profileCountryCode=US&profileCurrencyCode=USD"

user_agent = UserAgent()

for url in urls:

    web_r = requests.get(urls)
    web_soup = BeautifulSoup(web_r.text,"html.parser")

        #print(web_soup.findAll("li", class_="product-container")) #finding all of the grid items on the url above - price, photo, image, details and all
        #print(len(web_soup.findAll("li", class_="product-container"))) #printing out the length of the

    driver = webdriver.Firefox()
    driver.get(urls)
    html = driver.execute_script("return document.documentElement.outerHTML") #whats inside of this is a javascript call to get the outer html content of the page
    sel_soup = BeautifulSoup(html, "html.parser")


    for content in sel_soup.findAll("li", class_="product-container"):
            #print(content)

        bass_name = content.find("div", class_="productTitle").text.strip() #pulls the bass guitar name
        print(bass_name)

        prices_new = []
        for i in content.find("span", class_="productPrice").text.split("$"):
            prices_new.append(i.strip())
        bp = prices_new[1]
        print(bp)

【Discussion】:

    Tags: python selenium web-scraping beautifulsoup


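    【Solution 1】:

    In your for loop, each iteration over urls gives you a single url, which is the one page you want to scrape on that pass.

    However, in the calls to requests.get and driver.get inside the loop you are passing urls, which is not a single URL string but the whole list. Try changing urls to url inside the loop body.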
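The fix can be sketched roughly as follows. This is a minimal sketch, not the asker's final code: the function names fetch_with_selenium, parse_products and scrape_all are hypothetical, driver.page_source stands in for the execute_script call in the question, and the requests.get call is left out because its result was not used by the parsing loop. The CSS class names are taken from the question and are assumed to still match the site.

```python
from typing import Callable, Iterable, List, Tuple


def fetch_with_selenium(url: str) -> str:
    """Load ONE page in Firefox and return its rendered HTML."""
    # Imported lazily so the loop below can be exercised without a browser.
    from selenium import webdriver

    driver = webdriver.Firefox()
    try:
        driver.get(url)  # a single URL string, not the whole list
        return driver.page_source
    finally:
        driver.quit()  # close the browser even if loading fails


def parse_products(html: str) -> List[Tuple[str, str]]:
    """Extract (name, price) pairs using the selectors from the question."""
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for content in soup.find_all("li", class_="product-container"):
        title = content.find("div", class_="productTitle")
        price = content.find("span", class_="productPrice")
        if title is None or price is None:
            continue  # skip grid items that lack a title or a price
        parts = price.text.split("$")
        if len(parts) < 2:
            continue  # no "$" in the price text
        rows.append((title.text.strip(), parts[1].strip()))
    return rows


def scrape_all(
    urls: Iterable[str],
    fetch: Callable[[str], str] = fetch_with_selenium,
    parse: Callable[[str], List[Tuple[str, str]]] = parse_products,
) -> List[Tuple[str, str]]:
    """Visit each URL in turn and collect the parsed rows from every page."""
    rows = []
    for url in urls:       # url is one string per iteration
        html = fetch(url)  # pass url, never urls
        rows.extend(parse(html))
    return rows
```

Calling scrape_all(urls) with the list from the question would then open each page in turn. Passing fetch and parse as parameters is only a convenience for trying the loop with stand-in functions before pointing it at the real site.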

    【Comments】:

    • That did it! Thank you so much!