【Question title】: Scrape data from a website whose URL doesn't change
【Posted】: 2020-12-19 07:52:33
【Question】:

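I am new to web scraping, but I have enough command of requests, BeautifulSoup and Selenium to extract data from a website. The problem now is that I am trying to scrape data from a website where the URL does not change when I click the page number for the next page.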

(Screenshot: page number in inspection)

Website URL: https://www.ellsworth.com/products/adhesives/

I also tried Google Developer Tools, but without success. If anyone could guide me with code, I would be grateful. (Screenshot: Google Developer Tools showing the GET request)

Here is my code:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup
import pandas as pd
import requests

itemproducts = pd.DataFrame()
# Selenium 4 expects the driver path to be wrapped in a Service object.
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

driver.get('https://www.ellsworth.com/products/adhesives/')
base_url = 'https://www.ellsworth.com'
html= driver.page_source
s = BeautifulSoup(html,'html.parser')
data = []

href_link = s.find_all('div',{'class':'results-products-item-image'})
for links in href_link:
    href_link_a = links.find('a')['href']
    data.append(base_url+href_link_a)
# url = 'https://www.ellsworth.com/products/adhesives/silicone/dow-838-silicone-adhesive-sealant-white-90-ml-tube/'

for c in data:
    driver.get(c)
    html_pro = driver.page_source
    soup = BeautifulSoup(html_pro,'html.parser')
    title = soup.find('span',{'itemprop':'name'}).text.strip()
    part_num = soup.find('span',{'itemprop':'sku'}).text.strip()
    manfacture = soup.find('span',{'class':'manuSku'}).text.strip()
    manfacture_ = manfacture.replace('Manufacturer SKU:', '').strip()
    pro_det = soup.find('div',{'class':'product-details'})
    p = pro_det.find_all('p')
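    # Paragraphs may be missing on some pages; default to '' rather than
    # silently reusing a value left over from an earlier iteration.
    d = p[1].text.strip() if len(p) > 1 else ''
    # find_all() returns a ResultSet, which has no .text, so join the paragraphs.
    c = ' '.join(x.text.strip() for x in p)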
    table = pro_det.find('table',{'class':'table'})
    tr = table.find_all('td')
    typical = tr[1].text.strip()
    brand = tr[3].text.strip()
    color = tr[5].text.strip()
    image = soup.find('img',{'itemprop':'image'})['src']
    image_ = base_url + image
    png_url = title +('.jpg')
    img_data = requests.get(image_).content
    with open(png_url,'wb') as fh:
        fh.write(img_data)

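    # DataFrame.append was removed in pandas 2.0; concatenate a one-row frame instead.
    row = pd.DataFrame([{'Product Title':title,
                         'Part Number':part_num,
                         'SKU':manfacture_,
                         'Description d':d,
                         'Description c':c,
                         'Typical':typical,
                         'Brand':brand,
                         'Color':color,
                         'Image URL':image_}])
    itemproducts = pd.concat([itemproducts, row], ignore_index=True)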

【Comments】:

Tags: python selenium web-scraping beautifulsoup python-requests


【Solution 1】:

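The page content is rendered dynamically, but if you inspect the XHR tab under Network in the developer tools, you can find the API request URL. I have shortened the URL a bit, but it still works.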

Here is how you can get a list of the first 10 products from page 1:

import requests

start = 0
n_items = 10

api_request_url = f"https://www.ellsworth.com/api/catalogSearch/search?sEcho=1&iDisplayStart={start}&iDisplayLength={n_items}&DefaultCatalogNode=Adhesives&_=1497895052601"

data = requests.get(api_request_url).json()

print(f"Found: {data['iTotalRecords']} items.")

for item in data["aaData"]:
    print(item)

This gives you a nice JSON response containing all the data for each product, which should get you started.

['Sauereisen Insa-Lute Adhesive Cement No. P-1 Powder Off-White 1 qt Can', 'P-1-INSA-LUTE-ADHESIVE', 'P-1 INSA-LUTE ADHESIVE', '$72.82', '/products/adhesives/ceramic/sauereisen-insa-lute-adhesive-cement-no.-p-1-powder-off-white-1-qt-can/', '/globalassets/catalogs/sauereisen-insa-lute-cement-no-p-1-off-white-1qt_170x170.jpg', 'Adhesives-Ceramic', '[{"qty":"1-2","price":"$72.82","customerPrice":"$72.82","eachPrice":"","custEachPrice":"","priceAmount":"72.820000000","customerPriceAmount":"72.820000000","currency":"USD"},{"qty":"3-15","price":"$67.62","customerPrice":"$67.62","eachPrice":"","custEachPrice":"","priceAmount":"67.620000000","customerPriceAmount":"67.620000000","currency":"USD"},{"qty":"16+","price":"$63.36","customerPrice":"$63.36","eachPrice":"","custEachPrice":"","priceAmount":"63.360000000","customerPriceAmount":"63.360000000","currency":"USD"}]', '', '', '', 'P1-Q', '1000', 'true', 'Presentation of packaged goods may vary. For special packaging requirements, please call (877) 454-9224', '', '', '']
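Each entry in aaData is a positional list rather than a keyed object, so it can help to give the fields names before working with them. The sketch below shows one possible way to do that. The field names, and the meaning assigned to each position, are guesses inferred from the sample row above and are not documented by the site; parse_row is a hypothetical helper, and the sample row is trimmed to its first eight fields and to the tier keys the code uses.

```python
import json
from urllib.parse import urljoin

BASE_URL = "https://www.ellsworth.com"

# Assumed meaning of the first eight positions, inferred from the sample row.
FIELDS = ["title", "part_number", "sku", "price", "path", "image", "category", "tiers"]


def parse_row(row):
    """Turn one positional aaData row into a dict with named fields."""
    # zip() stops at the shorter input, so full-length rows work as well.
    item = dict(zip(FIELDS, row))
    # Paths in the response are site-relative, so make them absolute.
    item["url"] = urljoin(BASE_URL, item.pop("path"))
    item["image"] = urljoin(BASE_URL, item["image"])
    # The quantity/price tiers arrive as a JSON string nested inside the row.
    item["tiers"] = [
        {"qty": t["qty"], "amount": float(t["priceAmount"]), "currency": t["currency"]}
        for t in json.loads(item["tiers"] or "[]")
    ]
    return item


# Sample row from the response above, trimmed as described in the text.
sample = [
    "Sauereisen Insa-Lute Adhesive Cement No. P-1 Powder Off-White 1 qt Can",
    "P-1-INSA-LUTE-ADHESIVE",
    "P-1 INSA-LUTE ADHESIVE",
    "$72.82",
    "/products/adhesives/ceramic/sauereisen-insa-lute-adhesive-cement-no.-p-1-powder-off-white-1-qt-can/",
    "/globalassets/catalogs/sauereisen-insa-lute-cement-no-p-1-off-white-1qt_170x170.jpg",
    "Adhesives-Ceramic",
    '[{"qty":"1-2","priceAmount":"72.820000000","currency":"USD"},'
    '{"qty":"3-15","priceAmount":"67.620000000","currency":"USD"},'
    '{"qty":"16+","priceAmount":"63.360000000","currency":"USD"}]',
]

product = parse_row(sample)
print(product["url"])
print([(t["qty"], t["amount"]) for t in product["tiers"]])
```

With named fields, the image URL and product page URL can be used directly, without the per-product Selenium page loads in the question's code, provided the fields in the response cover what you need.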

To get the next 10 items, change the value of iDisplayStart to 10. If you want more items per request, just change iDisplayLength, for example to 20.
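Because iDisplayStart is a zero-based item offset, the offset for a given page works out to (page - 1) * page size, assuming the site's pages are numbered from 1. A minimal sketch of that mapping follows; api_url is a hypothetical helper, and the query parameters are copied from the request URL above.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://www.ellsworth.com/api/catalogSearch/search"


def api_url(page, n_items=10):
    """Build the request URL for a 1-based page number (hypothetical helper)."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    params = {
        "sEcho": 1,
        "iDisplayStart": (page - 1) * n_items,  # zero-based offset of the first item
        "iDisplayLength": n_items,
        "DefaultCatalogNode": "Adhesives",
        "_": 1497895052601,
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"


print(api_url(1))              # iDisplayStart=0
print(api_url(2))              # iDisplayStart=10
print(api_url(3, n_items=20))  # iDisplayStart=40
```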

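In the demo I replaced these values with start and n_items, but you can easily automate this, because the total number of items found is returned with the response as iTotalRecords.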
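One way to automate the paging is sketched below: request the first page, read iTotalRecords, then step the offset until every item has been collected. The names fetch_page, fetch_all and fake_fetch are hypothetical; fake_fetch is an offline stand-in so the loop can be tried without touching the site, and the page size of 50 is an assumption that the server may or may not accept.

```python
def fetch_page(start, n_items):
    """Fetch one page of results from the API URL used in the answer."""
    import requests  # imported here so the paging logic can run without it

    url = (
        "https://www.ellsworth.com/api/catalogSearch/search"
        f"?sEcho=1&iDisplayStart={start}&iDisplayLength={n_items}"
        "&DefaultCatalogNode=Adhesives&_=1497895052601"
    )
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.json()


def fetch_all(n_items=50, fetch=fetch_page):
    """Collect all rows, using iTotalRecords to decide how many requests to make."""
    first = fetch(0, n_items)
    total = int(first["iTotalRecords"])
    rows = list(first["aaData"])
    # Request the remaining pages; the first one is already in hand.
    for start in range(n_items, total, n_items):
        rows.extend(fetch(start, n_items)["aaData"])
    return rows


def fake_fetch(start, n_items):
    """Offline stand-in for the API: 25 numbered items, for demonstration only."""
    items = list(range(25))
    return {"iTotalRecords": 25, "aaData": items[start:start + n_items]}


rows = fetch_all(n_items=10, fetch=fake_fetch)
print(len(rows))  # 25 items, gathered in three requests
```

Calling fetch_all() with no arguments would query the real endpoint. Adding a short pause between requests would be a considerate addition when running it against the live site.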

【Comments】:
