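【Posted】: 2022-01-21 14:43:41
【Question】:
I want to scrape all the products from this website, which has no pagination buttons. Products are loaded automatically as you scroll. My script only scrapes the first 40 products. I noticed that the products seem to be loaded dynamically via the data-page attribute of a div tag. I would like my script to keep changing the data-page value and load the products, but I don't know how to do that.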
from bs4 import BeautifulSoup
import requests
import pandas as pd
import time

url = 'https://www.positivepromotions.com/custom-blankets/c/navpp_1001_114/'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36'}
result = requests.get(url, headers=headers, timeout=5000)
data = result.content.decode()
soup1 = BeautifulSoup(data, 'lxml')

## get the category
## get the first container
subcategory = soup1.find('h1').text.strip()
itemlist = []
for soup in soup1.find_all('div', class_='row cat-prod-list'):
    for x in range(1, 4):
        #for pages in soup.find_all('div', id='categoryProducts', attrs={'data-page': True}):
        for pages in soup.select('div[data-page]', id='categoryProducts'):
            print(pages['data-page'])
            for productList in pages.find_all('div', class_='col-sm-4 col-md-3 cat-prod-container'):
                title = productList.find('a', class_='product-title').text.strip()
                price = productList.find('span', class_='cat-price').text.strip().split('-', 1)[0]
                sku = productList.find('div', class_='grid-prod-sku').text.strip()
                #productlist = soup.find_all('div', class_='prod-img-wrap')
                links = productList.find('a', class_='cat-prod-img', href=True)['href']
                image = productList.find('img')['data-src'].split('?', 1)[0]
                items = {
                    'Title': title,
                    'Price': price,
                    'Sku': sku,
                    'Category': subcategory,
                    'Link': links,
                    'Image': image
                }
                itemlist.append(items)
                ##print('Saving : ',title)
        #time.sleep(1)

# print total products found
print(len(itemlist))
#df = pd.DataFrame(itemlist)
##print(df.head(5))
#df.to_csv(subcategory+'.csv')
###
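
The "keep changing the data-page value" idea can be sketched as a generic loop, shown here only as a rough outline. The question does not say how the site actually serves later pages, so `fetch_page` and `parse_products` are hypothetical callables the reader would have to supply: the first returns the HTML for a given page number, the second turns that HTML into a list of product dicts. The loop stops at the first page that yields no products.

```python
def collect_all_pages(fetch_page, parse_products, first_page=1, max_pages=100):
    """Request page after page until one comes back with no products.

    fetch_page and parse_products are hypothetical helpers (assumptions,
    not part of the original script):
      fetch_page(page_number) -> HTML string for that data-page value
      parse_products(html)    -> list of product dicts found in the HTML
    """
    items = []
    for page in range(first_page, first_page + max_pages):
        html = fetch_page(page)
        products = parse_products(html)
        if not products:
            # An empty page is treated as the end of the listing.
            break
        items.extend(products)
    return items
```

The `max_pages` cap is there so the loop cannot run forever if the site keeps returning the same products for every page number.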
【Comments】:
- Try using selenium to load the page and scroll before capturing the data.
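
A minimal sketch of what the comment above suggests, assuming Chrome and a matching driver are available locally. The helper names `scroll_to_bottom` and `fetch_full_page`, the pause length and the round limit are all illustrative choices, not something taken from the question. The idea is to scroll to the bottom repeatedly until the page height stops growing, then hand the final page source to BeautifulSoup as before.

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=50):
    """Scroll down until the page height stops growing.

    Returns the number of scrolls performed. `pause` and `max_rounds`
    are arbitrary defaults chosen for illustration.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazily loaded products time to arrive
        rounds += 1
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            # Nothing new was appended, so assume the listing is complete.
            break
        last_height = new_height
    return rounds


def fetch_full_page(url):
    """Open the URL in Chrome, scroll to the end, and return the HTML."""
    # Imported here so scroll_to_bottom can be used without selenium installed.
    from selenium import webdriver

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        scroll_to_bottom(driver)
        return driver.page_source
    finally:
        driver.quit()
```

The string returned by `fetch_full_page(url)` could then replace `data` in the original script, so the existing parsing loop runs over every product that was loaded while scrolling.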
Tags: python web-scraping beautifulsoup