【Posted】: 2022-01-22 16:54:10
【Question】:
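Hello, I am trying to get all the links from the web page below. The page loads new products as you scroll down, and I am trying to collect the links for all products by scrolling to the bottom of the page. Following this post, I am using the scrolldown method of requests_html, but it only fetches the links of products that are visible without scrolling. The problem is that it scrolls the whole page rather than the product frame. As the image below shows, products are loaded only when you scroll to the bottom of the product frame.
I also tried seleniumwire (see the code below), but it does the same thing: it scrolls to the bottom of the page, where no products are loaded. How can I scroll only the product div?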
# Attempt 1: seleniumwire
import time

from bs4 import BeautifulSoup
# seleniumwire's webdriver replaces selenium's and adds request interception
from seleniumwire import webdriver

baseurl = "https://www.medplusmart.com/categories/personal-care_10102/skin-care_20002"
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/74.0.3729.169 Safari/537.36'
}
SCROLL_PAUSE_TIME = 2

driver = webdriver.Chrome(executable_path="/src/resources/chromedriver")
driver.implicitly_wait(30)
product_links = []


def interceptor(request):
    # Header values must be strings, so send the User-Agent string rather than the whole dict
    del request.headers['User-Agent']  # Delete the header first to avoid a duplicate
    request.headers['User-Agent'] = header['User-Agent']


try:
    # Set the interceptor on the driver; all requests will now carry the User-Agent above
    driver.request_interceptor = interceptor
    driver.get(baseurl)

    # Scroll the window until the document height stops growing.
    # This scrolls the whole page, not the product frame, which is the problem described above.
    last_height = driver.execute_script("return document.body.scrollHeight")
    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(SCROLL_PAUSE_TIME)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height

    print(driver.page_source)
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    # product_list = soup.find_all('div', class_='col-item productInfoDiv ')
    #
    # for itemprop in product_list:
    #     for link in itemprop.find_all('a', href=True):
    #         product_links.append("{}{}".format(baseurl, link['href']))
    #
    # product_links_uniq = set(product_links)
    #
    # print(product_links_uniq)
finally:
    driver.quit()

# Attempt 2: requests_html
from requests_html import HTMLSession

baseurl = "https://www.medplusmart.com/categories/personal-care_10102/skin-care_20002"
session = HTMLSession()
page = session.get(baseurl)

# render() executes the page's JavaScript and updates page.html in place
page.html.render(scrolldown=50, sleep=3)

# Read links from the rendered DOM; page.text still holds the pre-render response body
all_links = page.html.links
for ln in all_links:
    print(ln)
print(len(all_links))

filtered_links = [link for link in all_links if link.startswith("/product")]
print(len(filtered_links))
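
A side note on the link-building step: the commented-out loop above joins `baseurl` and each `href` by plain concatenation, which would append a path such as `/product/...` to the full category URL. A minimal standard-library sketch of resolving such links against the site root is shown here. The names `ProductLinkParser` and `extract_product_links` are hypothetical, and the `/product` prefix is taken from the filter in the post rather than verified against the site.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ProductLinkParser(HTMLParser):
    """Collect absolute URLs for <a href> values that start with a path prefix."""

    def __init__(self, base_url, prefix="/product"):
        super().__init__()
        self.base_url = base_url
        self.prefix = prefix
        self.links = set()  # a set removes duplicate links

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.startswith(self.prefix):
            # urljoin resolves a root-relative href against the base URL's host
            self.links.add(urljoin(self.base_url, href))


def extract_product_links(page_source, base_url, prefix="/product"):
    """Return the sorted, de-duplicated absolute product links found in page_source."""
    parser = ProductLinkParser(base_url, prefix)
    parser.feed(page_source)
    return sorted(parser.links)
```

With Selenium this could be called as `extract_product_links(driver.page_source, baseurl)` once the product frame has finished loading.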
【Discussion】:
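
One possible approach to the question, offered as a sketch rather than a confirmed fix for this site: instead of calling `window.scrollTo`, pass the scrollable product element to `execute_script` and set its `scrollTop` to its `scrollHeight`, repeating until the element stops growing. The helper name `scroll_container_to_end` and its parameters are hypothetical. It assumes the product frame is an ordinary scrollable element that lazy-loads items when its own scroll position reaches the bottom.

```python
import time


def scroll_container_to_end(driver, container, pause=2.0, max_rounds=50):
    """Scroll one scrollable element (not the window) until it stops growing.

    `driver` is any object exposing Selenium's execute_script(script, *args);
    `container` is the WebElement of the scrollable product frame.
    Returns the number of scroll steps performed.
    """
    get_height = "return arguments[0].scrollHeight;"
    scroll_down = "arguments[0].scrollTop = arguments[0].scrollHeight;"

    last_height = driver.execute_script(get_height, container)
    rounds = 0
    while rounds < max_rounds:
        driver.execute_script(scroll_down, container)
        rounds += 1
        time.sleep(pause)  # give lazily loaded products time to arrive
        new_height = driver.execute_script(get_height, container)
        if new_height == last_height:
            break  # no new content was appended to the container
        last_height = new_height
    return rounds
```

To use it, locate the frame first, for example with `driver.find_element(By.CSS_SELECTOR, ...)`, and pass the resulting element as `container`. The selector for the product frame is not given in the post and has to be found with the browser's developer tools. The `max_rounds` cap is there so that an endlessly growing list cannot keep the loop running forever.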
Tags: selenium selenium-webdriver web-scraping python-requests-html