【Question Title】: How to wait for a page to fully load using requests_html
【Posted】: 2021-09-12 09:53:39
【Question Description】:

When I access https://www.dickssportinggoods.com/f/tents-accessories?pageNumber=2 with requests_html, I need to wait a while for the page to actually finish loading. Is that possible? My code:

from requests_html import HTMLSession
from bs4 import BeautifulSoup
from lxml import etree

s = HTMLSession()
response = s.get(
    'https://www.dickssportinggoods.com/f/tents-accessories?pageNumber=2')
response.html.render()


soup = BeautifulSoup(response.content, "html.parser")
dom = etree.HTML(str(soup))
item = dom.xpath('//a[@class="rs_product_description d-block"]/text()')[0]
print(item)

【Comments】:

  • That answer says to use "r.html.render()", which I'm already doing.
  • @Ibstam Ch: run pip install requests-html, then use from requests_html import HTMLSession and from requests_html import AsyncHTMLSession.
  • I don't think you have installed requests-html.
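
On the render() point raised in the comments: as far as I know, requests_html's render() accepts a sleep argument (seconds to pause after the initial render) and a timeout argument, and the rendered markup is exposed through response.html, while response.content still holds the original, un-rendered bytes. A minimal sketch under those assumptions; the helper name fetch_first_product_name and the default values of 5 and 30 seconds are hypothetical and not taken from the thread:

```python
def fetch_first_product_name(url, sleep_seconds=5, timeout_seconds=30):
    """Render the page with requests_html and return the first product name, or None."""
    # Imported lazily so that defining this helper does not require the package.
    from requests_html import HTMLSession

    session = HTMLSession()
    try:
        response = session.get(url)
        # sleep: pause after the initial render so late JavaScript can finish.
        # timeout: upper bound for the page load. Both values are guesses.
        response.html.render(sleep=sleep_seconds, timeout=timeout_seconds)
        # Query the rendered DOM; response.content is the un-rendered HTML.
        return response.html.xpath(
            '//a[@class="rs_product_description d-block"]/text()', first=True
        )
    finally:
        session.close()
```

Whether a fixed sleep is enough depends on the site. If the product list is filled in by a later API call that has not finished, the element may still be missing and the helper returns None.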

Tags: python web-scraping python-requests-html


【Solution 1】:

The data you are looking for appears to be available via an HTTP GET to:
https://prod-catalog-product-api.dickssportinggoods.com/v2/search?searchVO=%7B%22selectedCategory%22%3A%2212301_1809051%22%2C%22selectedStore%22%3A%220%22%2C%22selectedSort%22%3A1%2C%22selectedFilters%22%3A%7B%7D%2C%22storeId%22%3A15108%2C%22pageNumber%22%3A2%2C%22pageSize%22%3A48%2C%22totalCount%22%3A112%2C%22searchTypes%22%3A%5B%22PINNING%22%5D%2C%22isFamilyPage%22%3Atrue%2C%22appliedSeoFilters%22%3Afalse%2C%22snbAudience%22%3A%22%22%2C%22zipcode%22%3A%22%22%7D

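The call returns JSON, which you can consume directly with zero scraping code.

Copy/paste the URL into a browser --> view the data.

The page number can be specified in the URL, inside the searchVO parameter: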

searchVO={"selectedCategory":"12301_1809051","selectedStore":"0","selectedSort":1,"selectedFilters":{},"storeId":15108,"pageNumber":2,"pageSize":48,"totalCount":112,"searchTypes":["PINNING"],"isFamilyPage":true,"appliedSeoFilters":false,"snbAudience":"","zipcode":""}

Working code below:

import requests
import pprint

page_num = 2
# page_num is substituted as the value of the URL-encoded "pageNumber" field.
url = f'https://prod-catalog-product-api.dickssportinggoods.com/v2/search?searchVO=%7B%22selectedCategory%22%3A%2212301_1809051%22%2C%22selectedStore%22%3A%220%22%2C%22selectedSort%22%3A1%2C%22selectedFilters%22%3A%7B%7D%2C%22storeId%22%3A15108%2C%22pageNumber%22%3A{page_num}%2C%22pageSize%22%3A48%2C%22totalCount%22%3A112%2C%22searchTypes%22%3A%5B%22PINNING%22%5D%2C%22isFamilyPage%22%3Atrue%2C%22appliedSeoFilters%22%3Afalse%2C%22snbAudience%22%3A%22%22%2C%22zipcode%22%3A%22%22%7D'

r = requests.get(url)
if r.status_code == 200:
    pprint.pprint(r.json())
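
Hand-editing the percent-encoded string is error-prone. A sketch of building the same URL from a plain dict, assuming the API accepts any valid URL-encoding of the searchVO JSON (not verified here); build_search_url is a hypothetical helper name, and the field values are copied from the searchVO shown above:

```python
import json
from urllib.parse import urlencode

API_URL = "https://prod-catalog-product-api.dickssportinggoods.com/v2/search"


def build_search_url(page_number, category="12301_1809051"):
    """Return the search URL with searchVO JSON-encoded into the query string."""
    search_vo = {
        "selectedCategory": category,
        "selectedStore": "0",
        "selectedSort": 1,
        "selectedFilters": {},
        "storeId": 15108,
        "pageNumber": page_number,
        "pageSize": 48,
        "totalCount": 112,
        "searchTypes": ["PINNING"],
        "isFamilyPage": True,
        "appliedSeoFilters": False,
        "snbAudience": "",
        "zipcode": "",
    }
    # Compact separators match the JSON shown in the answer (no spaces).
    payload = json.dumps(search_vo, separators=(",", ":"))
    query = urlencode({"searchVO": payload})
    return API_URL + "?" + query
```

The result can be passed straight to requests.get, for example requests.get(build_search_url(2)). Keeping the parameters in a dict also makes it easier to swap selectedCategory once another category's value is known.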

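【Comments】:

  • Yes, but I can't specify the page number in this request.
  • @IbtsamCh - The answer has been updated with working code that lets you specify the page number. Enjoy :-)
  • Hey, sorry, I also want to change the selected category to the other categories available on dickssportinggoods.com/c/camping-hiking-gear. Is there a way to find these category numbers? I can't find them anywhere.
  • Look at the searchVO dict in my answer. There you can find the available parameters.
  • Yes, we can certainly change it, but I can't find the category numbers for the other categories. The API URL for this category has "selectedCategory":"12301_1809051", and that value is different for other categories.

【Solution 2】: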

Selenium can also run in headless mode.

Selenium can wait until elements are found by using explicit waits.

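from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver_path = "/path/to/chromedriver"  # placeholder: set to your chromedriver location

options = webdriver.ChromeOptions()
options.add_argument('--window-size=1920,1080')
options.add_argument("--headless")
# Selenium 4 takes the driver path via Service; Selenium 3 used executable_path=driver_path.
driver = webdriver.Chrome(service=Service(driver_path), options=options)
driver.get("URL here")
# Wait up to 20 seconds for the first product link to become visible.
wait = WebDriverWait(driver, 20)
element = wait.until(EC.visibility_of_element_located((By.XPATH, "//a[@class='rs_product_description d-block']")))
print(element.text)
driver.quit()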

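PS: You have to download chromedriver first.

【Comments】:

  • Yes, but I want to avoid using Selenium. Is there no other way?
  • What is your reason for not wanting to use Selenium?
  • Because it is also slow and inconsistent. It often works fine, but as soon as I add the headless argument it stops working.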