【Question title】: Not getting all links from webpage
【Posted】: 2021-01-30 10:17:18
【Question】:

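I am working on a web scraping project. The URL of the website I am scraping is https://www.beliani.de/sofas/ledersofa/

I am scraping all the product links listed on this page. I tried getting the links with both Requests-HTML and Selenium, but I get 57 and 24 links respectively, although more than 150 products are listed on the page. Below are the code blocks I am using.

Using Selenium: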

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from time import sleep

options = Options()
options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36")

# Path to the Chrome driver
DRIVER_PATH = 'C:/chromedriver'
driver = webdriver.Chrome(executable_path=DRIVER_PATH, options=options)

url = 'https://www.beliani.de/sofas/ledersofa/'

driver.get(url)
sleep(20)

links = []
for a in driver.find_elements_by_xpath('//*[@id="offers_div"]/div/div/a'):
    print(a)
    links.append(a)
print(len(links))

Using Requests-HTML:

from requests_html import HTMLSession

url = 'https://www.beliani.de/sofas/ledersofa/'

s = HTMLSession()
r = s.get(url)

r.html.render(sleep = 20)

products = r.html.xpath('//*[@id="offers_div"]', first = True)

#Getting 57 links using below block:
links = []
for link in products.absolute_links:
    print(link)
    links.append(link)

print(len(links))

I don't know which step I got wrong or what I am missing.

【Comments】:

    Tags: python selenium web-scraping python-requests-html


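    【Solution 1】:

    You have to scroll through the page until you reach its end so that all of its content gets loaded. Merely opening the site loads only the scripts needed to display the part of the page currently in view, so when your code ran, it could only retrieve data from the parts that had already been loaded.

    This gave me 160 links: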

    driver.get('https://www.beliani.de/sofas/ledersofa/')
    sleep(3)
    
    #gets the whole height of the document
    height = driver.execute_script('return document.body.scrollHeight')
    
    # now break the webpage into parts so that each section in the page is scrolled through to load
    scroll_height = 0
    for i in range(10):
        scroll_height = scroll_height + (height/10)
        driver.execute_script('window.scrollTo(0,arguments[0]);',scroll_height)
        sleep(2)
    
    # The class-name locator is used here; any locator works once the scroll loop has finished
    a_tags = driver.find_elements_by_class_name('itemBox')
    count = 0
    for i in a_tags:
        if i.get_attribute('href') is not None:
            print(i.get_attribute('href'))
            count+=1
    
    print(count)
    driver.quit()
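
The fixed ten-step loop above assumes the page height does not change while scrolling. A minimal sketch of a variant that re-reads the height on every step, so it keeps going if lazy loading makes the page grow. The function name `scroll_in_steps` and its parameters `step`, `pause` and `max_steps` are hypothetical and not part of the original answer; `driver` is assumed to be any object that exposes Selenium's `execute_script` method.

```python
import time


def scroll_in_steps(driver, step=800, pause=1.0, max_steps=200):
    """Scroll down in fixed increments until the bottom of the page is reached.

    The document height is re-read before every step, so content that is
    appended while scrolling extends the loop. Returns the final scroll offset.
    """
    position = 0
    for _ in range(max_steps):
        height = driver.execute_script("return document.body.scrollHeight")
        if position >= height:
            break  # already at the bottom and the page did not grow
        # Never scroll past the current end of the document.
        position = min(position + step, height)
        driver.execute_script("window.scrollTo(0, arguments[0]);", position)
        time.sleep(pause)  # give lazily loaded items time to appear
    return position
```

It could be called as `scroll_in_steps(driver)` in place of the `for i in range(10)` loop, before locating the elements. The `max_steps` cap is only a guard against pages that grow indefinitely.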
    

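    【Comments】:

      【Solution 2】:

      To extract the total number of links using Selenium, you need to accept the cookies and induce WebDriverWait for visibility_of_all_elements_located(). You can use either of the following Locator Strategies:

      • Using CSS_SELECTOR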

        driver.get("https://www.beliani.de/sofas/ledersofa/")
        WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "input[value='Akzeptieren']"))).click()
        print(len(WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div#offers_div > div > div > a[href]")))))
        
      • Using XPATH

        driver.get("https://www.beliani.de/sofas/ledersofa/")
        WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//input[@value='Akzeptieren']"))).click()
        print(len(WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@id='offers_div']/div/div/a[@href]")))))
        
      • Note: You have to add the following imports:

        from selenium.webdriver.support.ui import WebDriverWait
        from selenium.webdriver.common.by import By
        from selenium.webdriver.support import expected_conditions as EC
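
The snippets above only print how many elements were found. To get the URLs themselves, one possible sketch is a small helper that reads the `href` of each element and drops duplicates. The name `unique_hrefs` is hypothetical; `elements` is assumed to be the list returned by the wait, or any objects with a `get_attribute` method.

```python
def unique_hrefs(elements):
    """Return the distinct, non-empty href values of `elements`, in page order."""
    seen = set()
    links = []
    for element in elements:
        href = element.get_attribute("href")
        # Skip anchors without an href and links that were already collected.
        if href and href not in seen:
            seen.add(href)
            links.append(href)
    return links
```

For example, the list returned by `WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div#offers_div > div > div > a[href]")))` could be passed straight to `unique_hrefs`, and `len()` of the result then counts distinct product URLs rather than anchor elements.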
        

      【Comments】:
