【Question title】: Why does my web scraper built with Python return an empty list when it should return scraped data?
【Posted】: 2021-07-27 16:36:26
【Question】:

I am trying to scrape product details such as product name, price, category, and colors from https://nike.co.in. Although the script is given the correct XPaths, it does not seem to scrape the details and returns empty lists. Here is my complete script:

import time
import numpy as np
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from webdriver_manager.chrome import ChromeDriverManager


def scrape_nike(shop_by_category):
    website_address = ['https://nike.co.in']
    options = webdriver.ChromeOptions()
    options.add_argument('start-maximized')
    options.add_argument("window-size=1200x600")
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    options.add_experimental_option('useAutomationExtension', False)

    browser = webdriver.Chrome(ChromeDriverManager().install(), options=options)
    delays = [7, 4, 6, 2, 10, 19]
    delay = np.random.choice(delays)
    for crawler in website_address:
        browser.get(crawler)
        time.sleep(2)
        time.sleep(delay)

        browser.find_element_by_xpath('//*[@id="VisualSearchInput"]').send_keys(shop_by_category, Keys.ENTER)
        product_price = browser.find_elements_by_xpath('//*[@id="Wall"]/div/div[5]/div/main/section/div/div[1]/div/figure/div/div[3]/div/div/div/div')
        product_price_list = [elem.text for elem in product_price]
        product_category = browser.find_elements_by_xpath('//*[@id="Wall"]/div/div[5]/div/main/section/div/div[1]/div/figure/div/div[1]/div/div[2]')
        product_category_list = [elem.text for elem in product_category]
        product_name = browser.find_elements_by_xpath('//*[@id="Nike Air Zoom Vomero 15"]')
        product_name_list = [elem.text for elem in product_name]
        product_colors = browser.find_elements_by_xpath('//*[@id="Wall"]/div/div[5]/div/main/section/div/div[4]/div/figure/div/div[2]/div/button/div')
        product_colors_list = [elem.text for elem in product_colors]
        print(product_price_list)
        print(product_category_list)
        print(product_name_list)
        print(product_colors_list)


if __name__ == '__main__':
    category_name_list = ['running']
    for category in category_name_list:
        scrape_nike(category)

The output I want looks like this:

[Rs 1000, Rs 2990, Rs 3000,....]
[Mens running shoes, Womens running shoes, ...]
[Nike Air Zoom Pegasus, Nike Quest 3, ...]
[5 colors, 1 colors, 3 colors, ...]

But the output I get now is:

[]
[]
[]
[]

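What exactly is causing the empty lists? I don't understand. Please help!

Edit: I can now get only a single product's details in each list, whereas I want all products. Here is the change I made to my code: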

product_price = wait.until(EC.presence_of_all_elements_located((By.XPATH, '//*[@id="Wall"]/div/div[5]/div/main/section/div/div[1]/div/figure/div/div[3]/div/div/div/div')))
product_price_list = [elem.text for elem in product_price]
product_category = wait.until(EC.presence_of_all_elements_located((By.XPATH, '//*[@id="Wall"]/div/div[5]/div/main/section/div/div[1]/div/figure/div/div[1]/div/div[2]')))
product_category_list = [elem.text for elem in product_category]
product_name = wait.until(EC.presence_of_all_elements_located((By.XPATH, '//*[@id="Nike Air Zoom Vomero 15"]')))
product_name_list = [elem.text for elem in product_name]
product_colors = wait.until(EC.presence_of_all_elements_located((By.XPATH, '//*[@id="Wall"]/div/div[5]/div/main/section/div/div[4]/div/figure/div/div[2]/div/button/div')))
product_colors_list = [elem.text for elem in product_colors]

This gives:

['₹13,495']
["Men's Running Shoe"]
['Nike Air Zoom Vomero 15']
['5 Colours']

I want multiple entries like this.

Edit 2: I also tried using beautifulsoup4, but it also returned empty output.

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import pandas as pd


def adidas(shop_by_category):
    driver = webdriver.Chrome("F:\\chromedriver\chromedriver.exe")

    titles = []  # List to store name of the product
    prices = []  # List to store price of the product
    category = []  # List to store category of the product
    colors = []  # List to store the no of colors of the product

    # URL to fetch from Can be looped over / crawled multiple urls
    driver.get('https://nike.co.in')
    driver.find_element_by_xpath('//*[@id="VisualSearchInput"]').send_keys(shop_by_category, Keys.ENTER)
    content = driver.page_source
    soup = BeautifulSoup(content, features="lxml")

    # Parsing content
    for div in soup.findAll('div', attrs={'class': 'product-card__body'}):
        name = div.find('div', attrs={'class': 'product-card__title'})
        price = div.find('div', attrs={'class': 'product-price css-11s12ax is-current-price'})
        subtitle = div.find('div', attrs={'class': 'product-card__subtitle'})
        color = div.find('div', attrs={'class': 'product-card__product-count'})
        titles.append(name.text)
        prices.append(price.text)
        category.append(subtitle.text)
        colors.append(color.text)

    # Storing scraped content
    df = pd.DataFrame({'Product Name': titles, 'Price': prices, 'Category': category, 'Colors': colors})
    df.to_csv('adidas.csv', index=False, encoding='utf-8')


if __name__ == '__main__':
    category_name_list = ['running']
    for category in category_name_list:
        adidas(category)

【Comments】:

    Tags: python python-3.x selenium selenium-webdriver beautifulsoup


    【Solution 1】:

    You can use the CLASS_NAME selector to get all the information you need, because each product card has descriptive class names.

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from webdriver_manager.chrome import ChromeDriverManager
    
    driver = webdriver.Chrome(ChromeDriverManager().install())
    
    try:
        # Set the URL explicitly for the example
        driver.get("https://www.nike.com/in/w?q=running&vst=running")
    
        # Click away the blocking popup requesting cookie permissions
        # This is not the way to do it properly. This is to keep the sample short.
        driver.implicitly_wait(10)
        popup = driver.find_element(By.ID, 'hf_cookie_text_cookieAccept')
        popup.click()
        driver.implicitly_wait(10)
    
        # Begin scraping elements
        product_cards_container = driver.find_element(By.CLASS_NAME, "product-grid__items")
        product_cards = product_cards_container.find_elements(By.CLASS_NAME, "product-card")
        for card in product_cards:
            title = card.find_element(By.CLASS_NAME, "product-card__title")
            category = card.find_element(By.CLASS_NAME, "product-card__subtitle")
            colors = card.find_element(By.CLASS_NAME, "product-card__product-count")
            price = card.find_element(By.CLASS_NAME, "product-price")
            print(title.text)
            print(category.text)
            print(colors.text)
            print(price.text)
    
    except Exception as e:
        print(e)
    finally:
        driver.quit()
    

    Which returns items such as:

    Nike Revolution 5 FlyEase
    Men's Running Shoe
    1 Colour
    ₹3,695
    

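    Note how product_cards uses the plural find_elements, which lets us iterate over its children, which in this case are the product cards. Once we have a card WebElement, we can look up our data within the context of each individual card.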
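
As a rough illustration of the per-card lookup described above, the sketch below gathers the four fields into one list per field, which is the shape the question asks for. Searching inside each card also sidesteps the issue with an absolute XPath that contains a positional step such as div[1], which tends to match just one card. The helper names (first_text, cards_to_columns, FIELDS) are made up for this example, the class names are copied from the sample code above and may change on the site, and the string "class name" is the value of Selenium's By.CLASS_NAME, written out so the helpers need no import of their own.

```python
# Locator strategy string; this is the value of selenium's By.CLASS_NAME.
CLASS_NAME = "class name"

# Field name -> CSS class, taken from the sample code in this answer.
# These class names are an assumption about the page and may change.
FIELDS = {
    "name": "product-card__title",
    "category": "product-card__subtitle",
    "colors": "product-card__product-count",
    "price": "product-price",
}


def first_text(card, class_name):
    """Return the text of the first matching child of `card`, or None."""
    # The plural find_elements returns an empty list when nothing matches,
    # instead of raising like the singular find_element does.
    matches = card.find_elements(CLASS_NAME, class_name)
    return matches[0].text if matches else None


def cards_to_columns(cards):
    """Turn an iterable of card elements into one list per field."""
    columns = {field: [] for field in FIELDS}
    for card in cards:
        for field, class_name in FIELDS.items():
            columns[field].append(first_text(card, class_name))
    return columns
```

With a live driver this would be called as cards_to_columns(driver.find_elements(By.CLASS_NAME, "product-card")). Because first_text uses the plural find_elements, a card that lacks a field (for example a colour count) contributes None rather than raising NoSuchElementException, so the four lists stay aligned.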

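    You used explicit waits in your edit to the question, so I assume you understand why they are better than a random time.sleep(). Still, I will link to the documentation on explicit waits, because they are useful for tasks where the cards are "lazy loaded". You may also need to scroll to the bottom of the page to collect all the product cards; you can see how to do that in the documentation or in my previous answer to a similar question.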
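
One possible way to handle the lazy loading mentioned above is to keep scrolling to the bottom until the number of cards stops growing. The sketch below is only an assumption about how the page behaves, not a tested recipe for this site: load_all_cards is a hypothetical helper, "product-card" is the class name from the sample code, and the fixed pause is a simplification that could be swapped for a WebDriverWait with a custom condition.

```python
import time


def load_all_cards(driver, card_class="product-card", pause=2.0, max_rounds=30):
    """Scroll to the bottom repeatedly until no new cards appear.

    `driver` is expected to behave like a selenium WebDriver; the string
    "class name" is the value of selenium's By.CLASS_NAME.
    """
    seen = 0
    cards = []
    for _ in range(max_rounds):
        # Jump to the bottom so the page requests the next batch of cards.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazy-loaded cards time to render
        cards = driver.find_elements("class name", card_class)
        if len(cards) == seen:
            break  # nothing new was added in this round
        seen = len(cards)
    return cards
```

The max_rounds cap is there so the loop still ends on a page that keeps adding items indefinitely. The returned list can then be iterated in the same way as product_cards in the sample code.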

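    【Discussion】:

    • Hi Lucan, thanks for the detailed answer. I used the code you provided and ran it in my script. It doesn't display anything.
    • While testing, I didn't notice that I was being redirected to nike.com/in because I'm not in India. Could you check whether the classes are the same as the ones I used in the sample code?
    • Yes, I have checked, and they are the same. They still don't show anything in the output.
    • I have edited the question and used beautifulsoup4, but it still doesn't show anything.
    • @technophile_3 I have updated my answer to include everything I used to get results, including some non-production-ready code to click away the cookie popup, and the results this gave me.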