Why does my web scraper built with Python return an empty list when it should return scraped data?

【Question title】: Why does my web scraper built using Python return an empty list when it should return scraped data?
【Posted】: 2021-07-27 16:36:26
【Question】:

I am trying to scrape product details such as product name, price, category, and colors from https://nike.co.in. Although I gave the script the correct XPaths, it does not seem to scrape the details and gives an empty list. Here is my complete script:

import time
import numpy as np
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from webdriver_manager.chrome import ChromeDriverManager


def scrape_nike(shop_by_category):
    website_address = ['https://nike.co.in']
    options = webdriver.ChromeOptions()
    options.add_argument('start-maximized')
    options.add_argument("window-size=1200x600")
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    options.add_experimental_option('useAutomationExtension', False)

    browser = webdriver.Chrome(ChromeDriverManager().install(), options=options)
    delays = [7, 4, 6, 2, 10, 19]
    delay = np.random.choice(delays)
    for crawler in website_address:
        browser.get(crawler)
        time.sleep(2)
        time.sleep(delay)

        browser.find_element_by_xpath('//*[@id="VisualSearchInput"]').send_keys(shop_by_category, Keys.ENTER)
        product_price = browser.find_elements_by_xpath('//*[@id="Wall"]/div/div[5]/div/main/section/div/div[1]/div/figure/div/div[3]/div/div/div/div')
        product_price_list = [elem.text for elem in product_price]
        product_category = browser.find_elements_by_xpath('//*[@id="Wall"]/div/div[5]/div/main/section/div/div[1]/div/figure/div/div[1]/div/div[2]')
        product_category_list = [elem.text for elem in product_category]
        product_name = browser.find_elements_by_xpath('//*[@id="Nike Air Zoom Vomero 15"]')
        product_name_list = [elem.text for elem in product_name]
        product_colors = browser.find_elements_by_xpath('//*[@id="Wall"]/div/div[5]/div/main/section/div/div[4]/div/figure/div/div[2]/div/button/div')
        product_colors_list = [elem.text for elem in product_colors]
        print(product_price_list)
        print(product_category_list)
        print(product_name_list)
        print(product_colors_list)


if __name__ == '__main__':
    category_name_list = ['running']
    for category in category_name_list:
        scrape_nike(category)

The output I want looks like this:

[Rs 1000, Rs 2990, Rs 3000,....]
[Mens running shoes, Womens running shoes, ...]
[Nike Air Zoom Pegasus, Nike Quest 3, ...]
[5 colors, 1 colors, 3 colors, ...]

But the output I am getting now is:

[]
[]
[]
[]

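What exactly is causing the empty lists? I don't understand. Please help!

Edit: I can now get the details of only a single product in each list, whereas I want all products. This is the change I made to my code: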

product_price = wait.until(EC.presence_of_all_elements_located((By.XPATH, '//*[@id="Wall"]/div/div[5]/div/main/section/div/div[1]/div/figure/div/div[3]/div/div/div/div')))
product_price_list = [elem.text for elem in product_price]
product_category = wait.until(EC.presence_of_all_elements_located((By.XPATH, '//*[@id="Wall"]/div/div[5]/div/main/section/div/div[1]/div/figure/div/div[1]/div/div[2]')))
product_category_list = [elem.text for elem in product_category]
product_name = wait.until(EC.presence_of_all_elements_located((By.XPATH, '//*[@id="Nike Air Zoom Vomero 15"]')))
product_name_list = [elem.text for elem in product_name]
product_colors = wait.until(EC.presence_of_all_elements_located((By.XPATH, '//*[@id="Wall"]/div/div[5]/div/main/section/div/div[4]/div/figure/div/div[2]/div/button/div')))
product_colors_list = [elem.text for elem in product_colors]

This gives:

['₹13,495']
["Men's Running Shoe"]
['Nike Air Zoom Vomero 15']
['5 Colours']

I want multiple entries like this.

EDIT-2: I also tried using beautifulsoup4, but it returned empty output as well.

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import pandas as pd


def adidas(shop_by_category):
    driver = webdriver.Chrome("F:\\chromedriver\chromedriver.exe")

    titles = []  # List to store name of the product
    prices = []  # List to store price of the product
    category = []  # List to store category of the product
    colors = []  # List to store the no of colors of the product

    # URL to fetch from Can be looped over / crawled multiple urls
    driver.get('https://nike.co.in')
    driver.find_element_by_xpath('//*[@id="VisualSearchInput"]').send_keys(shop_by_category, Keys.ENTER)
    content = driver.page_source
    soup = BeautifulSoup(content, features="lxml")

    # Parsing content
    for div in soup.findAll('div', attrs={'class': 'product-card__body'}):
        name = div.find('div', attrs={'class': 'product-card__title'})
        price = div.find('div', attrs={'class': 'product-price css-11s12ax is-current-price'})
        subtitle = div.find('div', attrs={'class': 'product-card__subtitle'})
        color = div.find('div', attrs={'class': 'product-card__product-count'})
        titles.append(name.text)
        prices.append(price.text)
        category.append(subtitle.text)
        colors.append(color.text)

    # Storing scraped content
    df = pd.DataFrame({'Product Name': titles, 'Price': prices, 'Category': category, 'Colors': colors})
    df.to_csv('adidas.csv', index=False, encoding='utf-8')


if __name__ == '__main__':
    category_name_list = ['running']
    for category in category_name_list:
        adidas(category)

【Comments】:

Tags: python python-3.x selenium selenium-webdriver beautifulsoup

【Solution 1】:

You can get all the information you need with the CLASS_NAME selector, because each product card has descriptive class names.

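from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager

# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))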

try:
    # Set the URL explicitly for the example
    driver.get("https://www.nike.com/in/w?q=running&vst=running")

    # Click away the blocking popup requesting cookie permissions
    # This is not the way to do it properly. This is to keep the sample short.
    driver.implicitly_wait(10)
    popup = driver.find_element(By.ID, 'hf_cookie_text_cookieAccept')
    popup.click()
    driver.implicitly_wait(10)

    # Begin scraping elements
    product_cards_container = driver.find_element(By.CLASS_NAME, "product-grid__items")
    product_cards = product_cards_container.find_elements(By.CLASS_NAME, "product-card")
    for card in product_cards:
        title = card.find_element(By.CLASS_NAME, "product-card__title")
        category = card.find_element(By.CLASS_NAME, "product-card__subtitle")
        colors = card.find_element(By.CLASS_NAME, "product-card__product-count")
        price = card.find_element(By.CLASS_NAME, "product-price")
        print(title.text)
        print(category.text)
        print(colors.text)
        print(price.text)

except Exception as e:
    print(e)
finally:
    driver.quit()

This returns, for a single element:

Nike Revolution 5 FlyEase
Men's Running Shoe
1 Colour
₹3,695

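Note how product_cards uses the plural find_elements, which lets us iterate over its children, in this case the product cards. Once we have a card's WebElement, we can find our data within the context of that individual card.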
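
The per-card lookup can also be wrapped in a small helper that returns one row per card, which is closer to the lists the question asks for. This is only a minimal sketch: the names card_to_row, cards_to_rows, and FIELDS are hypothetical, the class names are the ones used in the answer's code, and it assumes each field is the first match inside the card.

```python
# "class name" is the string value of Selenium's By.CLASS_NAME; using the
# literal keeps this helper importable without Selenium installed.
CLASS_NAME = "class name"

# Output key -> CSS class on the product card (class names from the answer).
FIELDS = {
    "name": "product-card__title",
    "category": "product-card__subtitle",
    "colors": "product-card__product-count",
    "price": "product-price",
}


def card_to_row(card):
    """Return a dict of field texts for one product-card element.

    `card` is anything with a Selenium-style find_elements(by, value) method.
    find_elements returns an empty list instead of raising when nothing
    matches, so a card that lacks a field yields None for that field.
    """
    row = {}
    for key, class_name in FIELDS.items():
        matches = card.find_elements(CLASS_NAME, class_name)
        row[key] = matches[0].text.strip() if matches else None
    return row


def cards_to_rows(cards):
    """Convert an iterable of card elements into a list of row dicts."""
    return [card_to_row(card) for card in cards]
```

With the answer's variables, rows = cards_to_rows(product_cards) followed by [row["price"] for row in rows] would give the price list. Using the plural find_elements here is a design choice: a card without a colour count, for example, produces None rather than an exception that aborts the whole loop.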

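You used explicit waits in the edit to your question, so I assume you understand why they are better than a random time.sleep(). Still, I'll link to the documentation on explicit waits, because it will be useful for a task where the cards are "lazy loaded". You may also need to scroll to the bottom of the page to collect all the product cards; you can see how to do that in the documentation or in my previous answer to a similar question.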
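
Scrolling to the bottom could look roughly like the sketch below. It is not the code from the linked answer: scroll_to_load_all, pause, and max_rounds are hypothetical names, and stopping once document.body.scrollHeight no longer grows is an assumption about how the page lazy-loads its cards. The only driver method used is execute_script.

```python
import time


def scroll_to_load_all(driver, pause=1.0, max_rounds=20):
    """Scroll to the bottom until the page height stops growing.

    Returns the number of scrolls performed (at most max_rounds).
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for rounds in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazily loaded cards time to render
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return rounds  # nothing new was loaded by the last scroll
        last_height = new_height
    return max_rounds
```

Calling scroll_to_load_all(driver) before collecting product_cards should let find_elements see every loaded card. The fixed pause keeps the sketch short; an explicit wait on the number of cards would be the more robust replacement.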

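【Discussion】:

Hi Lucan, thanks for the detailed answer. I used the code you provided and ran it in my script. It doesn't display anything.
While testing, I didn't notice that I was being redirected to nike.com/in because I am not in India. Could you check whether the classes are the same as the ones I used in the example code?
Yes, I have checked and they are the same. They still don't show anything in the output.
I have edited the question and used beautifulsoup4, but it still doesn't show anything.
@technophile_3 I have updated my answer to include everything I used to get results, including some non-production-ready code for clicking away the cookie popup, and the results this gives me.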