Scraping content from divs with same class names into arrays [Python]

【Question title】: Scraping content from divs with same class names into arrays [Python]
【Posted】: 2018-04-14 00:49:40
【Question】:

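I have been developing in JavaScript for a while, but Python is still fairly new to me. I am trying to scrape content from a simple web page with Python (basically a product list with different sections). The content is generated dynamically, so I am using the selenium module for this.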

The content is structured like this, with several product sections:

<div class="product-section">
    <div class="section-title">
        Product section name
    </div>
    <ul class="products">
        <li class="product">
            <div class="name">Wooden Table</div>
            <div class="price">99 USD</div>
            <div class="color">White</div>
        </li>
    </ul>
</div>

The Python code for scraping the products:

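from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("website.com")  # placeholder; a real call needs a full URL with its scheme
# Selenium 4.3+ removed find_elements_by_css_selector; use find_elements(By.CSS_SELECTOR, ...)
names = driver.find_elements(By.CSS_SELECTOR, 'div.name')
prices = driver.find_elements(By.CSS_SELECTOR, 'div.price')
colors = driver.find_elements(By.CSS_SELECTOR, 'div.color')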

allNames = [name.text for name in names]
allPrices = [price.text for price in prices]
allColors = [color.text for color in colors]

Now I get the attributes of all products (see below), but I cannot separate them by section.

Current result:
Wooden Table, 99 USD, White
Lawn Chair, 39 USD, Black
Tent - 4 Person, 299 USD, Camo
Etc.

Expected result:
Outdoor Furniture
Wooden Table, 99 USD, White
Lawn Chair, 39 USD, Black
Camping Gear
Tent - 4 Person, 299 USD, Camo
Thermos, 19 USD, Metal

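The end goal is to output the content to an Excel product list, so I need to keep the sections separate (each with its matching section title). Any idea how to separate them, even though they share the same class names?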

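【Comments】:

I suggest you take a look at the Beautiful Soup library at crummy.com/software/BeautifulSoup/bs4/doc
It seems to have the functionality I need, thanks!
BeautifulSoup is a very powerful library, but it may be overkill for a simple task: it is another API to learn. Plain Selenium scraping is well suited to a task like this.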

Tags: python selenium screen-scraping transformation

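【Solution 1】:

You are almost there. To group the products by section, start from each section element and find the elements inside it. Your sample HTML, at least, suggests that the structure allows this.

Building on your code, here is a solution with explanatory comments.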

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('website.com')  # placeholder; a real call needs a full URL with its scheme

# a dict where the key will be the section name
products = {}

# find all top-level sections
# (Selenium 4.3+ removed find_elements_by_css_selector; find_elements(By.CSS_SELECTOR, ...) replaces it)
sections = driver.find_elements(By.CSS_SELECTOR, 'div.product-section')

# iterate over each one
for section in sections:
    # find the products that are descendants of this section
    # note the find is called on section, not on driver
    names = section.find_elements(By.CSS_SELECTOR, 'div.name')
    prices = section.find_elements(By.CSS_SELECTOR, 'div.price')
    colors = section.find_elements(By.CSS_SELECTOR, 'div.color')

    allNames = [name.text for name in names]
    allPrices = [price.text for price in prices]
    allColors = [color.text for color in colors]

    section_name = section.find_element(By.CSS_SELECTOR, 'div.section-title').text

    # add the current scraped section to the products dict
    # I'm leaving it to you to match the name, price and color of each ;)

    products[section_name] = {'names': allNames,
                              'prices': allPrices,
                              'colors': allColors}

# and here's how to access the result

# get the 1st name in a section:
print(products['Product section name']['names'][0])  # will output "Wooden Table"

# iterate over the sections and products:
# (iterating a dict yields its keys, so read the lists from the values via items())
for section_name, fields in products.items():
    print('Section: {}'.format(section_name))
    print('All prices in the section:')
    for price in fields['prices']:
        print(price)
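
The pairing step that the code comments above leave open could be sketched roughly as follows. This assumes the three lists in each section have the same length and order, which holds when every product has exactly one name, price and color, as in the sample HTML. `to_rows` and the `sample` data are hypothetical names used only for illustration; they are not part of Selenium.

```python
def to_rows(products):
    """Flatten {section: {'names': [...], 'prices': [...], 'colors': [...]}}
    into a list of (section, name, price, color) tuples."""
    rows = []
    for section_name, fields in products.items():
        # zip() pairs the i-th name with the i-th price and the i-th color;
        # it stops at the shortest list, so unequal lengths would drop data
        for name, price, color in zip(fields['names'],
                                      fields['prices'],
                                      fields['colors']):
            rows.append((section_name, name, price, color))
    return rows


# Hypothetical data in the same shape as the products dict built above
sample = {
    'Outdoor Furniture': {'names': ['Wooden Table', 'Lawn Chair'],
                          'prices': ['99 USD', '39 USD'],
                          'colors': ['White', 'Black']},
    'Camping Gear': {'names': ['Tent - 4 Person'],
                     'prices': ['299 USD'],
                     'colors': ['Camo']},
}

for row in to_rows(sample):
    print(', '.join(row))
```

If a product can lack one of the fields, it would probably be safer to iterate over the `li.product` elements inside each section and read the three divs per product, rather than zipping three separately collected lists.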

【Discussion】:

Thank you very much! This is exactly the structure I had in mind, but I did not know how to build it.
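
For the Excel goal stated in the question, one possible follow-up is to write the grouped data to a CSV file, which Excel can open. The sketch below is only an illustration under assumptions: it takes a dict shaped like the `products` dict from the solution, and the function name `write_products_csv`, the column headers and the file name `products.csv` are all hypothetical. It uses only the standard library `csv` module.

```python
import csv
import io


def write_products_csv(products, out):
    """Write one CSV row per product, with the section title in the first column."""
    writer = csv.writer(out)
    writer.writerow(['section', 'name', 'price', 'color'])  # hypothetical headers
    for section_name, fields in products.items():
        # assumes the three lists line up one-to-one within a section
        for name, price, color in zip(fields['names'],
                                      fields['prices'],
                                      fields['colors']):
            writer.writerow([section_name, name, price, color])


# Hypothetical data in the shape of the products dict from the solution
sample = {
    'Outdoor Furniture': {'names': ['Wooden Table', 'Lawn Chair'],
                          'prices': ['99 USD', '39 USD'],
                          'colors': ['White', 'Black']},
    'Camping Gear': {'names': ['Tent - 4 Person'],
                     'prices': ['299 USD'],
                     'colors': ['Camo']},
}

# Write to an in-memory buffer here so the example has no side effects
buffer = io.StringIO()
write_products_csv(sample, buffer)
print(buffer.getvalue())

# To write a real file instead (products.csv is a hypothetical name):
# with open('products.csv', 'w', newline='', encoding='utf-8') as f:
#     write_products_csv(sample, f)
```

Repeating the section title on every row keeps the sheet easy to filter and sort in Excel; a layout with the title on its own line, as in the expected result, would need a small change to the loop.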