How does one scrape all the products from a random website? Answers

【Question title】: How does one scrape all the products from a random website?
【Posted】: 2018-06-09 10:54:28
【Question】:

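I am trying to get all the products from this website, but I suspect I have not picked the best approach, because some products are missing and I cannot figure out why. This is not the first time I have run into this problem.

My current approach is:

Go to the website's index page
Get all the categories from there (A-Z 0-9)
Visit each of those categories and recursively walk through all of its subcategories until a products page is reached
On a products page, check whether the product has more SKUs. If it does, get their links. Otherwise, that is the only SKU.

The code below works, but it does not get all the products, and I cannot see any reason why it would skip some of them. Maybe the way I approach everything is wrong.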

from lxml import html
from random import randint
from string import ascii_uppercase
from time import sleep
from requests import Session


INDEX_PAGE = 'https://www.richelieu.com/us/en/index'
session_ = Session()


def retry(link):
    wait = randint(0, 10)
    try:
        return session_.get(link).text
    except Exception as e:
        print('Retrying product page in {} seconds because: {}'.format(wait, e))
        sleep(wait)
        return retry(link)


def get_category_sections():
    au = list(ascii_uppercase)
    au.remove('Q')
    au.remove('Y')
    au.append('0-9')
    return au


def get_categories():
    html_ = retry(INDEX_PAGE)
    page = html.fromstring(html_)
    sections = get_category_sections()

    for section in sections:
        for link in page.xpath("//div[@id='index-{}']//li/a/@href".format(section)):
            yield '{}?imgMode=m&sort=&nbPerPage=200'.format(link)


def dig_up_products(url):
    html_ = retry(url)
    page = html.fromstring(html_)

    for link in page.xpath(
            '//h2[contains(., "CATEGORIES")]/following-sibling::*[@id="carouselSegment2b"]//li//a/@href'
    ):
        yield from dig_up_products(link)

    for link in page.xpath('//ul[@id="prodResult"]/li//div[@class="imgWrapper"]/a/@href'):
        yield link

    for link in page.xpath('//*[@id="ts_resultList"]/div/nav/ul/li[last()]/a/@href'):
        if link != '#':
            yield from dig_up_products(link)


def check_if_more_products(tree):
    more_prods = [
        all_prod
        for all_prod in tree.xpath("//div[@id='pm2_prodTableForm']//tbody/tr/td[1]//a/@href")
    ]
    if not more_prods:
        return False
    return more_prods


def main():
    for category_link in get_categories():
        for product_link in dig_up_products(category_link):
            product_page = retry(product_link)
            product_tree = html.fromstring(product_page)
            more_products = check_if_more_products(product_tree)
            if not more_products:
                print(product_link)
            else:
                for sku_product_link in more_products:
                    print(sku_product_link)


if __name__ == '__main__':
    main()

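Now, this question may be too generic, but I wonder whether there is a rule of thumb to follow when someone wants to get all the data (in this case, products) from a website. Could someone walk me through the whole process of discovering the best way to handle a scenario like this?

【Comments】:

Which ones are skipped? Are they the same on every run, or different each time?
Different, as far as I can tell. But it is really hard to say, because I get 130k products and more than 60% of them are duplicates.
"Could someone please walk me through the whole process of discovering what's the best way to approach a scenario like this?" I don't think there is one "process" that always works. For example, some websites employ various anti-scraping measures that make this hard to do. It may also be illegal. The richelieu.com terms and conditions prohibit directly or indirectly using any data-mining method or tool, search robot, or similar automated tool or method to collect data from the material (richelieu.com/filiales/RC/html/ConditionsAn.html).
@mzjn That was "last updated on February 1, 2006". It may well still apply, but I am doing this for learning purposes.
Why not use BeautifulSoup4? For example, every time you find the ItemImg class, take the href from the preceding anchor tag, follow that page to the item, use a similar approach to get the actual item...

Tags: python python-3.x web-scraping lxml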

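【Solution 1】:

If your end goal is to scrape the entire product listing for each category, it may make sense to target the full product listing of each category from the index page. This program (written for Python 2) uses BeautifulSoup to find every category on the index page and then iterates over every product page under each category. The final output is a list of namedtuples, each storing the category name, the current page link, and the full product titles for that link: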

url = "https://www.richelieu.com/us/en/index"
import urllib
import re
from bs4 import BeautifulSoup as soup
from collections import namedtuple
import itertools
s = soup(str(urllib.urlopen(url).read()), 'lxml')
blocks = s.find_all('div', {'id': re.compile('index\-[A-Z]')})
results_data = {[c.text for c in i.find_all('h2', {'class':'h1'})][0]:[b['href'] for b in i.find_all('a', href=True)] for i in blocks}
final_data = []
category = namedtuple('category', 'abbr, link, products')
for category1, links in results_data.items():
   for link in links:
      page_data = str(urllib.urlopen(link).read())
      print "link: ", link
      page_links = re.findall(';page\=(.*?)#results">(.*?)</a>', page_data)
      if not page_links:
         final_page_data = soup(page_data, 'lxml')
         final_titles = [i.text for i in final_page_data.find_all('h3', {'class':'itemHeading'})]
         new_category = category(category1, link, final_titles)
         final_data.append(new_category)

      else:
         page_numbers = set(itertools.chain(*list(map(list, page_links))))

         full_page_links = ["{}?imgMode=m&sort=&nbPerPage=48&page={}#results".format(link, num) for num in page_numbers]
         for page_result in full_page_links:
            new_page_data = soup(str(urllib.urlopen(page_result).read()), 'lxml')
            final_titles = [i.text for i in new_page_data.find_all('h3', {'class':'itemHeading'})]
            new_category = category(category1, link, final_titles)
            final_data.append(new_category)

print final_data

The output has the following format:

[category(abbr=u'A', link='https://www.richelieu.com/us/en/category/tools-and-shop-supplies/workshop-accessories/tool-accessories/sander-accessories/1058847', products=[u'Replacement Plate for MKT9924DB Belt Sander', u'Non-Grip Vacuum Pads', u'Sandpaper Belt 2\xbd " x 14" for Compact Belt Sander PC371 or PC371K', u'Stick-on Non-Vacuum Pads', u'5" Non-Vacuum Disc Pad Hook-Face', u'Sanding Filter Bag', u'Grip-on Vacuum Pads', u'Plates for Non-Vacuum (Grip-On) Dynabug II Disc Pads - 7.62 cm x 10.79 cm (3" x 4-1/4")', u'4" Abrasive for Finishing Tool', u'Sander Backing Pad for RO 150 Sander', u'StickFix Sander Pad for ETS 125 Sander', u'Sub-Base Pad for Stocked Sanders', u'(5") Non-Vacuum Disc Pad Vinyl-Face', u'Replacement Sub-Base Pads for Stocked Sanders', u"5'' Multi-Hole Non-Vaccum Pad", u'Sander Backing Pad for RO 90 DX Sander', u'Converting Sanding Pad', u'Stick-On Vacuum Pads', u'Replacement "Stik It" Sub Base', u'Drum Sander/Planer Sandpaper'])....

To access each attribute, call:

categories = [i.abbr for i in final_data]
links = [i.link for i in final_data]
products = [i.products for i in final_data]

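I believe the benefit of BeautifulSoup is that it gives a higher level of control over the scrape and is easy to modify. For example, if the OP changes his mind about which aspects of the products/index he wants to scrape, only simple changes to the find_all arguments are needed, since the overall structure of the code above is built around each product category from the index page.

【Comments】:

Though this is a good attempt at solving the problem, I am not sure the OP would switch to it. We have been through multiple rounds of reviews with this code and even suggested a working Scrapy spider, which would dramatically outperform both the OP's code and this solution. I don't know the OP's motivation, but I think what he wants here is to understand why his approach does not scrape all the data. Does yours, and how do you know? Thanks.

【Solution 2】: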

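As @mzjn and @alecxe pointed out, some websites employ anti-scraping measures. To hide its intentions, a scraper should try to mimic a human visitor.

One particular way for websites to detect a scraper is to measure the time between successive page requests. That is why scrapers usually keep a (random) delay between requests.

Besides, hammering a web server that is not yours without giving it some slack is not considered good netiquette.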

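From Scrapy's documentation:

RANDOMIZE_DOWNLOAD_DELAY

Default: True

If enabled, Scrapy will wait a random amount of time (between 0.5 * DOWNLOAD_DELAY and 1.5 * DOWNLOAD_DELAY) while fetching requests from the same website.

This randomization decreases the chance of the crawler being detected (and subsequently blocked) by sites which analyze requests looking for statistically significant similarities in the time between their requests.

The randomization policy is the same used by the wget --random-wait option.

If DOWNLOAD_DELAY is zero (default) this option has no effect.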
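
As a rough illustration of the settings quoted above, a Scrapy project's settings.py might enable the delay like this. DOWNLOAD_DELAY and RANDOMIZE_DOWNLOAD_DELAY are the Scrapy setting names from the quote; the 2-second value is only an assumed example and not something the answer recommends.

```python
# settings.py (sketch): the 2.0 second base delay is an assumed example value.
DOWNLOAD_DELAY = 2.0             # base delay in seconds; at 0 the option below has no effect
RANDOMIZE_DOWNLOAD_DELAY = True  # already the default, spelled out here for clarity

# With these values each wait falls between 0.5 * 2.0 = 1.0 s and 1.5 * 2.0 = 3.0 s.
```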

Oh, and make sure the User-Agent string in your HTTP requests resembles that of an ordinary web browser.
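
Outside Scrapy, the same two ideas (a randomized pause and a browser-like User-Agent) can be sketched with the standard library alone. The names random_delay, build_request and polite_get, the base delay, and the User-Agent string are all illustrative assumptions; the question's own code uses requests.Session instead of urllib.

```python
import random
import time
from urllib.request import Request, urlopen

# Assumed example values, not taken from the answer.
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"
BASE_DELAY = 2.0  # seconds


def random_delay(base=BASE_DELAY):
    """Pick a wait between 0.5 * base and 1.5 * base, as Scrapy's docs describe."""
    return random.uniform(0.5 * base, 1.5 * base)


def build_request(url):
    """Attach a browser-like User-Agent header to the request."""
    return Request(url, headers={"User-Agent": USER_AGENT})


def polite_get(url, base=BASE_DELAY):
    """Sleep for a randomized interval, then fetch the page body as text."""
    time.sleep(random_delay(base))
    with urlopen(build_request(url), timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")
```

With requests, the equivalent would be to set the header once on the session and sleep before each session.get call.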

【Comments】:

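【Solution 3】:

First of all, there is no definite answer to the general question of how you would know whether the data you have already scraped is all the data available. It is at least website-specific and is rarely actually disclosed. Moreover, the data itself may be highly dynamic. On this website, though, you can more or less use the product counters to verify the number of results found.

Your best bet is to debug: use the logging module to print out information while scraping, then analyze the logs to find which products are missing and what caused them to be skipped.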
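
A minimal sketch of that kind of logging, assuming a hypothetical helper log_page that the crawler would call once per listing page with the links it extracted. Only the logging module calls are standard; the helper name and the message layout are made up for illustration.

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)
log = logging.getLogger("scraper")


def log_page(url, html_text, subcategory_links, product_links):
    """Record what one page produced so that gaps can be traced in the log later.

    Returns False when nothing was extracted, which marks the page as suspect.
    """
    log.info(
        "url=%s bytes=%d subcategories=%d products=%d",
        url, len(html_text), len(subcategory_links), len(product_links),
    )
    if not subcategory_links and not product_links:
        log.warning("no links extracted from %s", url)
        return False
    return True
```

Pages that trigger the warning are the first places to look: either the XPath did not match their layout or the response was not a real listing page.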

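Some of the ideas I currently have:

Could retry() be the problematic part? Could it be that session_.get(link).text raises no error but the response does not contain the actual data either?
I think the way you extract the category links is correct, and I do not see you missing any categories on the index page
dig_up_products() is questionable: when you extract the links to subcategories, you use the carouselSegment2b id in the XPath expression, but on at least some pages (like this one) I see the id value is carouselSegment1b. In any case, I would probably just use //h2[contains(., "CATEGORIES")]/following-sibling::div//li//a/@href here
I also do not like the imgWrapper class being used to find a product link (could products with missing images be getting skipped?). Why not just //ul[@id="prodResult"]/li//a/@href, although that would bring in some duplicates that you can handle separately. Alternatively, you could look for the link in the "info" section of the product container: //ul[@id="prodResult"]/li//div[contains(@class, "infoBox")]//a/@href.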
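
The effect of the XPath suggestions above can be checked on a small sample. The HTML below is a hypothetical fragment, not copied from the site: it uses the carouselSegment1b id and includes one product without an image wrapper. The XPath expressions are the ones from the question and from this answer.

```python
from lxml import html

# Hypothetical page: the carousel id is 1b (not 2b) and product 2 has no imgWrapper.
SAMPLE = """
<html><body>
  <h2>CATEGORIES</h2>
  <div id="carouselSegment1b"><ul><li><a href="/cat/hinges">Hinges</a></li></ul></div>
  <ul id="prodResult">
    <li>
      <div class="imgWrapper"><a href="/p/1"><img src="1.png"></a></div>
      <div class="infoBox"><a href="/p/1">Product 1</a></div>
    </li>
    <li>
      <div class="infoBox"><a href="/p/2">Product 2</a></div>
    </li>
  </ul>
</body></html>
"""
page = html.fromstring(SAMPLE)

# Subcategory links: the original id-specific XPath versus the relaxed one.
strict_subcats = page.xpath(
    '//h2[contains(., "CATEGORIES")]/following-sibling::*[@id="carouselSegment2b"]//li//a/@href'
)
relaxed_subcats = page.xpath(
    '//h2[contains(., "CATEGORIES")]/following-sibling::div//li//a/@href'
)

# Product links: the original imgWrapper XPath versus the two alternatives.
strict_products = page.xpath('//ul[@id="prodResult"]/li//div[@class="imgWrapper"]/a/@href')
broad_products = page.xpath('//ul[@id="prodResult"]/li//a/@href')
info_products = page.xpath(
    '//ul[@id="prodResult"]/li//div[contains(@class, "infoBox")]//a/@href'
)

# The broad XPath repeats links, so drop duplicates while keeping document order.
unique_products = list(dict.fromkeys(broad_products))
```

On this sample the strict expressions miss the subcategory and the second product, while the relaxed ones find both. Whether the real pages behave the same way still has to be verified against the site.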

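An anti-bot, anti-web-scraping strategy may also be deployed that temporarily bans your IP and/or User-Agent, or even obfuscates the responses. Check for that as well.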
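
One way to check for that, and for the earlier concern that retry() may return a page without real data, is to validate each response and cap the number of attempts. This is only a sketch: fetch_validated and looks_like_listing are hypothetical helpers, and the markers tested (the prodResult id and the CATEGORIES heading) are assumptions borrowed from the question's XPath expressions.

```python
import time


def looks_like_listing(body):
    """Heuristic: a real listing page should contain one of the expected markers."""
    return 'id="prodResult"' in body or "CATEGORIES" in body


def fetch_validated(url, fetch, is_valid=looks_like_listing,
                    max_attempts=3, wait_seconds=1.0, sleep=time.sleep):
    """Call fetch(url) until the body passes is_valid, up to max_attempts times."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            body = fetch(url)
        except Exception as exc:  # network failure: remember it and try again
            last_error = exc
        else:
            if is_valid(body):
                return body
            last_error = ValueError("response did not look like a real page")
        if attempt < max_attempts:
            sleep(wait_seconds * attempt)  # back off a little more each time
    raise RuntimeError(
        "giving up on {} after {} attempts".format(url, max_attempts)
    ) from last_error
```

With the question's code, fetch could be something like lambda u: session_.get(u).text. Unlike the recursive retry(), this version stops after a fixed number of attempts and reports pages that never returned usable content.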

【Comments】: