Scrapy or BeautifulSoup for a custom scraper? Answers

【Question title】: Scrapy or BeautifulSoup for a custom scraper?
【Posted】: 2018-01-24 17:11:43
【Question】:

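I need guidance on developing a crawler.

I need to build a custom scraper that retrieves all products from 3 e-commerce websites.

I built a PoC scraper with Scrapy, but this scraper has a flaw:

The scraper needs to crawl a given category down to crawl depth level 3 in order to reach and access the pages I need, which are at depth level 1.

For example, the crawl needs to follow this order:

Start: domain.com
domain.com/category/sub_categry/mini_sub_category
domain.com/product1 and domain.com/product2

The URLs for product1 and product2 can only be reached once depth level 2 (crawling the sub-category) has been reached.

My question is whether I can customize Scrapy to follow this order automatically, or whether I need to build a custom scraper with BeautifulSoup, feed it each sub_category manually, and let bs4 start scraping from there.

Here is my Scrapy code: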

import re

from bs4 import BeautifulSoup
from scrapy import Request
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class DomainsSpider(CrawlSpider):
    name = 'domains'
    allowed_domains = ['www.amazon.com']
    start_urls = ['http://www.amazon.com/']

    # follow every link on every page and pass each response to parse_items
    rules = [Rule(LinkExtractor(canonicalize=True, unique=True), follow=True, callback="parse_items")]

    def parse_items(self, response):
        # create the soup for the domain
        soup = BeautifulSoup(response.text, 'html.parser')
        # check if proxy is working: a page without a title is requested again
        if not soup.title or not soup.title.string:
            # explicit callback so the retried response comes back to this method
            yield Request(url=response.url, callback=self.parse_items, dont_filter=True)
            return

        # extract the title, dropping empty strings
        heading_1_raw = response.selector.xpath('//h1/text()').extract()
        heading_1 = [s.strip() for s in heading_1_raw if s.strip()]

        price_raw = response.selector.xpath('//p[contains(@class, "product-new-price")]//text()').extract()

        product_code_text = soup.find_all(string=re.compile("Cod produs"))

        yield {
            'url': response.url,
            'page_title': soup.title.string,
            # None when the page has no non-empty <h1>
            'h1': heading_1[0] if heading_1 else None,
            'price': price_raw,
            'product_code': product_code_text,
        }

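【Question comments】:

Tags: python beautifulsoup scrapy logic

【Solution 1】:

You can easily do what you want with Scrapy. You only need to give your CrawlSpider a list of rules that describes how the crawl should proceed.

Something as simple as this might do the job: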

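rules = [
    # no callback, so follow defaults to True: category pages are only crawled for more links
    Rule(LinkExtractor(allow=['/category/'])),
    # product pages go to parse_items; with a callback, follow defaults to False
    Rule(LinkExtractor(allow=['/product']), callback='parse_items')
]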

If you have trouble understanding or adapting this code, I suggest reading the Scrapy documentation on rules and link extractors.

Also, there is no need to use BeautifulSoup in your spider. The built-in parsing selectors can extract any data you want.

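【Comments】:

Thanks! I'll look into it and let you know.
Hi, I've been working on this, but I can't figure out the product URLs. I mean, if there are different products from different categories, how do I pass parameters so that product links are accepted and scraped? For example, how do I tell it to accept product links such as '/fossil-watch' and '/dell-laptop' when the product pages have nothing in common? I've been looking into it for days and still can't work it out.
It really depends on the specifics of the website. If the URLs share a common pattern, you can probably use allow, as in my example (it takes regular expressions). If the product links sit in similar places on the page, you can use restrict_xpaths or restrict_css. You can also combine these approaches with each other and with the other LinkExtractor parameters.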