Amazon web crawler tags issue with Python: answers

【Question title】: Amazon Web Crawler tags issue with python
【Posted】: 2018-01-19 11:49:48
【Question description】:

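I have the following problem. I am trying to scrape Amazon subcategories from this link: https://www.amazon.com/workout-clothes/b/ref=nav_shopall_sa_sp_athclg/151-4490025-2599936?ie=UTF8&node=11444071011 using the function begin_crawl(). How can I extract the subcategories from this link? Only the code after this line matters: subcategories = page.find_all("div", {"class": "mm-column"}). Is there another way to extract the subcategories from a category page? I get TypeError: 'NoneType' object is not callable. I have attached the full error output. Any help would be appreciated.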

def begin_crawl():

    with open(settings.start_file, "r") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank and commented out lines

            page, html = make_request(line)
            count = 0

            # look for subcategory links on this page
           
            subcategories = page.find_all("div", {"class": "mm-column"})  
            subcategories.extend(page.find_all("ul", {"class": "mm-category-list"}))  
            subcategories.extend(page.find("li"))
            sidebar = page.find("div", "a-col-left")

            if sidebar:
                subcategories.extend(sidebar.findAll("li"))  # left sidebar

            for subcategory in subcategories:
                link = subcategory.find("a")
                if not link:
                    continue
                link = link["href"]
                count += 1
                enqueue_url(link)

            log("Found {} subcategories on {}".format(count, line))

The error is:

Traceback (most recent call last):
  File "crawler.py", line 106, in <module>
    begin_crawl()  # put a bunch of subcategory URLs into the queue
  File "crawler.py", line 35, in begin_crawl
    subcategories = page.find_all("div", {"class": "mm-column"})  
TypeError: 'NoneType' object is not callable

【Question comments】:

Tags: python amazon

【Solution 1】:

The correct CSS selector for the subcategory links is

.bxc-grid__container a
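
A minimal sketch of how this selector might be applied, assuming the page is parsed with BeautifulSoup 4 (the `bs4` package), whose `select()` method accepts CSS selectors. The function name `extract_subcategory_links`, the `BASE_URL` constant, and the sample HTML with its hrefs are made up for illustration; the real page markup may differ and can change at any time.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Assumption: relative hrefs on the page are resolved against the site root.
BASE_URL = "https://www.amazon.com"


def extract_subcategory_links(html, base_url=BASE_URL):
    """Return absolute URLs of all <a> tags inside .bxc-grid__container."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    # The selector from the answer: any <a> nested in the grid widget.
    for anchor in soup.select(".bxc-grid__container a"):
        href = anchor.get("href")
        if href:  # skip anchors that carry no href attribute
            links.append(urljoin(base_url, href))
    return links


# Hypothetical markup that only imitates the structure described in the answer.
SAMPLE_HTML = """
<div class="bxc-grid__container">
  <a href="/golf/b?node=1"><img alt="Golf"></a>
  <a href="https://www.amazon.com/running/b?node=2">Running</a>
  <a>no href here</a>
</div>
<div class="other"><a href="/ignored">outside the grid</a></div>
"""

if __name__ == "__main__":
    for url in extract_subcategory_links(SAMPLE_HTML):
        print(url)
```

In begin_crawl(), each returned URL could then be passed to enqueue_url(). As for the TypeError itself: one likely cause is that page was created by the older BeautifulSoup 3, which only provides findAll. There, an unknown attribute such as find_all is treated as a lookup for a child tag of that name and evaluates to None, so calling it fails with 'NoneType' object is not callable. If that is the case here, building the page with bs4 should make both find_all and select available.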

【Comments】:

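Where is .bxc-grid__container? In the header section?
It is in the body, in the widget that links to subcategories such as golf... just search for it in the page source.
Thanks! I understand now.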