[Question title]: Amazon Web Crawler tags issue with python
[Posted]: 2018-01-19 11:49:48
[Question description]:

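I have the following problem. I am trying to scrape the Amazon subcategories from this link: https://www.amazon.com/workout-clothes/b/ref=nav_shopall_sa_sp_athclg/151-4490025-2599936?ie=UTF8&node=11444071011 using the function begin_crawl(). How can I extract the subcategories from this link? Only the code after this line matters: subcategories = page.find_all("div", {"class": "mm-column"}). Is there another way to extract the subcategories from a category? I get TypeError: 'NoneType' object is not callable. The code and the full error are below. I would appreciate any help.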

def begin_crawl():

    with open(settings.start_file, "r") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank and commented out lines

            page, html = make_request(line)
            count = 0

            # look for subcategory links on this page
            subcategories = page.find_all("div", {"class": "mm-column"})
            subcategories.extend(page.find_all("ul", {"class": "mm-category-list"}))
            # find() returns a single tag (or None); extend() needs the list from find_all()
            subcategories.extend(page.find_all("li"))
            sidebar = page.find("div", "a-col-left")

            if sidebar:
                subcategories.extend(sidebar.find_all("li"))  # left sidebar

            for subcategory in subcategories:
                link = subcategory.find("a")
                if not link:
                    continue
                link = link["href"]
                count += 1
                enqueue_url(link)

            log("Found {} subcategories on {}".format(count, line))

The error is:

Traceback (most recent call last):
  File "crawler.py", line 106, in <module>
    begin_crawl()  # put a bunch of subcategory URLs into the queue
  File "crawler.py", line 35, in begin_crawl
    subcategories = page.find_all("div", {"class": "mm-column"})  
TypeError: 'NoneType' object is not callable

[Question comments]:

    Tags: python amazon


    [Solution 1]:

    The correct selector for the subcategories is

    .bxc-grid__container a
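
As a minimal sketch of how that selector might be applied, assuming `page` is a BeautifulSoup 4 object (bs4), the links can be collected with `select()`. The sample HTML, the helper name `extract_subcategory_links`, and the base URL handling are illustrative assumptions, not part of the original crawler, and the real Amazon markup may differ.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the category page.
SAMPLE_HTML = """
<div class="bxc-grid__container">
  <a href="/golf/b?node=1">Golf</a>
  <a href="/running/b?node=2">Running</a>
  <a>No link here</a>
</div>
<div class="other"><a href="/ignored">Ignored</a></div>
"""


def extract_subcategory_links(page, base_url):
    """Return absolute URLs for every link inside .bxc-grid__container."""
    links = []
    # select() takes a CSS selector and always returns a list (possibly empty).
    for anchor in page.select(".bxc-grid__container a"):
        href = anchor.get("href")  # None when the <a> has no href attribute
        if href:
            links.append(urljoin(base_url, href))
    return links


page = BeautifulSoup(SAMPLE_HTML, "html.parser")
links = extract_subcategory_links(page, "https://www.amazon.com")
print(links)
```

In begin_crawl(), each returned URL could then be passed to enqueue_url() in place of the loop over `subcategories`.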
    

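    [Comments]:

    • Where is .bxc-grid__container? In the header section?
    • It is in the body, in the widget that links to subcategories such as Golf... just search for it in the page source
    • Thanks! I understand now.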