Scrapy runs into endless loop due to process_links (answers)

【Question title】: Scrapy runs into endless loop due to process_links
【Posted】: 2020-06-06 20:29:39
【Question】:

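I am using Scrapy 2.1.0 and want to add parameters to every request via link_filtering. This works, but I run into an endless loop, apparently because the duplicate filter is affected.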

rules = (
    Rule(
        LinkExtractor(
            # raw string so the regex escapes are not read as string escapes
            allow=[r'^(example)?\/(?!ratgeber)[a-z-]+\/(\?p=\d+)?$'],
            restrict_xpaths=[
                '//div[@class="sidebar--categories-navigation"]',  # only navigation panel
                '//div[contains(@class,"panel--paging")]/a',  # include pagination
            ],
        ),
        follow=True,
        process_links='link_filtering',
        callback='parse_item',
    ),
)

The link filtering I added:

# get max amount of results per category and add n=x results to url
def link_filtering(self, links):
    for link in links:
        if '?' not in link.url:  # add all parameters if there are none
            link.url = "%s?p=1&followSearch=10000&o=1&n=1000" % link.url
        else:  # add max amount of results to pagination
            link.url = "%s&followSearch=10000&o=1&n=1000" % link.url
    return links

The crawler keeps crawling the same URLs over and over. How can I prevent this and still keep the added parameters?

【Comments】:

Tags: scrapy

【Answer 1】:

from w3lib.url import canonicalize_url

and then

link.url = canonicalize_url(link.url)

Does this help?

Keep the original return statement as it is.
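
To illustrate the idea behind normalizing the rewritten URL, here is a minimal sketch using only the standard library. It sets the parameters instead of appending them, so rewriting an already rewritten URL gives the same string again. The helper name set_query_params and the example.com URLs are made up for illustration, and the thread does not confirm whether this resolves the loop in the question.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Parameter values copied from the question's link_filtering method.
EXTRA_PARAMS = {"followSearch": "10000", "o": "1", "n": "1000"}


def set_query_params(url, extra=EXTRA_PARAMS, default_page="1"):
    """Hypothetical helper: return url with each extra parameter set exactly once."""
    parts = urlsplit(url)
    # dict() keeps only the last value of a repeated key, which is fine for
    # the simple ?p=N links in the question but not for multi-valued params.
    query = dict(parse_qsl(parts.query, keep_blank_values=True))
    query.setdefault("p", default_page)  # add p=1 only when no page is given
    query.update(extra)                  # overwrite rather than append again
    # Sorting makes the result independent of the original parameter order.
    new_query = urlencode(sorted(query.items()))
    return urlunsplit(parts._replace(query=new_query))


once = set_query_params("https://example.com/shoes/?p=2")
twice = set_query_params(once)
print(once)
print(once == twice)
```

Inside link_filtering this would replace the two string-formatting branches with a single link.url = set_query_params(link.url). Note one difference from the question's code: p=1 is added whenever p is missing, not only when the URL has no query string at all.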

【Comments】:

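Isn't that exactly what my link_filtering method already does?
Use a generator instead of return; it is not quite the same. You then yield each valid link individually rather than returning a list of links.
I did not check all of your code, I only noticed this part.
That seems to give the same result? I replaced the return with "yield link" and it is still stuck in the loop.
If the change to canonicalize_url does not work, I am out of ideas for now.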