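【Posted】: 2020-06-06 20:29:39
【Question】:
I am using Scrapy 2.1.0 and want to add parameters to every request through a link_filtering method passed as process_links. This works, but I end up in an infinite loop, apparently because the duplicate filter is affected.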
rules = (
    Rule(
        LinkExtractor(
            allow=[r'^(example)?\/(?!ratgeber)[a-z-]+\/(\?p=\d+)?$'],
            restrict_xpaths=[
                '//div[@class="sidebar--categories-navigation"]',  # only the navigation panel
                '//div[contains(@class,"panel--paging")]/a',  # include pagination
            ],
        ),
        follow=True,
        process_links='link_filtering',
        callback='parse_item',
    ),
)
The link_filtering method:
# get max amount of results per category and add n=x results to url
def link_filtering(self, links):
    for link in links:
        if re.match(r'.*\?.*', link.url) is None:  # add all parameters if there are none
            link.url = "%s?p=1&followSearch=10000&o=1&n=1000" % link.url
        else:  # add max amount of results to pagination
            link.url = "%s&followSearch=10000&o=1&n=1000" % link.url
    return links
The crawler keeps crawling the same URLs over and over. How can I prevent this while keeping the added parameters?
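The cause of the loop is not established above. One thing worth ruling out is that plain string concatenation can append the same parameters more than once, or produce differently ordered query strings for what is really the same page. The following is a minimal sketch, not a confirmed fix: it merges the parameters with the standard library's urllib.parse so that the rewrite is idempotent. The helper name add_query_params and the EXTRA_PARAMS constant are hypothetical, and the parameter values are simply copied from the code above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical defaults, mirroring the values used in link_filtering above.
EXTRA_PARAMS = {"followSearch": "10000", "o": "1", "n": "1000"}


def add_query_params(url, extra=EXTRA_PARAMS, default_page="1"):
    """Return url with the extra parameters merged in exactly once."""
    parts = urlsplit(url)
    # dict() keeps only the last value of a repeated key, which is
    # acceptable here because each parameter is expected once.
    params = dict(parse_qsl(parts.query, keep_blank_values=True))
    params.setdefault("p", default_page)  # keep an existing page number
    params.update(extra)                  # overwrite instead of appending
    # Sorting gives every logical page a single, stable URL string.
    query = urlencode(sorted(params.items()))
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, query, parts.fragment)
    )
```

Inside link_filtering, each link could then be rewritten with link.url = add_query_params(link.url). Applying the helper to a URL it has already rewritten returns the same string, so a link that comes back from a later page with the parameters already present is not modified a second time.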
【Discussion】:
Tags: scrapy