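Are you sure the actual href value is that one? It looks like it might be generated by JavaScript.
You can run scrapy shell "http://website/page?foo&bar" to inspect the page and experiment with the allow/deny arguments. You can also test a link extractor against arbitrary HTML to see how it behaves: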
In [1]: html = """
...: <a href="http://domain.tld/go.php?key=value">go</a>
...: <a href="/go.php?key=value2">go2</a>
...: <a href="/index.html">index</a>
...: """
In [2]: from scrapy.http import HtmlResponse
In [3]: response = HtmlResponse('http://example.com/', body=html)
In [4]: from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
In [5]: lx = SgmlLinkExtractor()
In [6]: lx.extract_links(response)
Out[6]:
[Link(url='http://domain.tld/go.php?key=value', text=u'go', fragment='', nofollow=False),
Link(url='http://example.com/go.php?key=value2', text=u'go2', fragment='', nofollow=False),
Link(url='http://example.com/index.html', text=u'index', fragment='', nofollow=False)]
In [8]: SgmlLinkExtractor(allow='go\.php').extract_links(response)
Out[8]:
[Link(url='http://domain.tld/go.php?key=value', text=u'go', fragment='', nofollow=False),
Link(url='http://example.com/go.php?key=value2', text=u'go2', fragment='', nofollow=False)]
In [9]: SgmlLinkExtractor(deny='go\.php').extract_links(response)
Out[9]: [Link(url='http://example.com/index.html', text=u'index', fragment='', nofollow=False)]
In [10]: SgmlLinkExtractor(allow=('key=', 'index'), deny=('value2', )).extract_links(response)
Out[10]:
[Link(url='http://domain.tld/go.php?key=value', text=u'go', fragment='', nofollow=False),
Link(url='http://example.com/index.html', text=u'index', fragment='', nofollow=False)]
In [11]: SgmlLinkExtractor(allow='domain\.tld').extract_links(response)
Out[11]: [Link(url='http://domain.tld/go.php?key=value', text=u'go', fragment='', nofollow=False)]
In [12]: SgmlLinkExtractor(allow='example\.com').extract_links(response)
Out[12]:
[Link(url='http://example.com/go.php?key=value2', text=u'go2', fragment='', nofollow=False),
Link(url='http://example.com/index.html', text=u'index', fragment='', nofollow=False)]
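
If Scrapy isn't at hand, the allow/deny behaviour seen in the session above can be approximated with the standard library alone. This is only a rough sketch of the semantics as I understand them, not Scrapy's actual implementation: `filter_links` is a hypothetical helper, and it assumes that each pattern is searched (not anchored) against the absolute URL, that an empty `allow` accepts everything, and that `deny` takes precedence.

```python
import re
from urllib.parse import urljoin


def filter_links(base_url, hrefs, allow=(), deny=()):
    """Hypothetical helper: keep URLs matching any allow pattern and no deny pattern."""
    # Accept either a single pattern string or a tuple of patterns.
    if isinstance(allow, str):
        allow = (allow,)
    if isinstance(deny, str):
        deny = (deny,)
    allow_res = [re.compile(p) for p in allow]
    deny_res = [re.compile(p) for p in deny]

    kept = []
    for href in hrefs:
        # Resolve relative hrefs first, so patterns see the absolute URL.
        url = urljoin(base_url, href)
        # An empty allow list accepts every URL (assumption stated above).
        if allow_res and not any(r.search(url) for r in allow_res):
            continue
        # A deny match wins even when an allow pattern also matched.
        if any(r.search(url) for r in deny_res):
            continue
        kept.append(url)
    return kept


base = "http://example.com/"
hrefs = ["http://domain.tld/go.php?key=value", "/go.php?key=value2", "/index.html"]

print(filter_links(base, hrefs, allow=r"go\.php"))
print(filter_links(base, hrefs, deny=r"go\.php"))
print(filter_links(base, hrefs, allow=("key=", "index"), deny=("value2",)))
```

Because the patterns are matched against the resolved URL, a relative href such as /go.php?key=value2 can be selected by a pattern on the page's own domain, which is what the last example in the session shows.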