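【Posted】:2014-02-16 16:09:02
【Question】:
I am trying to create a spider that fetches all URLs from a domain and builds a single record containing the domain name together with all the headers (the h1 headings) found on that domain's URLs. This is a continuation of a previous question.
I managed to get help and understood that I need to use an Item Pipeline in the Scrapy framework to achieve this. In the item pipeline I create a dictionary/hash that stores the domain name and appends all the headers to it.
The error I receive is: unhashable type: 'list'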
spider.py
class MySpider(CrawlSpider):
    name = 'Webcrawler'
    allowed_domains = ['web.aitp.se']
    start_urls = ['http://web.aitp.se/']
    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(SgmlLinkExtractor(), callback='parse_item'),
    )

    def parse_item(self, response):
        domain = response.url.split("/")[2]
        xpath = HtmlXPathSelector(response)
        loader = XPathItemLoader(item=WebsiteItem(), response=response)
        loader.add_value('domain', domain)
        loader.add_xpath('h1', "//h1/text()")
        yield loader.load_item()
pipelines.py
# Define your item pipelines here
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
from scrapy.exceptions import DropItem
from scrapy.http import Request
from Prospecting.items import WebsiteItem
from collections import defaultdict

class DomainPipeline(object):
    global Accumulator
    Accumulator = defaultdict(list)

    def process_item(self, item, spider):
        Accumulator[ item['domain'] ].append( item['h1'] )

    def close_spider(spider):
        yield Accumulator.items()
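A minimal sketch of how the pipeline above might be restructured, written for Python 3 and offered as one possible approach rather than a confirmed fix. It assumes that the item loader stores each field as a list (so `item['domain']` arrives as `['web.aitp.se']` rather than a plain string) and that `h1`, when present, is also a list. The per-instance `accumulator` attribute and the `print` in `close_spider` are illustrative choices that are not in the original code.

```python
from collections import defaultdict


class DomainPipeline(object):
    """Collects every h1 heading seen for each domain."""

    def __init__(self):
        # Per-instance accumulator instead of a module-level global (assumption).
        self.accumulator = defaultdict(list)

    def process_item(self, item, spider):
        domain = item['domain']
        # Assumption: the loader wrapped the value in a list. A list cannot be
        # used as a dict key, so take its first element.
        if isinstance(domain, (list, tuple)):
            domain = domain[0]
        # 'h1' may be absent when a page has no <h1>; extend keeps the list flat.
        self.accumulator[domain].extend(item.get('h1', []))
        # Returning the item lets any later pipeline stage still receive it.
        return item

    def close_spider(self, spider):
        # Illustrative reporting step: one line per domain with all its headings.
        for domain, headings in self.accumulator.items():
            print(domain, headings)
```

Two further differences from the original are worth noting: `process_item` returns the item, and `close_spider` takes `self` as its first parameter and reports directly instead of using `yield`.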
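I tried to work around the problem by simply reading the domains and headers from a CSV file and merging them into a single record per domain, and that works fine.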
from collections import defaultdict

Accumulator = defaultdict(list)
companies = open('test.csv', 'r')
for line in companies:
    fields = line.split(',')
    Accumulator[ fields[0] ].append(fields[1])
print Accumulator.items()
【Comments】:
-
What is the question?
-
How do I get rid of the error unhashable type: 'list'?
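The error can be reproduced without Scrapy, which may help narrow down where it comes from. The sketch below assumes that the loaded item holds each field as a list, which is how an item loader behaves when no output processor is set. The dict named `item` is a stand-in for the real `WebsiteItem`, and the heading text is made up for illustration.

```python
from collections import defaultdict

accumulator = defaultdict(list)

# Stand-in for a loaded item: every field is a list (assumption).
item = {'domain': ['web.aitp.se'], 'h1': ['Welcome']}

try:
    # Same expression as in the pipeline: the key is a list, which is not hashable.
    accumulator[item['domain']].append(item['h1'])
    error = None
except TypeError as exc:
    error = str(exc)

# Using the first element (a plain string) as the key works.
accumulator[item['domain'][0]].extend(item['h1'])
```

If that assumption holds, the CSV version succeeds because `fields[0]` is a string, whereas `item['domain']` is a list. An alternative to indexing with `[0]` would be to have the loader return a single value for `domain`, for example with Scrapy's `TakeFirst` output processor.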