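【Posted】: 2015-11-23 16:02:37
【Question】:
Help! Please read the following Scrapy code and the crawler's result. I want to scrape some data from http://china.fathom.info/data/data.json, and only Scrapy is allowed. But I don't know how to control the order of the yields. I expected all the parse_member requests to be processed inside the loop before group_item is returned, but it seems that yield item is always executed before yield request.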
    start_urls = [
        "http://china.fathom.info/data/data.json"
    ]

    def parse(self, response):
        groups = json.loads(response.body)['group_members']
        for i in groups:
            group_item = GroupItem()
            group_item['name'] = groups[i]['name']
            group_item['chinese'] = groups[i]['chinese']
            group_item['members'] = []

            members = groups[i]['members']
            for member in members:
                yield Request(self.person_url % member['id'],
                              meta={'group_item': group_item, 'member': member},
                              callback=self.parse_member, priority=100)
            yield group_item

    def parse_member(self, response):
        group_item = response.meta['group_item']
        member = response.meta['member']
        person = json.loads(response.body)
        ego = person['ego']
        group_item['members'].append({
            'id': ego['id'],
            'name': ego['name'],
            'chinese': ego['chinese'],
            'role': member['role']
        })
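【Comments】:
-
Maybe move yield group_item from parse() to parse_member()
-
What do you mean by "it seems that yield item is always executed before yield request"? Maybe you see the item printed to the console first, and only afterwards the response to the request is received? In that case, that is expected.
-
@furas I tried moving yield group_item to parse_member(), but the result is {'A':1, 'members':[{'id':11}]}, {'A':1, 'members':[{'id':22}]} instead of {'A':1, 'members':[{'id':11}, {'id':22}]}, and I don't know how to solve it.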
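One possible way around this, sketched under assumptions rather than taken from the thread: carry a counter of outstanding member requests next to group_item, and yield the item from parse_member() only when the last response has been handled. The snippet below is a pure-Python simulation, not Scrapy code. FakeRequest, crawl, the "state" dictionary and the sample groups are all hypothetical names invented for illustration; in a real spider the shared state would travel in Request.meta, as in the question.

```python
from collections import deque


class FakeRequest:
    """Hypothetical stand-in for scrapy.Request: payload, callback and meta."""

    def __init__(self, payload, callback, meta):
        self.payload = payload
        self.callback = callback
        self.meta = meta


def parse(groups):
    """Mimics parse(): queue one request per member, do not yield the item yet."""
    for group in groups:
        item = {"name": group["name"], "members": []}
        # Shared by every request of this group, like Request.meta in Scrapy.
        state = {"item": item, "pending": len(group["members"])}
        if state["pending"] == 0:
            yield item  # no member requests to wait for
        for member in group["members"]:
            yield FakeRequest(member, parse_member, {"state": state})


def parse_member(request):
    """Mimics parse_member(): add one member, emit the item after the last one."""
    state = request.meta["state"]
    state["item"]["members"].append({"id": request.payload["id"]})
    state["pending"] -= 1
    if state["pending"] == 0:
        yield state["item"]


def crawl(groups):
    """Toy engine: queued requests run only after parse() has been consumed."""
    queue = deque()
    items = []

    def consume(results):
        for result in results:
            if isinstance(result, FakeRequest):
                queue.append(result)
            else:
                items.append(result)

    consume(parse(groups))
    while queue:
        request = queue.popleft()
        consume(request.callback(request))
    return items


# Made-up sample data shaped like the result quoted in the comment above.
groups = [
    {"name": "A", "members": [{"id": 11}, {"id": 22}]},
    {"name": "B", "members": [{"id": 33}]},
]
items = crawl(groups)
print(items)
```

With this approach each group is emitted exactly once, with all of its members collected. A caveat for a real spider: a member request that fails never reaches the callback, so the counter would stay above zero and the item would never be emitted; an errback on the Request that also decrements the counter would probably be needed.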
Tags: python web-crawler scrapy scrapy-spider