Dropping duplicate item values in a Scrapy pipeline: answers

【Question title】: Dropping duplicate item value in Scrapy pipeline
【Posted】: 2015-11-24 05:31:01
【Question description】:

I have some results stored in a .json file in this format:

(one item per line)

{"category": ["ctg1"], "pages": 3, "websites": ["x1.com","x2.com","x5.com"]}
{"category": ["ctg2"], "pages": 2, "websites": ["x1.com", "d4.com"]}
                    .
                    .

I am trying to remove the duplicate values without dropping the whole item, but without success.

Code:

import scrapy
import json
import codecs
from scrapy.exceptions import DropItem

class ResultPipeline(object):

    def __init__(self):
        self.ids_seen = set()
        self.file = codecs.open('results.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        for sites in item['websites']:
            if sites in self.ids_seen:
                raise DropItem("Duplicate item found: %s" % sites)
            else:
                self.ids_seen.add(sites)
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item

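    # Scrapy calls close_spider() on item pipelines automatically;
    # a method named spider_closed() would need to be connected to a signal.
    def close_spider(self, spider):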
        self.file.close()

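【Comments on the question】:

You can't remove it inside the `for sites in item` loop. You could build a list of the duplicates and remove them outside that loop. Alternatively, you could make the websites container a set instead of a list. You could also use an OrderedDict, as shown here: stackoverflow.com/questions/12878833/…
Still nothing. I have tried almost everything in those links. I believe it can't be done this way; maybe I have to try something different. Your answer is very useful though, thank you.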
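
For reference, the OrderedDict idea from the first comment can be sketched as below. The helper name `dedupe_keep_order` is made up for illustration. Note that this only removes repeats inside a single list; on its own it does not remove a site that already appeared in an earlier item.

```python
from collections import OrderedDict


def dedupe_keep_order(values):
    """Return the values with repeats removed, keeping first-seen order.

    Hypothetical helper, not part of Scrapy.
    """
    # OrderedDict keys are unique and remember insertion order.
    return list(OrderedDict.fromkeys(values))


sites = ["x1.com", "x2.com", "x1.com", "x5.com"]
print(dedupe_keep_order(sites))  # ['x1.com', 'x2.com', 'x5.com']
```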

Tags: python web-crawler scrapy

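【Solution 1】:

Instead of dropping the item that contains a duplicate, rebuild its list of sites so that it keeps only the sites not already in the ids_seen set. The sample code below should work, although it is not in your class structure.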

import json


line1 = '{"category": ["ctg1"], "pages": 3, "websites": ["x1.com","x2.com","x5.com"]}'
line2 = '{"category": ["ctg2"], "pages": 2, "websites": ["x1.com", "d4.com"]}'

lines = (line1, line2)

ids_seen = set()

def process_item(item):
    # Collect only the sites that no earlier item has contained
    item_unique_sites = []
    for site in item['websites']:
        if site not in ids_seen:
            ids_seen.add(site)
            item_unique_sites.append(site)
    # Replace the list so the duplicates are gone but the item is kept
    item['websites'] = item_unique_sites
    line = json.dumps(dict(item), ensure_ascii=False) + "\n"
    print(line)
    #self.file.write(line)
    return item


for line in lines:
    json_data = json.loads(line)
    process_item(json_data)
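
If you want the same logic inside a pipeline class, a minimal sketch could look like the following. The class name `UniqueSitesPipeline` and the `path` argument are assumptions made for illustration; `open_spider`, `close_spider` and `process_item` are the hooks Scrapy calls on an item pipeline. It is written for Python 3 and has not been run inside a real crawl.

```python
import json


class UniqueSitesPipeline(object):
    """Hypothetical pipeline: keeps each website only the first time it is seen."""

    def __init__(self, path='results.json'):
        # 'path' is an assumed parameter so the output file can be changed.
        self.path = path
        self.ids_seen = set()
        self.file = None

    def open_spider(self, spider):
        # Called by Scrapy when the spider starts.
        self.file = open(self.path, 'w', encoding='utf-8')

    def close_spider(self, spider):
        # Called by Scrapy when the spider finishes.
        self.file.close()

    def process_item(self, item, spider):
        unique_sites = []
        for site in item['websites']:
            if site not in self.ids_seen:
                self.ids_seen.add(site)
                unique_sites.append(site)
        # Keep the item, but only with sites no earlier item contained.
        item['websites'] = unique_sites
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item
```

With the two sample lines from the question, the second item would be written with "websites": ["d4.com"], because x1.com was already seen in the first item. An item whose sites were all seen before ends up with an empty list; whether to drop such an item is a separate decision.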
