【Question title】: Pipeline to remove None values
【Posted】: 2018-08-12 21:37:12
【Question description】:

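My spider yields certain data, but sometimes it cannot find that data. Instead of adding a condition like the following for every field: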

if response.xpath('//div[@id="mitten"]//h1/text()').extract_first():
    result['name'] = response.xpath('//div[@id="mitten"]//h1/text()').extract_first()

I would rather solve this in my pipeline by removing every field whose value is None. I tried to do that with the following code:

import datetime
import re

from scrapy.exceptions import DropItem


class BasicPipeline(object):
    """ Basic pipeline for scrapers """

    def __init__(self):
        self.seen = set()

    def process_item(self, item, spider):
        item = dict((k,v) for k,v in item.iteritems() if v is not None)

        item['date'] = datetime.date.today().strftime("%d-%m-%y")
        for key, value in item.iteritems():
            if isinstance(value, basestring):
                item[key] = value.strip() # strip every value of the item

        # If an address is a list, convert it to a string
        if "address" in item:
            if isinstance(item['address'], list): # check if address is a list
                item['address'] = u", ".join(line.strip() for line in item['address'] if len(line.strip()) > 0)

        # Determine the currency of the price if possible
        if "price" in item:
            if u'€' in item['price'] or 'EUR' in item['price']:
                item['currency'] = 'EUR'
            elif u'$' in item['price'] or 'USD' in item['price']:
                item['currency'] = 'USD'

        # Extract e-mails from text
        if "email" in item:
            if isinstance(item['email'], list): # check if email is a list
                item['email'] = u" ".join(line.strip() for line in item['email']) # convert to a string
            regex = r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+"
            item['email'] = u";".join(line.strip() for line in re.findall(regex, item['email']))
            if "mailto:" in item['email']:
                item['email'] = item['email'].replace("mailto:","")

        if "phone" in item or "email" in item:
            return item
        else:
            DropItem("No contact details: %s" %item)

However, this results in an error:

2018-03-05 10:11:03 [scrapy] ERROR: Error caught on signal handler: <bound method ?.item_scraped of <scrapy.extensions.feedexport.FeedExporter object at 0x103c14dd0>>
Traceback (most recent call last):
  File "/Users/casper/Documents/crawling/env/lib/python2.7/site-packages/twisted/internet/defer.py", line 150, in maybeDeferred
    result = f(*args, **kw)
  File "/Users/casper/Documents/crawling/env/lib/python2.7/site-packages/scrapy/xlib/pydispatch/robustapply.py", line 57, in robustApply
    return receiver(*arguments, **named)
  File "/Users/casper/Documents/crawling/env/lib/python2.7/site-packages/scrapy/extensions/feedexport.py", line 193, in item_scraped
    slot.exporter.export_item(item)
  File "/Users/casper/Documents/crawling/env/lib/python2.7/site-packages/scrapy/exporters.py", line 184, in export_item
    self._write_headers_and_set_fields_to_export(item)
  File "/Users/casper/Documents/crawling/env/lib/python2.7/site-packages/scrapy/exporters.py", line 199, in _write_headers_and_set_fields_to_export
    self.fields_to_export = list(item.fields.keys())
AttributeError: 'NoneType' object has no attribute 'fields'

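I think this has to do with the fact that an item was yielded to the pipeline but nothing was returned in the end, but that is only a guess.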

At the moment the pipeline contains statements such as:

if "website" in item:
    # Do stuff

I would like to avoid adding unnecessary extra statements that check whether a value is None.

【Question discussion】:

    Tags: python python-2.7 scrapy scrapy-pipeline


    【Solution 1】:

    Your current code would probably work if you returned the item you create:

    def process_item(self, item, spider):
        item = dict((k,v) for k,v in item.iteritems() if v is not None)
        return item
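
The filtering step can be tried in isolation. The following is a minimal sketch, assuming the item behaves like a plain dict; the helper name `remove_none_fields` is hypothetical, and `.items()` is used instead of `.iteritems()` so that it runs on both Python 2 and Python 3:

```python
def remove_none_fields(item):
    """Return a new dict holding only the fields whose value is not None."""
    # Only None is filtered out; empty strings, 0 and empty lists are kept.
    return dict((key, value) for key, value in item.items() if value is not None)


# Hypothetical scraped data: "name" was not found on the page.
scraped = {"name": None, "phone": "0123", "address": []}
cleaned = remove_none_fields(scraped)
print(cleaned)
```

Note that the result is a plain dict rather than the original item object, so later checks such as `if "website" in item:` work on it directly.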
    

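    That said, I strongly recommend using item loaders in your spider;
    not creating fields for empty data is just one of their many benefits.

    Edit:

    Now that you have included the full pipeline code, I can see that the error is in the last line.
    Your code creates an exception object, discards it, and returns None; the DropItem exception has to be raised: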

    raise DropItem("No contact details: %s" % item)
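
The difference between creating and raising the exception can be sketched as follows. This is a reduced version of the pipeline that keeps only the contact-details check; the class name `ContactPipeline` is hypothetical, and the fallback `DropItem` class is an assumption added only so that the sketch also runs where Scrapy is not installed:

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Assumption: stand-in so the sketch runs without Scrapy installed.
    class DropItem(Exception):
        pass


class ContactPipeline(object):
    """Hypothetical reduced pipeline: strip None fields, require contact details."""

    def process_item(self, item, spider):
        item = dict((k, v) for k, v in item.items() if v is not None)
        if "phone" in item or "email" in item:
            return item
        # Raising signals the drop. Merely calling DropItem(...) would fall
        # through, and the method would implicitly return None.
        raise DropItem("No contact details: %s" % item)


pipeline = ContactPipeline()

# An item with a phone number is returned with its None fields removed.
kept = pipeline.process_item({"name": "A", "phone": "0123", "email": None}, spider=None)
print(kept)

# An item without contact details raises DropItem instead of returning None.
try:
    pipeline.process_item({"name": "B", "phone": None}, spider=None)
    dropped = False
except DropItem:
    dropped = True
print(dropped)
```

With the exception raised, `process_item` either returns an item or raises, so None is never passed on to the feed exporter.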
    

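    【Discussion】:

    • That is already there. My def process_item continues after the first line, so I am already returning it. (Updated the question to reflect this.)
    • Hmm, None still finds its way into the item exporter somehow. Without seeing all the relevant code, it is hard to say exactly why.
    • The full pipeline has now been added to the question.
    • Updated the answer.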