If-statement not working in Scrapy: answers

【Question title】: If-statement not working in Scrapy
【Posted】: 2014-04-14 07:40:43
【Question description】:

I have built a crawler with Scrapy that crawls a sitemap and scrapes the required components from every link in the sitemap.

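from scrapy.selector import Selector
from scrapy.spiders import SitemapSpider  # scrapy.contrib.spiders in 2014-era Scrapy releases

from myproject.items import MyItem  # assumed location of the item class


class MySpider(SitemapSpider):
    name = "functie"
    allowed_domains = ["xyz.nl"]
    sitemap_urls = ["http://www.xyz.nl/sitemap.xml"]

    def parse(self, response):
        item = MyItem()
        sel = Selector(response)

        item['url'] = response.url
        # extract() always returns a list of strings, one per matched node
        item['h1'] = sel.xpath("//h1[@class='no-bd']/text()").extract()
        item['jobtype'] = sel.xpath('//input[@name=".Keyword"]/@value').extract()
        item['count'] = sel.xpath('//input[@name="Count"]/@value').extract()
        item['location'] = sel.xpath('//input[@name="Location"]/@value').extract()
        yield item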

item['location'] can be empty in some cases. In that particular case, I want to scrape a different component and store it in item['location'] instead. The code I tried is:

item['location'] = sel.xpath('//input[@name="Location"]/@value').extract()
if not item['location']:
    item['location'] = sel.xpath('//a[@class="location"]/text()').extract()

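But the if condition does not seem to be checked: when the value of the location input field is empty, the item still comes back empty. Any help would be much appreciated.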

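【Question comments】:

Are you sure the condition is not being evaluated, or could the second sel.xpath also be returning a 'null' value? Have you checked, for example by putting a print statement inside the if block? Also, what exactly is that "empty value"?
.extract() returns a list. A list containing a single empty string evaluates to True.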
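
The point made in the last comment can be checked in plain Python, without Scrapy. The sketch below assumes that an input with value="" yields [''] from extract(), as the comment describes; has_value is a hypothetical helper, not part of Scrapy.

```python
def has_value(values):
    """Return True if the list holds at least one non-blank string."""
    return any(v.strip() for v in values)


no_match = []      # what extract() returns when no node matches
empty_attr = [""]  # assumed result for an attribute that exists but is empty

print(bool(no_match))             # False: an empty list is falsy
print(bool(empty_attr))           # True: the list has one element, so `not` fails
print(has_value(empty_attr))      # False: the only element is blank
print(has_value(["Amsterdam"]))   # True
```

With a check like this, `if not has_value(item['location']):` would trigger the fallback for both the empty list and the list holding only an empty string.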

Tags: python if-statement scrapy web-crawler

【Solution 1】:

You may want to check the length of item['location'] instead.

item['location'] = sel.xpath('//input[@name="Location"]/@value').extract()
if len(item['location']) < 1:
    item['location'] = sel.xpath('//a[@class="location"]/text()').extract()

Anyway, have you considered combining the two XPaths with |?

item['location'] = sel.xpath('//input[@name="Location"]/@value | //a[@class="location"]/text()').extract()
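
When the two XPaths are combined, the resulting list may still contain the empty string from the input field alongside the link text. A minimal sketch of picking the first usable value from such a list follows; first_non_blank is a hypothetical helper and the sample list is made up for illustration.

```python
def first_non_blank(values, default=None):
    """Return the first string that is not empty after stripping whitespace."""
    for value in values:
        cleaned = value.strip()
        if cleaned:
            return cleaned
    return default


# Illustrative result of the combined XPath: empty input value, then link text
combined = ["", " Amsterdam "]

print(first_non_blank(combined))                   # Amsterdam
print(first_non_blank([""], default="unknown"))    # unknown
```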

【Comments】:

【Solution 2】:

Try it this way:

# extract() returns a list, so compare against a list holding one empty string
if item['location'] == [""]:
    item['location'] = sel.xpath('//a[@class="location"]/text()').extract()

【Comments】:

【Solution 3】:

I think what you are trying to achieve is best solved with a custom item pipeline.

1) Open pipelines.py and check the required if condition in the Pipeline class:

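class LocPipeline(object):
    def process_item(self, item, spider):
        # item.get() returns None for a missing key, so this branch covers
        # both a missing "location" and an empty list
        if not item.get("location"):
            # fall back to the alternative XPath
            # NOTE: `sel` is not defined inside a pipeline; process_item() only
            # receives the item and the spider, so the spider has to supply it
            item['location'] = sel.xpath('//a[@class="location"]/text()').extract()
        # if a location was already found, the item is left unchanged

        return item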

2) The next step is to add the custom LocPipeline() to your settings.py file:

ITEM_PIPELINES = {'myproject.pipelines.LocPipeline': 300}

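With the custom pipeline added to your settings, Scrapy automatically calls LocPipeline().process_item() for each item yielded by MySpider().parse() and looks up the alternative XPath whenever no location has been found yet.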
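
Because process_item() receives only the item and the spider, not the response, one possible way to make the pipeline idea work is to let the spider store the alternative value in an extra field and have the pipeline choose between the two. The sketch below assumes a hypothetical field named location_fallback, which the spider would fill from the //a[@class="location"] XPath and which would have to be declared on the item class. A plain dict stands in for the item so the example runs without Scrapy.

```python
class LocPipeline:
    """Use the fallback location when the primary one is missing or blank."""

    def process_item(self, item, spider):
        # Drop blank strings such as the '' produced by an empty value attribute
        location = [v for v in item.get("location", []) if v.strip()]
        if not location:
            # "location_fallback" is a hypothetical field set by the spider
            location = item.get("location_fallback", [])
        item["location"] = location
        return item


# A plain dict stands in for the Scrapy item in this sketch
item = {"location": [""], "location_fallback": ["Amsterdam"]}
result = LocPipeline().process_item(item, spider=None)
print(result["location"])  # ['Amsterdam']
```

Keeping the selection in a pipeline leaves parse() free of branching, at the cost of carrying one extra field on every item.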

【Comments】: