extract_first() and extract() methods on Scrapy selectors are not returning the same value

【Question title】: extract_first() and extract() methods on Scrapy selectors are not returning the same value
【Posted】: 2018-04-30 09:24:18
【Question description】:

I am using Scrapy to collect data from a cinema web page.

With an XPath selector, if I use the selector together with the extract() method, like this:

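def parse_with_extract(self, response):
    div = response.xpath("//div[@class='col-sm-7 col-md-9']/p[@class='movie__option']")
    storage = []
    for i in div:  # each matched <p class="movie__option">
        data = i.xpath("text()").extract()  # list of all direct text nodes
        storage.append(data)
    return storage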

And if I use the selector together with the extract_first() method:

def parse_with_extract_first(self, response):
    div = response.xpath("//div[@class='col-sm-7 col-md-9']/p[@class='movie__option']")
    storage = []
    for i in div:
        data = i.xpath("text()").extract_first()
        storage.append(data)
    return storage

Why does the extract() method return all the characters, including "\xa0", while the extract_first() method returns an empty string?

【Comments】:

Could you provide a link to the page you are trying to scrape?
@StasDeep Here is the link

Tags: python web-scraping scrapy

【Solution 1】:

If you look closely at the response, you will see that the @class=movie__option element looks like this:

'<p class="movie__option" style="color: #000;">\n                                    <strong>Thursday 3rd of May 2018:</strong>\n                                    11:20am\xa0 \xa0  \n                                </p>'

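If you extract text() from this element, you basically get two strings: one before the strong tag and one after it (text() only selects text nodes that are direct children of the element):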

['\n                                    ',
 '\n                                    11:20am\xa0 \xa0  \n                                ']

All extract_first does is take the first of these two strings:

'\n                                    '
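
A minimal sketch of one way to work around this, assuming you keep the list returned by extract() and pick the first entry that is not pure whitespace. The helper name first_non_blank is hypothetical, not part of Scrapy, and the list is copied from the output above:

```python
def first_non_blank(texts):
    """Return the first string that still has content after strip(), or None."""
    for text in texts:
        stripped = text.strip()  # in Python 3, strip() also removes "\xa0"
        if stripped:
            return stripped
    return None


# The two text nodes shown above, as returned by extract().
texts = [
    "\n                                    ",
    "\n                                    11:20am\xa0 \xa0  \n                                ",
]

print(repr(texts[0]))                # the whitespace-only string extract_first() returns
print(repr(first_non_blank(texts)))  # the show time without surrounding whitespace
```

Inside the spider, the same helper could be applied to i.xpath("text()").extract() for each matched element.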

【Discussion】:

Thank you very much, I completely missed that.

【Solution 2】:

Well, judging from your output, it looks like the following:

['\n                                    ',
 '\n                                    11:20am\xa0 \xa0  \n                                ']

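It contains two strings.

My advice to anyone who gets this kind of data (with newlines and spaces) is to use Python's built-in strip() method. This method works on strings, so you can apply it like this: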

data = response.xpath("//path/to/your/data").get().strip()

For the string that contains the time, this makes your output look like this:

'11:20am'
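
One caveat with the one-liner above: get() returns None when the selector matches nothing, so chaining .strip() directly raises AttributeError in that case. A small sketch of a guard, where strip_or_default is a hypothetical helper name and the sample string is shortened from the output above:

```python
def strip_or_default(value, default=""):
    """Strip a string returned by get(); fall back to `default` if it is None."""
    if value is None:
        return default
    return value.strip()


raw = "\n        11:20am\xa0 \xa0  \n    "  # shortened sample of the second text node

print(repr(strip_or_default(raw)))   # time with newlines, spaces and "\xa0" removed
print(repr(strip_or_default(None)))  # fallback when nothing matched
```

In a spider this would be used as strip_or_default(response.xpath("//path/to/your/data").get()).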

Also, let's look at the difference between extract() and extract_first().

```
extract()
```

This method returns a list. It is the older name in Scrapy; the method used now is getall(), which behaves the same as extract() here.

extract() -- updated to --> getall()

Now let's look at the extract_first() method.

```
extract_first()
```

This method returns a str rather than a list (or None when nothing matches). It is also an older name in Scrapy; the method used now is get().

extract_first() -- updated to --> get()

【Discussion】: