Mixed extended ASCII and normal strings while scraping HTML: answers

【Question title】: mixed extended ascii and normal string while scraping html
【Posted】: 2017-06-20 11:20:15
【Question description】:

I am working through the Scrapy tutorial (https://doc.scrapy.org/en/1.3/intro/tutorial.html), but one piece of code produces different output on my machine.

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('span small::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

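The correct output should be:

{"text": "“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”", "author": "Albert Einstein", "tags": ["change", "deep-thoughts", "thinking", "world"]}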

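But my output (in JSON) is:

{"text": "\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d", "author": "Albert Einstein", "tags": ["change", "deep-thoughts", "thinking", "world"]}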

This happens when I use the scrapy shell or export to a JSON file. If I export to CSV instead, it works fine. Does anyone have a solution?

Environment: Ubuntu, Python 3.5

【Question comments】:

Tags: python scrapy

【Solution 1】:

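First of all, it is fine for the output to be encoded this way: \u201c and \u201d are valid JSON escapes for the curly quotation marks, and most programs that load the file will decode them when necessary.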
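To see that the two forms carry the same data, here is a minimal sketch using Python's standard `json` module. It stands in for Scrapy's JSON exporter only as an illustration (an assumption on my part, not Scrapy's actual code path), and the `item` dict is a shortened, made-up version of the scraped quote.

```python
import json

# A shortened stand-in for the scraped item; the text contains
# curly quotes (U+201C and U+201D), written here as escapes.
item = {"text": "\u201cThe world as we have created it is a process of our thinking.\u201d"}

# By default json.dumps escapes every non-ASCII character as \uXXXX,
# which matches the shape of the output shown in the question.
escaped = json.dumps(item)

# With ensure_ascii=False the characters are written literally.
literal = json.dumps(item, ensure_ascii=False)

# The two strings differ on disk...
assert escaped != literal
# ...but any JSON parser decodes both back to the same object.
assert json.loads(escaped) == json.loads(literal) == item
```

So the escaped file is not corrupted; it only looks different when read as raw text.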

If you insist on encoding the JSON output another way, you can use Scrapy's FEED_EXPORT_ENCODING setting, as described in the Scrapy documentation.

I guess what you are looking for is FEED_EXPORT_ENCODING = 'utf-8' (in your settings.py file).
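As a rough sketch of where that line goes, assuming a project generated with `scrapy startproject` so that a project-level `settings.py` already exists (the comments are mine):

```python
# settings.py (project-level Scrapy settings module)

# Write feed exports as UTF-8, so that JSON output keeps characters such as
# curly quotes as literal characters instead of \uXXXX escape sequences.
FEED_EXPORT_ENCODING = "utf-8"
```

When the setting is left at its default, Scrapy uses UTF-8 for every feed format except JSON, which is why the CSV export in the question already looked right.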

【Comments】:

Thank you very much! This had been bothering me for a while! Your suggestion works perfectly! :)