Python BeautifulSoup creating weird \xe2 unicode characters when written to file

【Question title】: Python BeautifulSoup creating weird \xe2 unicode characters when written to file
【Posted】: 2018-01-20 22:22:46
【Question】:

I am using BeautifulSoup to parse a bunch of web pages that I downloaded locally with WGet.

I am reading the files like this:

file = open(file_name, 'r', encoding='utf-8').read()
soup = BeautifulSoup(file, 'html5lib')

I use this soup object to extract the text, which I then write to a .json file like this:

f.write('"text": "' + str(text.encode('utf-8')) )

However, when I open the .json file, I see strings like these:

and\xe2\x80\x94in spite of

He hadn\xe2\x80\x99t shaved in a few days at least

and Michael can go.\xe2\x80\x9d\xc2\xa0 Her voice

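I understand that these weird characters are not UTF-8, so Python doesn't know what to do with them. But I don't know how to fix this.

Thanks for your help.

Edit: I am using python3

Also, if I remove the part that encodes the text before writing it, I get the following error: UnicodeEncodeError: 'ascii' codec can't encode character '\u2019' in position 264: ordinal not in range(128)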

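【Comments】:

Are you opening the file with UTF-8 encoding?
It looks like you're using Python 3. You should always mention the Python version in Unicode questions, because Python 2 and 3 differ quite a lot in that area. But anyway, hex sequences like \xe2\x80\x94 are actually valid UTF-8 multi-byte sequences. When decoded correctly, they become: and—in spite of / He hadn’t shaved in a few days at least / and Michael can go.” Her voice. I used this code to perform that conversion: s.encode('latin1').decode(). But I don't know BeautifulSoup, so I can't tell you the proper way to fix this.
Recommended reading: joelonsoftware.com/2003/10/08/…
Also: nedbatchelder.com/text/unipain.html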
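
The s.encode('latin1').decode() conversion mentioned in the comment above can be tried in isolation. This is only a sketch: it assumes the string holds one character per raw UTF-8 byte (U+00E2, U+0080, U+0099), not literal backslash sequences, and the variable names are made up for illustration.

```python
# Assumed input: each character stands for one raw byte of UTF-8 data.
garbled = "He hadn\xe2\x80\x99t shaved in a few days at least"

# latin1 maps code points 0-255 straight back to byte values 0-255,
# so this step recovers the original UTF-8 bytes.
raw_bytes = garbled.encode("latin1")

# Decoding those bytes as UTF-8 yields the intended text.
repaired = raw_bytes.decode("utf-8")

# ascii() keeps the output printable on any console encoding.
print(ascii(repaired))
```

This repairs text after the fact; it does not remove the cause, which is the str(...encode(...)) call at write time.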

Tags: python json unicode beautifulsoup

【Solution 1】:

With str(text.encode('utf-8')) you get:

>>> text = 'He hadn’t shaved in a few days'
>>> text.encode('utf8')
b'He hadn\xe2\x80\x99t shaved in a few days'
>>> str(text.encode('utf8'))
"b'He hadn\\xe2\\x80\\x99t shaved in a few days'"
>>> print(str(text.encode('utf8')))
b'He hadn\xe2\x80\x99t shaved in a few days'

So the file contains exactly what you are inadvertently writing to it.
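
To see the effect on the question's own write call, the line can be built without touching a file. This is a minimal sketch that reuses the sample sentence from the session above; the variable names are illustrative.

```python
text = "He hadn\u2019t shaved in a few days"

# The same expression the question passes to f.write().
line = '"text": "' + str(text.encode("utf-8"))

# str() on a bytes object returns its repr: the b'...' wrapper plus
# literal backslash escapes for every non-ASCII byte.
print(line)
print("b'" in line)                  # True
print("\\xe2\\x80\\x99" in line)     # True
```

The result is plain ASCII, which is why the write succeeds but the file shows \xe2\x80\x99 instead of the apostrophe.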

Don't build JSON by hand; use the json module instead. Given UTF-8-encoded input:

<html>
<p>He hadn’t shaved in a few days</p>
</html>

Then:

from bs4 import BeautifulSoup
import json

# Good practice:
# Decode text data to Unicode when read into a program.
# Process text as Unicode in the program.
# Encode text when leaving the program, such as:
#    Writing to database.
#    Sending over a network socket.
#    Writing to a file.

# Read the content as Unicode text.
with open('test.html','r',encoding='utf8') as file:
    content = file.read()
soup = BeautifulSoup(content, 'html.parser')  # name the parser explicitly to avoid the "no parser specified" warning
text = soup.find('p').text    # Unicode string!

# Build the dictionary to be written in JSON format.
# Leave as Unicode!
items = {'text':text}

# Output as UTF-8-encoded data.
#
# ensure_ascii=False makes the non-ASCII characters in the file readable,
# but it works without it.  The file will just have Unicode escapes.
#
with open('out.json','w',encoding='utf8') as out:
    json.dump(items,out,ensure_ascii=False)


# Read and decode the data back from the file and turn it back into 
# a dictionary.
with open('out.json','r',encoding='utf8') as file:
    data = json.load(file)

print(data)

Output (Python dict):

{'text': 'He hadn’t shaved in a few days'}

File content with ensure_ascii=False:

{"text": "He hadn’t shaved in a few days"}

File content with ensure_ascii=True (the default):

{"text": "He hadn\u2019t shaved in a few days"}
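
The two settings can be compared without writing a file by using json.dumps, which returns the same text that json.dump would write. A small sketch with the same sample sentence:

```python
import json

items = {"text": "He hadn\u2019t shaved in a few days"}

escaped = json.dumps(items)                       # ensure_ascii=True is the default
readable = json.dumps(items, ensure_ascii=False)  # keeps U+2019 as a real character

print(escaped)                   # pure ASCII, safe to print anywhere
print("\\u2019" in escaped)      # True: six-character escape sequence
print("\u2019" in readable)      # True: the actual apostrophe character

# Both forms decode to the same dictionary.
print(json.loads(escaped) == json.loads(readable) == items)
```

Either output is valid JSON; the choice only affects how readable the file is in a text editor.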

【Comments】:

I tried this, but it gave me the following error: json.dump(items,f,ensure_ascii=False) File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/json/__init__.py", line 179, in dump fp.write(chunk) UnicodeEncodeError: 'ascii' codec can't encode character '\u2019' in position 258: ordinal not in range(128)
Never mind. I was able to fix this by opening the file I was writing to with codecs.open(file_name,'w', encoding="utf-8")
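
The error and its fix in the two comments above both come down to the encoding of the output file. The sketch below forces each case explicitly; it uses the built-in open, which in Python 3 accepts the same encoding argument as codecs.open, and the temporary file names are placeholders.

```python
import json
import os
import tempfile

items = {"text": "He hadn\u2019t shaved in a few days"}

with tempfile.TemporaryDirectory() as tmp:
    # An explicit UTF-8 encoding on the output file makes the write succeed.
    utf8_path = os.path.join(tmp, "out_utf8.json")
    with open(utf8_path, "w", encoding="utf-8") as out:
        json.dump(items, out, ensure_ascii=False)
    with open(utf8_path, "rb") as raw:
        utf8_bytes = raw.read()

    # Forcing ASCII reproduces the UnicodeEncodeError from the comment.
    ascii_path = os.path.join(tmp, "out_ascii.json")
    error = None
    try:
        with open(ascii_path, "w", encoding="ascii") as out:
            json.dump(items, out, ensure_ascii=False)
    except UnicodeEncodeError as exc:
        error = exc

print(b"\xe2\x80\x99" in utf8_bytes)  # True: the apostrophe stored as three UTF-8 bytes
print(type(error).__name__)           # UnicodeEncodeError
```

When no encoding is passed, open falls back to a platform-dependent default, so naming the encoding is the safer habit.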

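【Solution 2】:

Simplify your write: f.write('"text": "' + text) (or f.write('"text": "' + soup.prettify())). You are encoding material that is already encoded.

Use version 4.6.0: https://pypi.python.org/pypi/beautifulsoup4/

Use python3 -- you will find the str diagnostics more helpful than in python2; they give better guidance about when to encode or decode.

【Comments】: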

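I assume the OP is already using Python 3, since open(file_name, 'r', encoding='utf-8') doesn't work in Python 2; at least, the standard open built-in doesn't support the encoding keyword arg in Python 2 (although there are other opens that do).
If I prettify the soup, it becomes a string. I didn't show this in the question, but the text is extracted from HTML tags, which is why I need the actual soup object. Also, I tried removing the encoding when writing the text, but that produced an error, which I just edited into the original question.
You didn't show us how you open'd f. It sounds like your open chose the (default) ascii codec rather than the utf8 codec.
Mark Tolonen's code is very nice. Perhaps the best part is the comment block. Do follow the "good practice" advice. If you're unsure what kind of object you currently have, you can look at type(text). Also call encode or decode and look at the type of that result.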
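
The type(text) check suggested in the last comment could look like the following in Python 3. The sample value is arbitrary and only there to have a non-ASCII character to work with.

```python
text = "caf\u00e9"                    # arbitrary sample value

encoded = text.encode("utf-8")        # str -> bytes
decoded = encoded.decode("utf-8")     # bytes -> str

print(type(text).__name__)            # str
print(type(encoded).__name__)         # bytes
print(type(decoded).__name__)         # str

# In Python 3 the two directions live on different types:
# str has encode() only, bytes has decode() only.
print(hasattr(text, "decode"), hasattr(encoded, "encode"))  # False False
```

If the value about to be written is bytes, it has already been encoded and belongs in a file opened in binary mode; if it is str, it belongs in a text-mode file with an explicit encoding.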