Answers to: Python UnicodeEncodeError when writing to file

【Question title】: Python UnicodeEncodeError when writing to file
【Posted】: 2017-11-03 09:51:49
【Question】:

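I am using the Python library "pdfminer.six" to extract all the text from several PDFs I have. My approach works flawlessly, but for some PDFs that probably contain special characters, I get "UnicodeEncodeError: 'charmap' codec can't encode character '\u03b2' in position 271130: character maps to <undefined>" when I write the text to a text file. I know *what* is happening, but I would like to know the best way to handle it. This is the part giving me a headache: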

    with open("newTxtFile.txt", "w") as textFile:
        textFile.write(text)

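Since I am from Brazil and the text is in Portuguese, I want to keep all the accents, so I use "codec = 'latin-1'" in pdfminer. As far as I can tell, everything works right up to the end (printing the text before saving is fine), but whenever I try to save it to a file I get the UnicodeEncodeError.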

I can think of two options. Either I find a way to catch only the specific characters that are giving me trouble:

    with open("newTxtFile.txt", "w") as textFile:
        try:
            textFile.write(text)
        except UnicodeEncodeError:
            ????

But I don't know what should go inside the except block.

Or I should save to the file in a different way.

Can anyone give me some advice? Thank you very much!

【Comments on the question】:

What is the actual UnicodeEncodeError you get?
@MaximTitarenko UnicodeEncodeError: 'charmap' codec can't encode character '\u03b2' in position 271130: character maps to <undefined>

Tags: python file unicode pdfminer

【Solution 1】:

Try writing the text as UTF-8 encoded bytes:

    with open("newTxtFile.txt", "wb") as textFile:
        textFile.write(text.encode('utf8'))

To read it back:

    with open("newTxtFile.txt", "rb") as textFile:
        text = textFile.read().decode('utf8')
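
To illustrate why the original write fails and why UTF-8 does not, here is a minimal sketch. It assumes the asker's default file encoding is cp1252 (a common "charmap" codec on Windows; the question does not state the platform), and both the sample string and the helper `can_encode` are made up for illustration.

```python
# Hypothetical sample: Portuguese accents plus the Greek beta (U+03B2)
# named in the error message.
text = "coração, ação, \u03b2-caroteno"


def can_encode(s, encoding):
    """Return True if every character of s is representable in `encoding`."""
    try:
        s.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# The accented letters fit in cp1252, so most PDFs save without trouble.
accents_ok = can_encode("coração", "cp1252")

# U+03B2 exists in neither cp1252 nor latin-1, so the full text cannot be encoded.
cp1252_ok = can_encode(text, "cp1252")
latin1_ok = can_encode(text, "latin-1")

# UTF-8 can represent every Unicode character, accents included.
utf8_ok = can_encode(text, "utf-8")
round_trip = text.encode("utf-8").decode("utf-8")

print(accents_ok, cp1252_ok, latin1_ok, utf8_ok)
```

This would also explain why only a few PDFs cause trouble: under that assumption, only text containing characters outside the default codec triggers the error.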

【Comments】:

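My text is of type str. If I use this encoding, I get "TypeError: write() argument must be str, not bytes". Also, won't utf-8 mess up all the accents and special characters in Portuguese?
@fallremix utf-8 is one of the most complete encodings, so I don't think it will mess anything up. I have edited my post to open the file in "wb" mode instead of "w".
It works! I just had to decode it afterwards. It is a bit odd, because almost all of my other PDF files were saved correctly; only a few gave me a headache.
No explicit encoding/decoding is needed. Use with open(...,encoding='utf8') (Python 3) or with io.open(...,encoding='utf8') (Python 2 or 3) and read/write Unicode strings.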
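
The suggestion in the last comment could look roughly like the sketch below (Python 3). The sample text is hypothetical and stands in for the pdfminer output, and the files go into a temporary directory purely so the example has no side effects. The second half shows the `errors="replace"` argument of `open()` as one possible answer to the "what goes in the except?" question, assuming a legacy encoding such as cp1252 really is required.

```python
import os
import tempfile

# Hypothetical sample text standing in for the extracted PDF text.
text = "coração, ação, \u03b2-caroteno"

# Temporary directory so the sketch does not touch real files.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "newTxtFile.txt")

# Text mode with an explicit encoding: write the str directly, no encode().
with open(path, "w", encoding="utf8") as textFile:
    textFile.write(text)

# Reading with the same encoding returns a str, no decode() needed.
with open(path, "r", encoding="utf8") as textFile:
    restored = textFile.read()

# Lossy alternative (assumption: the file must stay in cp1252):
# characters the codec cannot encode are written as "?" instead of raising.
legacy_path = os.path.join(tmp_dir, "legacy.txt")
with open(legacy_path, "w", encoding="cp1252", errors="replace") as textFile:
    textFile.write(text)

with open(legacy_path, "r", encoding="cp1252") as textFile:
    lossy = textFile.read()

print(restored == text)
print(lossy)
```

The UTF-8 variant keeps every character, while the `errors="replace"` variant silently loses the beta, so it only makes sense when the target encoding cannot be changed.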