Answers: Encoding issue when writing to a text file with Python

【Question title】: Encoding issue when writing to text file, with Python
【Posted】: 2013-04-18 04:27:09
【Question】:

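I am writing a short Python script that "manually" rearranges a CSV file into valid JSON syntax. I read the input file with readlines() to get a list of lines, manipulate that list and merge it into a single string, and then write the string to a separate .txt file. However, the output contains gibberish instead of the Hebrew characters that are in the input file, and the output is double-spaced horizontally (a blank character is inserted between every two characters). As far as I can tell the problem is related to encoding, but I cannot work out what exactly. When I check the encoding of the input and output files (using the .encoding attribute), both return None, which means they use the system default. Technical details: Python 2.7, Windows 7.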

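Although there are many questions on this topic, I did not find a direct answer to my problem. Detecting the system default does not help me here, because the program needs to be portable.

Here is the code: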

def txt_to_JSON(csv_list):
    # ...some manipulation of the list...
    return JSON_string

file_name = "input_file.txt"
my_file = open(file_name)
# make each line of input file a value in a list
lines = my_file.readlines()
# break up each line into a list such that each 'column' is a value in that list
for i in range(0, len(lines)):
    lines[i] = lines[i].split("\t")
J_string = txt_to_JSON(lines)
json_file = open("output_file.txt", "w+")
# write the same variable that was assigned above (the original wrote an undefined name, jstring)
json_file.write(J_string)
json_file.close()

【Comments on the question】:

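It is worth noting that when working with files in Python it is best to use the with statement.
Do you know what the encoding of the input file is?
@PauloBu He is reading Hebrew characters, but his program is working in ASCII. That is most likely the problem.
Which version of Python?
I'm glad. If you want some background to explain this to your lead, these links will be very helpful, especially the first one: joelonsoftware.com/articles/Unicode.html, stackoverflow.com/questions/3951722/… and stackoverflow.com/questions/643694/utf-8-vs-unicode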

Tags: python encoding

【Solution 1】:

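All data has to be encoded before it can be stored on disk. If you do not know the encoding, the best you can do is guess. There is a library for that: https://pypi.python.org/pypi/chardet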
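
As a rough illustration of what "guessing" can mean, here is a minimal sketch that only looks for a byte-order mark (BOM) at the start of the data. This is not how chardet works and is far less capable; the function name guess_encoding_from_bom and the fallback value are assumptions made up for this example.

```python
import codecs

# BOMs are checked with UTF-32 first, because the UTF-32 LE BOM begins
# with the same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding_from_bom(raw_bytes, fallback="utf-8"):
    """Return a codec name suggested by a leading BOM, else the fallback.

    Hypothetical helper: a file without a BOM tells us nothing here,
    so the fallback is only a guess.
    """
    for bom, name in _BOMS:
        if raw_bytes.startswith(bom):
            return name
    return fallback
```

The returned name can be passed straight to bytes.decode(). Files saved without a BOM need a real detector or knowledge of where the file came from.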

I highly recommend Ned Batchelder's talk http://nedbatchelder.com/text/unipain.html for the details.

There is an explanation of why "unicode" is used as the name of an encoding on Windows here: What's the difference between Unicode and UTF-8?

TL;DR: Microsoft uses UTF-16 as the encoding for Unicode strings, but decided to call it "unicode" because they also use it internally.
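
If the input file really is UTF-16 (an assumption; the thread never confirms it), that would be one plausible explanation for the "blank character between every two characters" symptom in the question. The short sketch below shows what UTF-16 bytes look like when they are misread with a one-byte-per-character codec; latin-1 is used only because it accepts every byte value.

```python
text = u"abc"

# In UTF-16 (little-endian here) each of these characters takes two bytes,
# and the second byte is zero for plain ASCII letters.
utf16_bytes = text.encode("utf-16-le")

# Misreading the bytes one at a time exposes the zero bytes, which many
# viewers render as gaps between the letters.
misread = utf16_bytes.decode("latin-1")

print(repr(misread))
```

Hebrew letters do not have a zero second byte in UTF-16, so under the same misreading they turn into unrelated characters rather than into spaced-out text.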

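Even though Python 2 is somewhat lenient about str/unicode conversions, you should get into the habit of always decoding your input and encoding your output.

In your case: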

filename = 'where your data lives'
with open(filename, 'rb') as f:
    encoded_data = f.read()
decoded_data = encoded_data.decode("UTF-16")

# do stuff, resulting in result (all on unicode strings)
result = text_to_json(decoded_data)

encoded_result = result.encode("UTF-16")  # really, just using UTF-8 for everything makes things a lot easier
outfile = 'where your data goes'
with open(outfile, 'wb') as f:
    f.write(encoded_result)
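
The same decode-on-input, encode-on-output pattern can also be sketched with io.open, which takes the encoding as a parameter and exists on Python 2.7 as well as Python 3. The function name convert_file, the transform argument (a stand-in for the conversion step) and the default encodings are assumptions for illustration; the input encoding in particular has to match the real file.

```python
import io


def convert_file(in_path, out_path, transform,
                 in_encoding="utf-16", out_encoding="utf-8"):
    """Decode in_path, apply transform to the text, encode the result to out_path."""
    # io.open decodes while reading, so `text` is a unicode string.
    with io.open(in_path, "r", encoding=in_encoding) as src:
        text = src.read()

    # transform must take and return unicode text.
    result = transform(text)

    # io.open encodes while writing. The encoding is stated explicitly,
    # so the output does not depend on the system default.
    with io.open(out_path, "w", encoding=out_encoding) as dst:
        dst.write(result)
```

Writing UTF-8 on output follows the suggestion in the code comment above; whether the consuming program expects UTF-8 still needs to be checked.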

【Comments】:

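Thanks for your input. However, when I do this, the output file (created by f.write()) is still encoded as ANSI, so I get a UnicodeEncodeError when it reaches the Hebrew characters. By the way, utf_16 is the correct spelling.
Following your link, I changed the encoding from 'utf_16' to 'utf_16_le' and got a similar error, only this time it relates to the beginning of the file rather than to the non-ASCII characters.
What program do you use to open the output file?
I use Notepad. How would that affect the encoding?
A program has to decode the file in order to interpret its contents. Could you put the two files, or similar files with nonsense text in place of the original, somewhere? I would like to take a look.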
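
The start-of-file error reported after switching to 'utf_16_le' may be related to the byte-order mark (BOM). This is only a guess, since the thread never shows the traceback or the file. The sketch below uses a made-up byte string to show how the two codec names treat a leading BOM differently.

```python
import codecs

# A UTF-16 file as some Windows tools save it: little-endian with a leading BOM.
raw = codecs.BOM_UTF16_LE + u"abc".encode("utf-16-le")

# Endianness is fixed by the codec name, so the BOM is decoded as an
# ordinary character (U+FEFF) and stays at the start of the text.
with_bom_kept = raw.decode("utf-16-le")

# The generic codec reads the BOM to pick the endianness, then drops it.
with_bom_stripped = raw.decode("utf-16")

print(repr(with_bom_kept))
print(repr(with_bom_stripped))
```

A leftover U+FEFF at the start of the text cannot be represented in an ANSI code page, so it would be the first character to fail if the result is later encoded that way.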

【Solution 2】:

You need to tell Python to use a Unicode character encoding to decode the Hebrew characters. Here is a link on how to read Unicode characters in Python: Character reading from file in Python

【Comments】:

Sorry, this did not lead me to a solution. I tried using the codecs module, but the output did not change at all.