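Does anyone know a simple function that converts existing csv files to UTF-8 encoding? Answers

【Question title】: Does anyone know a simple function that converts existing csv files to UTF-8 encoding?
【Posted】: 2015-12-13 01:58:09
【Question】:

I have huge csv files that contain '\xc3\x84'-style characters instead of German umlauts, because I scraped HTML with BeautifulSoup and wrote it to csv files using Python 2.7.8.

With the help of this question I managed to replace all of those characters: Python 2.7.1: How to Open, Edit and Close a CSV file

Now my code looks like this: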

import csv

new_rows = []
umlaut = {'\\xc3\\x84': 'Ä', '\\xc3\\x96': 'Ö', '\\xc3\\x9c': 'Ü', '\\xc3\\xa4': 'ä', '\\xc3\\xb6': 'ö', '\\xc3\\xbc': 'ü'}

with open('file1.csv', 'r') as csvFile:
    reader = csv.reader(csvFile)
    for row in reader:
        new_row = row
        for key, value in umlaut.items():
            new_row = [ x.replace(key, value) for x in new_row ]
        new_rows.append(new_row)

with open('file2.csv', 'wb') as f:
    writer = csv.writer(f)
    writer.writerows(new_rows)

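When I open the csv I see KÃ¶ln instead of Köln, along with other problems with German umlauts. I can fix this manually by opening the CSV file in Notepad and saving it as UTF-8, but I want to do it automatically with Python.

I don't really understand how to use UnicodeWriter:

https://docs.python.org/2/library/csv.html#examples

The answers and solutions I found on Stack Overflow are all a bit complicated.

My question is: how would I use UnicodeWriter in my case, for example? Do you know of a super simple function that does something like file2.encode('utf-8')? If no such simple function exists in Python, why doesn't it exist yet, given how common encoding errors are?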

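【Comments】:

Do you realize that the problem is the encoding used by whatever you open the file with? '\xc3\x84' is a UTF-8 encoded string.
I think the file is already UTF-8 encoded. '\\xc3\\x84' is the UTF-8 encoding of 'Ä', so replacing one with the other doesn't make much sense. When you "open the csv and see KÃ¶ln", how are you opening it? With Notepad? I think it decodes using your local code page rather than UTF-8. Microsoft puts an encoding hint called a BOM in its files, but Beautiful Soup doesn't. Could you post your encoding (print sys.stdin.encoding) so I can try it? Also, does print codecs.open('file1.csv', encoding='utf-8').read() print the characters correctly? If so, you already have UTF-8.
print sys.stdin.encoding outputs "cp850" in my console.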
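
The diagnosis in the comments above can be checked with a short sketch. It assumes Python 3 (the thread itself uses Python 2.7), and add_utf8_bom is a hypothetical helper name, not something from the thread or the standard library. The first part reproduces the KÃ¶ln effect by decoding UTF-8 bytes with the Windows code page cp1252. The second part copies a file that is assumed to hold valid UTF-8 already and prepends a BOM, which makes it more likely that Notepad and Excel detect the encoding.

```python
import shutil

# UTF-8 bytes read with the wrong code page produce the familiar mojibake.
utf8_bytes = "K\u00f6ln".encode("utf-8")      # b'K\xc3\xb6ln'
mojibake = utf8_bytes.decode("cp1252")        # 'KÃ¶ln'


def add_utf8_bom(src_path, dst_path):
    """Copy a UTF-8 file to dst_path, writing a BOM at the start.

    Hypothetical helper: assumes src_path already holds valid UTF-8.
    Reading with 'utf-8-sig' drops an existing BOM, so it is not doubled.
    """
    with open(src_path, "r", encoding="utf-8-sig", newline="") as src, \
         open(dst_path, "w", encoding="utf-8-sig", newline="") as dst:
        # Copy in chunks so that huge files are not loaded into memory at once.
        shutil.copyfileobj(src, dst)
```

Passing newline="" on both sides leaves the line endings of the original file untouched, so only the BOM is added.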

Tags: python python-2.7 csv encoding utf-8

【Solution 1】:

You can use the string-escape encoding instead of your own mapping:

>>> print '\\xc3\\x84'.decode('string-escape')
Ä

import csv

def iter_decode(it):
    for line in it:
        yield line.decode('string-escape')

# Python 2's csv module expects files opened in binary mode ('rb'/'wb')
with open('file1.csv', 'rb') as csvFile, open('file2.csv', 'wb') as f:
    reader = csv.reader(iter_decode(csvFile))
    writer = csv.writer(f)
    for row in reader:
        writer.writerow(row)
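
The string-escape codec exists only in Python 2. For readers on Python 3, a roughly equivalent sketch follows. unescape_utf8 is a hypothetical helper name, and the approach assumes the input consists of ASCII text plus literal \xNN escapes that spell out UTF-8 bytes.

```python
def unescape_utf8(escaped):
    """Turn literal '\\xc3\\x84'-style escapes into the characters they encode.

    Assumes `escaped` is a bytes object holding only ASCII text plus \\xNN escapes.
    """
    # 'unicode_escape' maps each \xNN escape to the code point U+00NN ...
    as_latin1_text = escaped.decode("unicode_escape")
    # ... so encoding as Latin-1 recovers the raw bytes, which are then UTF-8.
    return as_latin1_text.encode("latin-1").decode("utf-8")


city = unescape_utf8(b"K\\xc3\\xb6ln")   # the str 'Köln'
```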

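【Comments】:

I tried your suggested solution, but it didn't work. I think that's because Excel's default encoding is ANSI. You probably face the same problem as my code, because the writing part needs to be done with something like UnicodeWriter.
To falsetru: so my mapping worked as well? Is it really just about how and where I open the file? @Padriac Cunningham, I don't get the same output; when I try print '\xc3\x84', my console still prints strange symbols.
@dima, that's because your shell's encoding is not UTF-8; this is a Windows issue, not a Python one.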
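
The console effect described in these comments can be reproduced without a Windows shell. The sketch below assumes Python 3 and simply decodes the same two bytes with different codecs. cp850 is the console encoding reported earlier in the thread.

```python
raw = b"\xc3\x84"                     # UTF-8 encoding of 'Ä'

as_utf8 = raw.decode("utf-8")         # what a UTF-8 aware viewer shows
as_cp850 = raw.decode("cp850")        # what a cp850 console makes of the same bytes

# One character when decoded correctly, two unrelated ones on cp850.
lengths = (len(as_utf8), len(as_cp850))
```

The bytes on disk are identical in both cases; only the codec used for display differs.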

【Solution 2】:

Assuming you have the UnicodeWriter from the docs:

import codecs
import cStringIO
import csv

class UnicodeWriter:
    """
    A CSV writer which will write rows to CSV file "f",
    which is encoded in the given encoding.
    """

    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        # Redirect output to a queue
        self.queue = cStringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, row):
        self.writer.writerow([s.encode("utf-8") for s in row])
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = self.encoder.encode(data)
        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)

Use it like this:

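from __future__ import unicode_literals
import codecs
# UnicodeWriter encodes each row itself, so hand it a plain binary file;
# a codecs.open() stream would try to encode the already-encoded bytes again
f = open("somefile.csv", mode='wb')
writer = UnicodeWriter(f)
for data in some_buffer:
    writer.writerow(data)
f.close()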
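
On Python 3 the csv module works with text directly, so a wrapper such as UnicodeWriter is generally not needed. A minimal sketch under that assumption follows. write_utf8_csv is a hypothetical name, and the choice of utf-8-sig (UTF-8 with a BOM) is an assumption aimed at Excel and Notepad rather than something stated in the answer.

```python
import csv


def write_utf8_csv(path, rows):
    """Write rows (iterables of str) to path as UTF-8 with a BOM."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", encoding="utf-8-sig", newline="") as f:
        csv.writer(f).writerows(rows)
```

If the BOM is not wanted, encoding="utf-8" can be passed instead; the rest of the sketch stays the same.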

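【Comments】:

You should attribute and link to the docs you took UnicodeWriter from.
Recommending from __future__ import unicode_literals is not a good idea. It confuses users when they come back asking for further help.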