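【Posted】: 2016-09-14 06:32:23
【Question】:
I am using Pandas 0.18.1 with Python 2.7.10 on OSX El Capitan 10.11.2, and read_csv() cannot read a UTF-16 file unless I set engine='python'.
The documentation says the Python parser is more feature-complete, so presumably Pandas defaults to the C parser, and the C parser does not yet support UTF-16. Can anyone confirm whether that is the case, or whether something else is going on here?
Here is a minimal reproduction: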
alanwagner : ~ ∴ pip2.7 freeze | grep pandas
pandas==0.18.1
alanwagner : ~ ∴ cat test.csv
col1,col2
val1,val2
alanwagner : ~ ∴ python
Python 2.7.10 (default, Oct 23 2015, 18:05:06)
[GCC 4.2.1 Compatible Apple LLVM 7.0.0 (clang-700.0.59.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> pd.read_csv('test.csv', encoding='utf8').to_csv('test-utf16.csv', encoding='utf16', index=False)
>>>
alanwagner : ~ ∴ cat test-utf16.csv
??col1,col2
val1,val2
alanwagner : ~ ∴ python
Python 2.7.10 (default, Oct 23 2015, 18:05:06)
[GCC 4.2.1 Compatible Apple LLVM 7.0.0 (clang-700.0.59.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> pd.read_csv('test-utf16.csv', encoding='utf16')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/pandas/io/parsers.py", line 562, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/Library/Python/2.7/site-packages/pandas/io/parsers.py", line 315, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/Library/Python/2.7/site-packages/pandas/io/parsers.py", line 645, in __init__
    self._make_engine(self.engine)
  File "/Library/Python/2.7/site-packages/pandas/io/parsers.py", line 799, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/Library/Python/2.7/site-packages/pandas/io/parsers.py", line 1213, in __init__
    self._reader = _parser.TextReader(src, **kwds)
  File "pandas/parser.pyx", line 520, in pandas.parser.TextReader.__cinit__ (pandas/parser.c:5129)
  File "pandas/parser.pyx", line 701, in pandas.parser.TextReader._get_header (pandas/parser.c:7665)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_16.py", line 16, in decode
    return codecs.utf_16_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode byte 0x63 in position 2: truncated data
>>> pd.read_csv('test-utf16.csv', encoding='utf16', engine='python')
   col1  col2
0  val1  val2
>>>
I can work around this by converting the file from UTF-16 to UTF-8 before loading it into a Pandas DataFrame.
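The conversion step is not shown in the question, so the following is only a minimal sketch of one way to do it with the standard library. The helper name convert_utf16_to_utf8 and the output file name test-utf8.csv are hypothetical, chosen for illustration; io.open is used because it behaves the same on Python 2.7 and Python 3.

```python
import io


def convert_utf16_to_utf8(src_path, dst_path):
    """Re-encode a UTF-16 text file as UTF-8 (hypothetical helper).

    The 'utf-16' codec reads the byte order mark to pick the byte order
    and drops it from the decoded text, so the output has no BOM.
    newline='' keeps the original line endings untouched.
    """
    with io.open(src_path, "r", encoding="utf-16", newline="") as src:
        with io.open(dst_path, "w", encoding="utf-8", newline="") as dst:
            for line in src:
                dst.write(line)


# Assumed usage with the file from the question:
#   convert_utf16_to_utf8('test-utf16.csv', 'test-utf8.csv')
#   pd.read_csv('test-utf8.csv', encoding='utf8')
```

Streaming line by line keeps memory use low for large files; for a small file, reading and writing the whole content at once would work just as well.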
【Discussion】:
-
Just curious: what happens if you try encoding='ISO-8859-1' or encoding='cp1252'? (I think 1252 will definitely fail, but this is just a shot in the dark.)
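To see what those single-byte encodings would do to UTF-16 data at the byte level, here is a small standard-library sketch. It does not call Pandas, so it says nothing about how read_csv() itself would behave; it only decodes the same bytes that test-utf16.csv would contain, assuming little-endian UTF-16 with a BOM (which matches the traceback, where byte 0x63, the letter "c", sits at position 2 right after the two BOM bytes).

```python
import codecs

# Same content as test.csv in the question.
text = u"col1,col2\nval1,val2\n"

# Assumption: little-endian UTF-16 with a byte order mark.
raw = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# Both single-byte codecs accept these bytes without raising, because
# 0xFF, 0xFE, 0x00 and the ASCII range are all defined in them.
as_latin1 = raw.decode("latin-1")
as_cp1252 = raw.decode("cp1252")

# The result is not the original text: the BOM becomes two visible
# characters and every ASCII character is followed by a NUL.
print(repr(as_latin1[:6]))
print(as_latin1.count(u"\x00"))

# Decoding with the right codec recovers the original text.
print(raw.decode("utf-16") == text)
```

So for this particular file neither codec raises a decode error; the decoded text is simply interleaved with NUL characters, which a CSV parser would then have to cope with.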