How to detect the right file encoding with Python? (Answers)

【Question title】: How to detect the right file encoding with python?
【Posted】: 2021-03-26 23:21:51
【Question description】:

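I am trying to read a CSV file (from https://openwrt.org/_media/toh_dump_tab_separated.zip) in Python with pandas.read_csv(). The problem is the file's encoding. It is not UTF-8, and it does not seem to be Latin1 either. I also don't want to go through all the codecs by hand (https://docs.python.org/3/library/codecs.html#standard-encodings).

My workaround is to open the file in LibreOffice, replace the strange characters with "-", save it as Latin1, and then open it in Python.

How can I do this in Python alone?

The following code and error show my current attempt with UTF-8: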

import pandas as pd
df = pd.read_csv('../../Downloads/toh_dump_tab_separated/ToH_dump_tab_separated.csv', encoding = 'utf-8')

(...)

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbf in position 983: invalid start byte

And with Latin1:

import pandas as pd
df = pd.read_csv('../../Downloads/toh_dump_tab_separated/ToH_dump_tab_separated.csv', encoding = 'Latin1')

(...)

ParserError: Error tokenizing data. C error: Expected 1 fields in line 3, saw 2

【Comments on the question】:

The encoding appears to be cp1252.

Tags: python character-encoding

【Solution 1】:

Use the sep parameter (the file is tab-separated) together with the cp1252 encoding:

import pandas as pd
df = pd.read_csv('ToH_dump_tab_separated.csv', encoding = 'cp1252', sep='\t')
print(df)

          pid  ...                                           comments
0       16132  ...                                                NaN
1       16133  ...                                                NaN
2       16134  ...                                                NaN
3       16135  ...                           Clone of Aztech HW550-3G
4       16137  ...  Image build disabled in master with commit d7d...
...       ...  ...                                                ...
1759  9726386  ...                                                NaN
1760  9878711  ...  Rough edges as of December 2020. Realtek targe...
1761  9912125  ...  Works with WL-WN575A3 image according OpenWrt ...
1762  9927580  ...                                                NaN
1763  9946488  ...                                                NaN

[1764 rows x 67 columns]

FYI, the strange character 0xbf is ¿, the inverted question mark U+00BF (or \u00BF):

print( df.switch[:2]); print( df.fccid[-2:])

0    Infineon ADM6996I
1                    ¿
Name: switch, dtype: object
1762                    http://¿
1763    https://fcc.io/Q87-03331
Name: fccid, dtype: object
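
As a small illustration of why UTF-8 rejects this byte while the single-byte codecs accept it, the sketch below decodes the lone byte 0xbf with three codecs. The helper name try_decode is hypothetical and only used here; in UTF-8, 0xbf is a continuation byte and cannot start a character, which matches the "invalid start byte" message in the question.

```python
raw = b"\xbf"  # the byte reported in the UnicodeDecodeError from the question


def try_decode(data, encoding):
    """Hypothetical helper: return the decoded text, or None if decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# Compare how each codec treats the same byte.
results = {enc: try_decode(raw, enc) for enc in ("utf-8", "latin1", "cp1252")}
print(results)
```

Both latin1 and cp1252 map 0xbf to U+00BF, so this byte alone cannot tell the two codecs apart.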

Edit (thanks to Mark Tolonen): the encoding appears to be cp1252. Some fields contain smart quotes:

print( df.comments[254][288:])

Ignore the “HW v” on the label - it may not say 2 for v2 hardware
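
The smart quotes are where cp1252 and latin1 part ways. The following is a minimal sketch, assuming the file stores the quotes as the bytes 0x93 and 0x94 (which is how cp1252 encodes “ and ”); the byte string is a made-up fragment, not read from the actual file.

```python
# Assumed raw bytes for part of the comment field (illustrative, not read from the file).
raw = b"Ignore the \x93HW v\x94 on the label"

as_cp1252 = raw.decode("cp1252")  # 0x93/0x94 become the curly quotes U+201C/U+201D
as_latin1 = raw.decode("latin1")  # 0x93/0x94 become invisible C1 control characters

print(as_cp1252)
print(repr(as_latin1))
```

latin1 never raises an error because it maps every byte to some code point, so a successful decode with latin1 says nothing about whether it is the right codec.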

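【Comments】:

The encoding appears to be cp1252. Some fields contain smart quotes.
Thanks for the help! But how did you know it was cp1252, @MarkTolonen? Just a roll of the magic dice, or did you look closely at the characters and know the codecs well?
@Cyoux Option 2. I loaded the data and built a set of the file's contents minus the set of ASCII characters; when opened as cp1252, what remained was a few French accented letters and smart quotes. latin1 does not support smart quotes. 1252 is the common codec on US and Western European Windows.
@Cyoux Read this thread: What is the exact difference between Windows-1252(1/3/4) and ISO-8859-1?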
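
One possible reading of the set-difference check described in the comments is sketched below. It is not necessarily the commenter's exact code: the function name non_ascii_chars is hypothetical, and the temporary file is a tiny made-up stand-in for the real CSV so that the example runs on its own.

```python
import os
import tempfile


def non_ascii_chars(path, encoding):
    """Decode the whole file with `encoding` and return the set of non-ASCII characters."""
    with open(path, encoding=encoding, newline="") as f:
        return {ch for ch in f.read() if ord(ch) > 127}


# Made-up sample standing in for the real file: an accent, smart quotes, and 0xbf.
sample = "caf\u00e9\t\u201cHW v\u201d\t\u00bf\n".encode("cp1252")
with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as tmp:
    tmp.write(sample)
    sample_path = tmp.name

# Inspect what is left over under each candidate codec.
found = {enc: non_ascii_chars(sample_path, enc) for enc in ("cp1252", "latin1")}
for enc, chars in found.items():
    print(enc, sorted(chars))

os.remove(sample_path)
```

If the leftover characters under a candidate codec are plausible text (accented letters, curly quotes), the codec is a reasonable guess; if they include C1 control characters such as \x93, as happens here under latin1, that is a hint the codec is wrong.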