Best way to detect (utf) encoding in unknown file [duplicate]

【Question title】: Best way to detect (utf) encoding in unknown file [duplicate]
【Posted】: 2018-12-19 20:15:30
【Question】:

This is what I currently use to open the various files that a user has:

# check the encoding quickly
with open(file, 'rb') as fp:
    start_data = fp.read(4)
    if start_data.startswith(b'\x00\x00\xfe\xff'):
        encoding = 'utf-32'
    elif start_data.startswith(b'\xff\xfe\x00\x00'):
        encoding = 'utf-32'
    elif start_data.startswith(b'\xfe\xff'):
        encoding = 'utf-16'
    elif start_data.startswith(b'\xff\xfe'):
        encoding = 'utf-16'
    else:
        encoding = 'utf-8'

# open the file with that encoding
with open(file, 'r', encoding=encoding) as fp:
    do_something()

Is there a better way than the above to correctly open an unknown utf file?

【Comments】:

Tags: python csv unicode byte-order-mark

【Solution 1】:

If you know the file is utf, you can use chardet to do the following:

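from chardet.universaldetector import UniversalDetector

detector = UniversalDetector()

with open(file, 'rb') as fp:
    # feed the detector a sample of the first 1000 bytes
    detector.feed(fp.read(1000))
detector.close()

# result['encoding'] is None when chardet cannot make a guess
raw = (detector.result['encoding'] or '').lower()
# collapse the guess to one of three codecs, defaulting to utf-8
encoding = 'utf-32' if 'utf-32' in raw else 'utf-16' if 'utf-16' in raw else 'utf-8'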

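Note: magic and some of the other libraries mentioned in the question "Determine the encoding of text in Python" did not work for me. Also note that a file that is in utf-8 will often be labelled as ascii. That is harmless here, because ASCII is a subset of UTF-8 and the code above falls back to utf-8 anyway.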
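
If installing chardet is not an option, the BOM check from the question can also be written with the constants in the standard `codecs` module. The following is only a sketch under a few assumptions: the helper name `sniff_utf_encoding` and its `default` parameter are made up for illustration, and a file without a BOM is simply assumed to be utf-8. Unlike the question's code it also recognises the UTF-8 BOM and returns `utf-8-sig`, so the BOM is not left at the start of the decoded text.

```python
import codecs

# Check the 4-byte UTF-32 BOMs before the 2-byte UTF-16 ones:
# the UTF-32 LE BOM begins with the same bytes as the UTF-16 LE BOM.
_BOMS = (
    (codecs.BOM_UTF32_BE, 'utf-32'),
    (codecs.BOM_UTF32_LE, 'utf-32'),
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF16_BE, 'utf-16'),
    (codecs.BOM_UTF16_LE, 'utf-16'),
)


def sniff_utf_encoding(path, default='utf-8'):
    """Return a codec name chosen from the file's BOM, or `default` if there is none."""
    with open(path, 'rb') as fp:
        head = fp.read(4)
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return default
```

The result can be passed straight to `open(path, 'r', encoding=sniff_utf_encoding(path))`. The generic `utf-16` and `utf-32` codecs read the BOM themselves to pick the byte order, so there is no need to return the `-le` or `-be` variants.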

【Comments】: