Extract gzip file without BOM in Python 3.6

【Question title】: Extract gzip file without BOM in Python 3.6
【Posted】: 2018-03-07 08:48:39
【Question description】:

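I want to decompress multiple gz files that sit in subfolders of one folder. That part works fine, but every extracted file starts with a BOM signature that I want to remove. I checked other questions such as "Removing BOM from gzip'ed CSV in Python" and "Convert UTF-8 with BOM to UTF-8 with no BOM in Python", but they don't seem to work for me. I'm using Python 3.6 in PyCharm on Windows.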

Here is my code, without any of my attempts to remove the BOM:

import gzip
import pickle
import glob


def save_object(obj, filename):
    with open(filename, 'wb') as output:  # Overwrites any existing file.
        pickle.dump(obj, output, pickle.HIGHEST_PROTOCOL)


output_path = 'path_out'

i = 1

for filename in glob.iglob(
        'path_in/**/*.gz', recursive=True):
    print(filename)
    with gzip.open(filename, 'rb') as f:
        file_content = f.read()
    new_file = output_path + "z" + str(i) + ".txt"
    save_object(file_content, new_file)
    f.close()
    i += 1

Now, if I replace file_content = f.read() with file_content = csv.reader(f.read().decode('utf-8-sig').encode('utf-8').splitlines()), following the logic from "Removing BOM from gzip'ed CSV in Python" (at least as I understand it), I get:

TypeError: can't pickle _csv.reader objects

I looked into this error (for example "Can't pickle <type '_csv.reader'>" error when using multiprocessing on Windows), but did not find a solution I could apply.

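【Comments】:

How exactly does it "not seem to work"? Your current code doesn't appear to contain any attempt.
Since I tried several solutions, I thought showing clean code would make it easier to get feedback.
That is exactly the problem: show us precisely what you tried and how it failed.
I have updated my description.
If your input isn't CSV, you shouldn't be using csv.reader() on the text data you just successfully converted. The attempt to pickle suggests there may be a more fundamental misunderstanding.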

Tags: python byte-order-mark

【Solution 1】:

This is a small adaptation of the answer to the first question you linked to.

tripleee$ cat bomgz.py
import gzip
from subprocess import run

with open('bom.txt', 'w') as handle:
    handle.write('\ufeffmoo!\n')

run(['gzip', 'bom.txt'])

with gzip.open('bom.txt.gz', 'rb') as f:
    file_content = f.read().decode('utf-8-sig')
with open('nobom.txt', 'w') as output:
    output.write(file_content)

tripleee$ python3 bomgz.py

tripleee$ gzip -dc bom.txt.gz | xxd
00000000: efbb bf6d 6f6f 210a                      ...moo!.

tripleee$ xxd nobom.txt
00000000: 6d6f 6f21 0a                             moo!.

The pickle part doesn't seem relevant here, and it may have been obscuring the actual goal: getting a decoded str out of an encoded blob of bytes.
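Applied to the loop in the question, the same idea might look roughly like the sketch below. The helper names strip_bom_from_gz and extract_all are made up for illustration, and the z1.txt, z2.txt, ... output naming is only carried over from the question. It also assumes the files really are UTF-8 text. Passing encoding='utf-8' explicitly is an addition on my part, so that the result does not depend on the platform default encoding (which on Windows is often not UTF-8).

```python
import glob
import gzip
import os


def strip_bom_from_gz(gz_path, out_path):
    """Decompress gz_path and write its text to out_path without a UTF-8 BOM."""
    with gzip.open(gz_path, 'rb') as f:
        # 'utf-8-sig' drops a leading BOM if there is one and otherwise
        # behaves like plain 'utf-8'.
        text = f.read().decode('utf-8-sig')
    # newline='' writes the line endings exactly as they were decoded;
    # the explicit encoding avoids the platform default.
    with open(out_path, 'w', encoding='utf-8', newline='') as output:
        output.write(text)


def extract_all(input_dir, output_dir):
    """Hypothetical helper: handle every .gz file below input_dir."""
    os.makedirs(output_dir, exist_ok=True)
    pattern = os.path.join(input_dir, '**', '*.gz')
    written = []
    for i, gz_path in enumerate(sorted(glob.iglob(pattern, recursive=True)), start=1):
        out_path = os.path.join(output_dir, 'z{}.txt'.format(i))
        strip_bom_from_gz(gz_path, out_path)
        written.append(out_path)
    return written
```

Calling extract_all('path_in', 'path_out') would then replace the original loop. No pickle is involved, because the output is meant to be plain text rather than a serialized Python object.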

【Discussion】:

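OK, since I'm not a Python expert, it took me a while to understand your reply. The last two "with" blocks, the read and the write, work in my code. There is no BOM in the output now. Thank you!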