How to gzip a file with Unicode encoding from the Linux command prompt? Answers

【Question title】: How to gzip file with unicode encoding using linux cmd prompt?
【Posted】: 2016-08-31 13:08:22
【Question】:

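I have a large TSV file (30 GB). I need to load all of this data into Google BigQuery, so I split the file into smaller chunks, gzip each chunk, and move the chunks to Google Cloud Storage. After that, I call the Google BigQuery API to load the data from GCS. But I am facing the following encoding error.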

file_data.part_0022.gz: Error detected while parsing row starting at position: 0. Error: Bad character (ASCII 0) encountered. (error code: invalid)

I use the following Unix commands from my Python code to do the splitting and gzipping.

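import logging
import subprocess

# Split the source file into 300000-line chunks with 4-digit numeric suffixes.
cmd = [
    "split",
    "-l", "300000",
    "-d",
    "-a", "4",
    "%s%s" % (<my-dir>, file_name),
    "%s/%s.part_" % (<my temp dir>, file_prefix)
]
code = subprocess.check_call(cmd)

# Compress every chunk in place; shell=True lets the shell expand the glob.
cmd = 'gzip %s%s/%s.part*' % (<my temp dir>, file_prefix, file_prefix)
logging.info("Running shell command: %s" % cmd)
code = subprocess.Popen(cmd, shell=True)
code.communicate()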

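The file is split and gzipped successfully (file_data.part_0001.gz, file_data.part_0002.gz, etc.), but when I try to load these files into BigQuery, it throws the error above. I understand that this is an encoding issue. Is there any way to encode the files during the split and gzip operations? Or do we need to use a Python file object to read line by line, do the Unicode encoding, and write to a new gzip file (the Pythonic way)?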

【Comments】:

Tags: linux python-2.7 unicode google-bigquery google-cloud-storage

【Answer 1】:

Cause:

Error: Bad character (ASCII 0) encountered.

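clearly indicates that the file contains Unicode (UTF-16) characters, such as tabs, that cannot be decoded: in UTF-16, every ASCII character, including the tab, is stored with an extra zero byte, which is the ASCII 0 that BigQuery reports. The BigQuery service only supports UTF-8 and latin1 text encodings, so the file needs to be UTF-8 encoded.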
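To check whether this diagnosis applies before re-uploading anything, a quick heuristic test can help. The sketch below assumes Python 3; the names `looks_like_utf16` and `file_looks_like_utf16` are hypothetical helpers, not part of any library. It only looks for a UTF-16 byte order mark or NUL bytes in a sample, so treat a positive result as a hint rather than proof.

```python
import codecs


def looks_like_utf16(sample):
    """Heuristically decide whether a byte sample is UTF-16 text."""
    # A UTF-16 byte order mark at the start is the strongest hint.
    if sample.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return True
    # ASCII-range characters in UTF-16 carry a zero byte, whereas plain
    # UTF-8 or latin1 TSV data should normally contain no NUL bytes.
    return b"\x00" in sample


def file_looks_like_utf16(path, sample_size=4096):
    """Apply the heuristic to the first sample_size bytes of a file."""
    with open(path, "rb") as f:
        return looks_like_utf16(f.read(sample_size))
```

Calling `file_looks_like_utf16` on the original TSV file (rather than on a chunk) avoids confusion from chunks that start in the middle of a two-byte character.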

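Solution: I have not tested this. Use the -a or --ascii flag with the gzip command so that BigQuery can decode the file. Note that the gzip manual describes -a only as end-of-line conversion on some non-Unix systems, so it may not change the character encoding at all; converting the file to UTF-8 before splitting and compressing it (for example with iconv) is likely the more reliable route.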
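For the "Pythonic way" the question asks about, one possible approach is to re-encode, split, and compress in a single pass. This is a minimal sketch under stated assumptions, not a tested production loader: it assumes Python 3, assumes the source really is UTF-16 with a byte order mark (the `src_encoding` default), and uses a hypothetical helper name, `reencode_split_gzip`. It works on the original file rather than on the output of `split`, because splitting UTF-16 data on line-feed bytes can cut a two-byte character in half, which would explain a zero byte at position 0 of a chunk.

```python
import gzip


def reencode_split_gzip(src_path, dst_prefix, src_encoding="utf-16",
                        lines_per_part=300000):
    """Write src_path as gzipped UTF-8 chunks of lines_per_part lines each.

    Returns the list of chunk paths, named <dst_prefix>0000.gz and so on.
    """
    paths = []
    out = None
    try:
        # newline="\n" splits lines on "\n" only and leaves them untranslated,
        # which mirrors how `split -l` counts lines.
        with open(src_path, "r", encoding=src_encoding, newline="\n") as src:
            for i, line in enumerate(src):
                if i % lines_per_part == 0:
                    if out is not None:
                        out.close()
                    path = "%s%04d.gz" % (dst_prefix, len(paths))
                    out = gzip.open(path, "wt", encoding="utf-8", newline="\n")
                    paths.append(path)
                out.write(line)
    finally:
        if out is not None:
            out.close()
    return paths
```

A call such as `reencode_split_gzip("file_data.tsv", "file_data.part_")` (file names are illustrative) would produce `file_data.part_0000.gz`, `file_data.part_0001.gz`, and so on. Reading line by line keeps memory use small, which matters for a 30 GB input, at the cost of being slower than the `split` and `gzip` binaries.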

【Comments】: