GZipStream quietly fails on large file, stream ends at 2GB

【Question title】: GZipStream quietly fails on large file, stream ends at 2GB
【Posted】: 2016-02-12 04:41:31
【Question】:

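I'm having trouble using GZipStream to decompress the Freebase RDF dump (30GB compressed text, 480GB uncompressed): the stream ends prematurely. No exception is thrown; gz.Read() simply starts returning zero: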

using(var gz = new GZipStream(File.Open("freebase-rdf-latest.gz", FileMode.Open), CompressionMode.Decompress))
{
    var buffer = new byte[1048576];
    int read, total = 0;
    while ((read = gz.Read(buffer, 0, buffer.Length)) > 0)
        total += read;

    // total is 1945715682 here
    // subsequent reads return 0
}

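The file decompresses fine with other applications (I tried gzip and 7zip).

Sniffing around, I found this note in a previous version of the GZipStream documentation on MSDN:

The GZipStream class might not be able to decompress data that results in over 8 GB of uncompressed data.

The note has been removed from the latest version of the documentation. I'm using .NET 4.5.2, and for me the stream ends after decompressing just under 2GB.

Does anyone know more about this limitation? The wording in the documentation implies there are other preconditions beyond simply decompressing more than 8GB. I'm fairly sure I have used GZipStream on very large files in the past without hitting this problem.

Also, can anyone recommend a drop-in replacement for GZipStream that I could use instead of System.IO.Compression?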

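Update

I tried replacing System.IO.Compression with Ionic.Zlib (DotNetZip) and got the same result.

I tried GZipInputStream from ICSharpCode.SharpZipLib and got "Unknown block type 6" on the first read.

I tried SevenZipSharp, but it has no stream decorator for reading, only various blocking "extract" methods that decompress the whole stream, which is not what I want.

Another update

Using zlib1.dll, the following code decompresses the whole file correctly. It also runs in 1/4 of the time GZipStream takes!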

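var gzFile = gzopen("freebase-rdf-latest.gz", "rb");

var buffer = new byte[1048576];
int read;
long total = 0; // long: the uncompressed size is far beyond int.MaxValue
while ((read = gzread(gzFile, buffer, buffer.Length)) > 0)
    total += read;

gzclose(gzFile); // release the zlib file handle

// P/Invoke declarations (require System.Runtime.InteropServices); zlib exports use cdecl
[DllImport("zlib1", CallingConvention = CallingConvention.Cdecl)] static extern IntPtr gzopen(string path, string mode);
[DllImport("zlib1", CallingConvention = CallingConvention.Cdecl)] static extern int gzread(IntPtr gzFile, byte[] buf, int len);
[DllImport("zlib1", CallingConvention = CallingConvention.Cdecl)] static extern int gzclose(IntPtr gzFile);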

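...so apparently all the existing GZip libraries in .NET have some compatibility issue with zlib. The zlib1.dll I used comes from my mingw64 directory (there are about a dozen zlib1.dll files on my machine, but this is the only 64-bit one).

【Comments】:

Were you compiling as x86 or x64? Could you also take a look at DeflateStream? It uses zlib under the hood. I'm not sure DeflateStream will work for what you are doing, though.
@AdamSears x64, but I tried 32-bit as well and it made no difference. GZipStream encapsulates DeflateStream.

Tags: c# .net gzipstream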

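【Solution 1】:

I'm a bit late, but I have found the cause of this problem and a solution.

The big file contains not just one gzip stream but 200 of them. (Compressed size of each gzip stream: 150-155 MB)

The first "gzip-file" uses the optional extra field to describe the lengths of all the compressed gzip streams. Many decompressors do not support this multi-stream layout and decode only the first entry. (150 MB -> 2 GB)

1.: read-header method: (sorry if it looks like hacker style :-)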
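To make the multi-stream layout concrete, here is a minimal sketch (not taken from the answer) that writes several gzip members back to back and then decodes them one at a time from their known compressed lengths. The names MultiMemberGzipDemo, WriteMembers and ReadMembers are hypothetical. Each member is sliced into its own MemoryStream so the decoder cannot run into the next member, whatever a given runtime does at a member boundary.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;

static class MultiMemberGzipDemo
{
    // Compresses each chunk as its own gzip member and appends it to 'output'
    // (which must be seekable). Returns the compressed length of every member.
    public static long[] WriteMembers(Stream output, IEnumerable<byte[]> chunks)
    {
        var lengths = new List<long>();
        foreach (byte[] chunk in chunks)
        {
            long start = output.Position;
            // leaveOpen: true keeps 'output' usable for the next member
            using (var gz = new GZipStream(output, CompressionMode.Compress, true))
                gz.Write(chunk, 0, chunk.Length);
            lengths.Add(output.Position - start);
        }
        return lengths.ToArray();
    }

    // Decompresses one member at a time, using the known compressed lengths.
    public static List<byte[]> ReadMembers(byte[] file, long[] lengths)
    {
        var result = new List<byte[]>();
        int offset = 0;
        foreach (long length in lengths)
        {
            // Read-only view over exactly this member's bytes
            using (var slice = new MemoryStream(file, offset, (int)length, false))
            using (var gz = new GZipStream(slice, CompressionMode.Decompress))
            using (var plain = new MemoryStream())
            {
                gz.CopyTo(plain);
                result.Add(plain.ToArray());
            }
            offset += (int)length;
        }
        return result;
    }
}
```

In the Freebase file the per-member lengths are not known up front by the reader; they come from the "SZ" extra field of the first header, which is what the read-header method parses.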

static long[] ReadGzipLengths(Stream stream)
{
  if (!stream.CanSeek || !stream.CanRead) return null; // can seek and read?

  int fieldBytes;
  if (stream.ReadByte() == 0x1f && stream.ReadByte() == 0x8b // gzip magic-code
      && stream.ReadByte() == 0x08 // deflate-mode
      && stream.ReadByte() == 0x04 // flagged: has extra-field
      && stream.ReadByte() + stream.ReadByte() + stream.ReadByte() + stream.ReadByte() >= 0 // unix timestamp (ignored)
      && stream.ReadByte() == 0x00 // extra-flag: should be zero
      && stream.ReadByte() >= 0 // OS-Type (ignored)
      && (fieldBytes = stream.ReadByte() + stream.ReadByte() * 256 - 4) > 0 // length of extra-field (subtract 4 bytes field-header)
      && stream.ReadByte() == 0x53 && stream.ReadByte() == 0x5a // field-header: must be "SZ" (mean: gzip-sizes as uint32-values)
      && stream.ReadByte() + stream.ReadByte() * 256 == fieldBytes // should have same length
    )
  {
    var buf = new byte[fieldBytes];
    if (stream.Read(buf, 0, fieldBytes) == fieldBytes && fieldBytes % 4 == 0)
    {
      var result = new long[fieldBytes / 4];
      for (int i = 0; i < result.Length; i++) result[i] = BitConverter.ToUInt32(buf, i * sizeof(uint));
      stream.Position = 0; // reset stream-position
      return result;
    }
  }

  // --- fallback for normal gzip-files or unknown structures ---
  stream.Position = 0; // reset stream-position
  return new[] { stream.Length }; // return single default-length
}

2.: reader

static void Main(string[] args)
{
  using (var fileStream = File.OpenRead(@"freebase-rdf-latest.gz"))
  {
    long[] gzipLengths = ReadGzipLengths(fileStream);
    long gzipOffset = 0;

    var buffer = new byte[1048576];
    long total = 0;

    foreach (long gzipLength in gzipLengths)
    {
      fileStream.Position = gzipOffset;

      using (var gz = new GZipStream(fileStream, CompressionMode.Decompress, true)) // true <- don't close FileStream at Dispose()
      {
        int read;
        while ((read = gz.Read(buffer, 0, buffer.Length)) > 0) total += read;
      }

      gzipOffset += gzipLength;

      Console.WriteLine("Uncompressed Bytes: {0:N0} ({1:N2} %)", total, gzipOffset * 100.0 / fileStream.Length);
    }
  }
}

3.: result

Uncompressed Bytes: 1.945.715.682 (0,47 %)
Uncompressed Bytes: 3.946.888.647 (0,96 %)
Uncompressed Bytes: 5.945.104.284 (1,44 %)
...
...
Uncompressed Bytes: 421.322.787.031 (99,05 %)
Uncompressed Bytes: 423.295.620.069 (99,53 %)
Uncompressed Bytes: 425.229.008.315 (100,00 %)

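It takes a while (30-40 minutes), but it works! (no external libraries)

Speed: about 200 MB/s of uncompressed data

With a few small changes, multi-threading should be possible.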
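One way the multi-threaded variant could look is sketched below. This is an assumption-laden illustration, not the answer author's code: ParallelGzipMembers, CountUncompressedBytes and maxThreads are hypothetical names, and the lengths array is expected to hold the compressed size of each member (for example, the values returned by the header-reading method). Each worker loads one member's compressed bytes into memory before decoding, so memory use is roughly the member size (about 150 MB here) times the number of threads.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Threading;
using System.Threading.Tasks;

static class ParallelGzipMembers
{
    // Counts the uncompressed bytes of a multi-member gzip file,
    // decoding the members in parallel.
    public static long CountUncompressedBytes(string path, long[] lengths, int maxThreads)
    {
        // Prefix sums: offsets[i] is the file position where member i starts
        var offsets = new long[lengths.Length];
        for (int i = 1; i < lengths.Length; i++)
            offsets[i] = offsets[i - 1] + lengths[i - 1];

        long total = 0;
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxThreads };
        Parallel.For(0, lengths.Length, options, i =>
        {
            // Load exactly one member so the decoder cannot read past its end
            var compressed = new byte[lengths[i]];
            using (var file = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read))
            {
                file.Position = offsets[i];
                int filled = 0;
                while (filled < compressed.Length)
                {
                    int n = file.Read(compressed, filled, compressed.Length - filled);
                    if (n == 0) throw new EndOfStreamException("member " + i + " is truncated");
                    filled += n;
                }
            }

            long count = 0;
            var buffer = new byte[1 << 20];
            using (var gz = new GZipStream(new MemoryStream(compressed), CompressionMode.Decompress))
            {
                int read;
                while ((read = gz.Read(buffer, 0, buffer.Length)) > 0) count += read;
            }
            Interlocked.Add(ref total, count); // thread-safe accumulation
        });
        return total;
    }
}
```

The sketch only counts bytes, so the order in which members finish does not matter. Anything that consumes the decompressed text in sequence would need to reassemble the members in their original order.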

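【Comments】:

I just went with zlib, but I'm glad someone finally solved this mystery, thanks!
Thank you, this was driving me crazy. Same file, different language (Ruby), same problem. If anyone else stumbles on this and is on UNIX, it can be worked around by changing GzipReader.open(file) { |f| ... } to IO.popen(["/usr/bin/gzcat", file]) { |f| ... }.

【Solution 2】:

For large files, you should not use a StreamReader: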

        var buffer = new byte[1024 * 1024];
        using (var gz = new GZipStream(new FileStream("freebase-rdf-latest.gz", FileMode.Open), CompressionMode.Decompress))
        {
            int bytesRead;
            // Read returns 0 at end of stream; a short read does not mean the stream has ended
            while ((bytesRead = gz.Read(buffer, 0, buffer.Length)) > 0)
            {
                Console.WriteLine(bytesRead);
            }
        }

【Comments】:

I don't know, but I ran into problems when I used StreamReader with GZipStream.
The problem is reproducible without StreamReader.