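[Posted]: 2017-05-15 00:35:39
[Question]:
Question
See the updated question in the Edit section below.
I am trying to decompress large (~300M) GZIPed files from Amazon S3 on the fly using GZIPInputStream, but it only outputs part of the file; however, if I download the file to the filesystem before decompressing, GZIPInputStream decompresses the entire file.
How can I get GZIPInputStream to decompress the entire HTTPInputStream and not just the first part of it?
What I've tried
See the update in the Edit section below.
I suspected an HTTP problem, except that no exceptions are ever thrown, GZIPInputStream returns a fairly consistent chunk of the file each time, and, as far as I can tell, it always breaks on a WET record boundary, although the boundary it picks is different for each URL (which is very strange, because everything is treated as a binary stream and no parsing of the WET records in the file is happening at all).
The closest question I could find is "GZIPInputStream is prematurely closed when reading from s3". The answer to that question was that some GZIP files are actually multiple appended GZIP files, which GZIPInputStream doesn't handle well. However, if that is the case, why does GZIPInputStream work fine on a local copy of the file?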
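To make the "multiple appended GZIP files" idea concrete, here is a minimal sketch that builds two independent gzip members in memory, appends them, and reads the result back through GZIPInputStream. The class and method names (ConcatGzipDemo, gzip, concat, gunzip) are made up for illustration, and the input is a ByteArrayInputStream rather than a file or HTTP stream, so it only shows the in-memory case, where both members come back.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class ConcatGzipDemo {
    // Compress one string into a complete, standalone gzip member.
    public static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return bytes.toByteArray();
    }

    // Append two byte arrays, similar to `cat a.gz b.gz > both.gz`.
    public static byte[] concat(byte[] a, byte[] b) {
        byte[] out = new byte[a.length + b.length];
        System.arraycopy(a, 0, out, 0, a.length);
        System.arraycopy(b, 0, out, a.length, b.length);
        return out;
    }

    // Decompress everything GZIPInputStream is willing to return.
    public static String gunzip(byte[] data) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(data))) {
            byte[] buffer = new byte[1024];
            int n;
            while ((n = gz.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Two separate gzip members glued together into one byte sequence.
        byte[] both = concat(gzip("hello "), gzip("world"));
        System.out.println(gunzip(both));
    }
}
```

A multi-member file is still a valid gzip file, so the interesting question is not whether the format is legal but when the reader decides that no further member follows.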
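Demo code and output
Below is a piece of sample code that demonstrates the problem I am seeing. I've tested it with Java 1.8.0_72 and 1.8.0_112 on two different Linux computers on two different networks, with similar results. I expect the byte count from the decompressed HTTPInputStream to be identical to the byte count from the decompressed local copy of the file, but the decompressed HTTPInputStream is much smaller.
Output
Testing URL https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00009-ip-10-31-129-80.ec2.internal.warc.wet.gz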
Testing HTTP Input Stream direct to GZIPInputStream
Testing saving to file before decompression
Read 87894 bytes from HTTP->GZIP
Read 448974935 bytes from HTTP->file->GZIP
Output from HTTP->GZIP saved to file testfile0.wet
------
Testing URL https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00040-ip-10-31-129-80.ec2.internal.warc.wet.gz
Testing HTTP Input Stream direct to GZIPInputStream
Testing saving to file before decompression
Read 1772936 bytes from HTTP->GZIP
Read 451171329 bytes from HTTP->file->GZIP
Output from HTTP->GZIP saved to file testfile40.wet
------
Testing URL https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698541142.66/wet/CC-MAIN-20161202170901-00500-ip-10-31-129-80.ec2.internal.warc.wet.gz
Testing HTTP Input Stream direct to GZIPInputStream
Testing saving to file before decompression
Read 89217 bytes from HTTP->GZIP
Read 453183600 bytes from HTTP->file->GZIP
Output from HTTP->GZIP saved to file testfile500.wet
Sample code
import java.net.*;
import java.io.*;
import java.util.zip.GZIPInputStream;
import java.nio.channels.*;

public class GZIPTest {
    public static void main(String[] args) throws Exception {
        // Our three test files from CommonCrawl
        URL url0 = new URL("https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00009-ip-10-31-129-80.ec2.internal.warc.wet.gz");
        URL url40 = new URL("https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00040-ip-10-31-129-80.ec2.internal.warc.wet.gz");
        URL url500 = new URL("https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698541142.66/wet/CC-MAIN-20161202170901-00500-ip-10-31-129-80.ec2.internal.warc.wet.gz");

        /*
         * Test the URLs and display the results
         */
        test(url0, "testfile0.wet");
        System.out.println("------");
        test(url40, "testfile40.wet");
        System.out.println("------");
        test(url500, "testfile500.wet");
    }

    public static void test(URL url, String testGZFileName) throws Exception {
        System.out.println("Testing URL "+url.toString());

        // First directly wrap the HTTPInputStream with GZIPInputStream
        // and count the number of bytes we read
        // Go ahead and save the extracted stream to a file for further inspection
        System.out.println("Testing HTTP Input Stream direct to GZIPInputStream");
        int bytesFromGZIPDirect = 0;
        URLConnection urlConnection = url.openConnection();
        FileOutputStream directGZIPOutStream = new FileOutputStream("./"+testGZFileName);

        // FIRST TEST - Decompress from HTTPInputStream
        GZIPInputStream gzipishttp = new GZIPInputStream(urlConnection.getInputStream());

        byte[] buffer = new byte[1024];
        int bytesRead = -1;
        while ((bytesRead = gzipishttp.read(buffer, 0, 1024)) != -1) {
            bytesFromGZIPDirect += bytesRead;
            directGZIPOutStream.write(buffer, 0, bytesRead); // save to file for further inspection
        }
        gzipishttp.close();
        directGZIPOutStream.close();

        // Now save the GZIPed file locally
        System.out.println("Testing saving to file before decompression");
        int bytesFromGZIPFile = 0;
        ReadableByteChannel rbc = Channels.newChannel(url.openStream());
        FileOutputStream outputStream = new FileOutputStream("./test.wet.gz");
        outputStream.getChannel().transferFrom(rbc, 0, Long.MAX_VALUE);
        outputStream.close();

        // SECOND TEST - decompress from FileInputStream
        GZIPInputStream gzipis = new GZIPInputStream(new FileInputStream("./test.wet.gz"));

        buffer = new byte[1024];
        bytesRead = -1;
        while((bytesRead = gzipis.read(buffer, 0, 1024)) != -1) {
            bytesFromGZIPFile += bytesRead;
        }
        gzipis.close();

        // The Results - these numbers should match but they don't
        System.out.println("Read "+bytesFromGZIPDirect+" bytes from HTTP->GZIP");
        System.out.println("Read "+bytesFromGZIPFile+" bytes from HTTP->file->GZIP");
        System.out.println("Output from HTTP->GZIP saved to file "+testGZFileName);
    }
}
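Edit
Closed the streams and the associated channel in the demo code, per @VGR's comment.
Update:
The problem does seem to be something specific to the file. I pulled the Common Crawl WET archive down locally (wget), uncompressed it (gunzip 1.8), recompressed it (gzip 1.8), and re-uploaded it to S3, and the on-the-fly decompression then worked fine. You can see the test if you modify the sample code above to include the following lines: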
// Original file from CommonCrawl hosted on S3
URL originals3 = new URL("https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00009-ip-10-31-129-80.ec2.internal.warc.wet.gz");
// Recompressed file hosted on S3
URL rezippeds3 = new URL("https://s3-us-west-1.amazonaws.com/com.jeffharwell.commoncrawl.gziptestbucket/CC-MAIN-20161202170900-00009-ip-10-31-129-80.ec2.internal.warc.wet.gz");
test(originals3, "originalhost.txt");
test(rezippeds3, "rezippedhost.txt");
The URL rezippeds3 points to the WET archive file that I downloaded, decompressed, recompressed, and then re-uploaded to S3. You will see the following output:
Testing URL https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00009-ip-10-31-129-80.ec2.internal.warc.wet.gz
Testing HTTP Input Stream direct to GZIPInputStream
Testing saving to file before decompression
Read 7212400 bytes from HTTP->GZIP
Read 448974935 bytes from HTTP->file->GZIP
Output from HTTP->GZIP saved to file originals3.txt
-----
Testing URL https://s3-us-west-1.amazonaws.com/com.jeffharwell.commoncrawl.gziptestbucket/CC-MAIN-20161202170900-00009-ip-10-31-129-80.ec2.internal.warc.wet.gz
Testing HTTP Input Stream direct to GZIPInputStream
Testing saving to file before decompression
Read 448974935 bytes from HTTP->GZIP
Read 448974935 bytes from HTTP->file->GZIP
Output from HTTP->GZIP saved to file rezippeds3.txt
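As you can see, once the file was recompressed I was able to stream it through GZIPInputStream and get the entire file. The original file still shows the usual premature end of decompression. When I downloaded and re-uploaded the WET file without recompressing it, I got the same incomplete streaming behavior, so it was definitely the recompression that fixed it. I also put both files, the original and the recompressed one, on a traditional Apache web server and was able to replicate the results, so S3 does not seem to have anything to do with the problem.
So. I have a new question.
New question
Why does a FileInputStream behave differently from an HTTPInputStream when reading the same content? If it is the exact same file, why does:
new GZIPInputStream(urlConnection.getInputStream());
behave any differently from
new GZIPInputStream(new FileInputStream("./test.wet.gz"));
?? Isn't an input stream just an input stream ??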
[Comments]:
- Regarding the "save the GZIPed file locally" code: Channels need to be closed, just like InputStreams and OutputStreams.
- OpenJDK bug JDK-8081450 looks like the same issue.
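The following sketch illustrates one way two input streams over identical bytes can behave differently. It assumes the bug mentioned above concerns GZIPInputStream consulting available() on the underlying stream when deciding whether another gzip member follows; that reading is an assumption and is not confirmed anywhere in this thread. TrickleStream is a hypothetical wrapper, not a JDK class: it mimics a network stream by reporting available() as 0 and handing back at most 8 bytes per read. Whether the wrapped read stops after the first member depends on the JDK version, so the program prints both counts rather than asserting a particular result.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class AvailableZeroDemo {
    // Hypothetical stand-in for a network stream: same bytes, different behavior.
    static class TrickleStream extends FilterInputStream {
        TrickleStream(InputStream in) {
            super(in);
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            // Return short reads, as a socket often does.
            return super.read(b, off, Math.min(len, 8));
        }

        @Override
        public int available() {
            // "Nothing readable without blocking", which is legal at any time.
            return 0;
        }
    }

    // Build a two-member gzip byte sequence: gzip(first) followed by gzip(second).
    public static byte[] twoMembers(String first, String second) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String part : new String[] {first, second}) {
            GZIPOutputStream gz = new GZIPOutputStream(out);
            gz.write(part.getBytes(StandardCharsets.UTF_8));
            gz.finish(); // ends this member without closing the shared buffer
        }
        return out.toByteArray();
    }

    // Count how many decompressed bytes GZIPInputStream returns for a source.
    public static long countDecompressed(InputStream source) throws IOException {
        long total = 0;
        try (GZIPInputStream gz = new GZIPInputStream(source)) {
            byte[] buffer = new byte[1024];
            int n;
            while ((n = gz.read(buffer)) != -1) {
                total += n;
            }
        }
        return total;
    }

    // Same data, read through the network-like wrapper.
    public static long countThroughTrickle(byte[] data) throws IOException {
        return countDecompressed(new TrickleStream(new ByteArrayInputStream(data)));
    }

    public static void main(String[] args) throws IOException {
        byte[] data = twoMembers("hello ", "world");
        System.out.println("plain:   " + countDecompressed(new ByteArrayInputStream(data)));
        System.out.println("trickle: " + countThroughTrickle(data));
    }
}
```

If the two counts differ on a given JDK, that would be consistent with the file-versus-HTTP difference described in the question; if they match, that JDK does not rely on available() in this way and the cause lies elsewhere.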
Tags: java amazon-s3 gzipinputstream