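Your code has two problems:
You load the whole file into memory at once, on the assumption that it is a single line, so you need at least 200MB of heap space; and
Adding newlines with a regex like that is a spectacularly inefficient approach. A straightforward code solution will be an order of magnitude faster.
Both of these problems are easily fixed.
Use a FileReader and a FileWriter to load 309 characters at a time, append a newline and write them out.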
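Update: added a test of both character-by-character and buffered reading. The buffered reading actually adds a lot of complexity, because you need to cater for the possible (but generally very rare) situation where read() returns fewer characters than you asked for and there are still characters left to read.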
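The short-read case can be isolated in a small helper. The sketch below is one way to do it, not the code that was benchmarked; readFully, stingy and demo are hypothetical names, and stingy is a made-up reader that returns at most two characters per call in order to force the rare case:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class ReadFullyDemo {

    /**
     * Keeps reading until `want` chars are stored in buf or EOF is hit.
     * Returns the number of chars actually stored (0 at EOF).
     */
    static int readFully(Reader in, char[] buf, int want) throws IOException {
        int total = 0;
        while (total < want) {
            int n = in.read(buf, total, want - total);
            if (n == -1) {
                break; // end of stream
            }
            total += n; // advance by what was actually read, not by 1
        }
        return total;
    }

    /** Hypothetical reader that hands out at most 2 chars per call. */
    static Reader stingy(String s) {
        return new StringReader(s) {
            @Override
            public int read(char[] cbuf, int off, int len) throws IOException {
                return super.read(cbuf, off, Math.min(len, 2));
            }
        };
    }

    /** Splits "abcdefgh" into lines of 3 chars despite the short reads. */
    static String demo() {
        try (Reader in = stingy("abcdefgh")) {
            StringBuilder out = new StringBuilder();
            char[] buf = new char[3];
            int n;
            while ((n = readFully(in, buf, 3)) > 0) {
                out.append(buf, 0, n);
                if (n == 3) {
                    out.append('\n'); // newline only after a full block
                }
            }
            return out.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.print(demo());
    }
}
```

Here demo() returns "abc", "def" and "gh" separated by newlines, with no newline after the trailing partial block.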
First the simple version:
private static void charRead(boolean verifyHash) {
    Reader in = null;
    Writer out = null;
    long start = System.nanoTime();
    long wrote = 0;
    MessageDigest md = null;
    try {
        if (verifyHash) {
            md = MessageDigest.getInstance("SHA1");
        }
        in = new BufferedReader(new FileReader(IN_FILE));
        out = new BufferedWriter(new FileWriter(CHAR_FILE));
        int count = 0;
        for (int c = in.read(); c != -1; c = in.read()) {
            if (verifyHash) {
                // the (byte) cast is only safe because the test data is ASCII
                md.update((byte) c);
            }
            out.write(c);
            wrote++;
            if (++count >= COUNT) {
                // a full line of COUNT characters has been written
                if (verifyHash) {
                    md.update((byte) '\n');
                }
                out.write("\n");
                wrote++;
                count = 0;
            }
        }
    } catch (IOException e) {
        throw new RuntimeException(e);
    } catch (NoSuchAlgorithmException e) {
        throw new RuntimeException(e);
    } finally {
        safeClose(in);
        safeClose(out);
        long end = System.nanoTime();
        System.out.printf("Created %s size %,d in %,.3f seconds. Hash: %s%n",
                CHAR_FILE, wrote, (end - start) / 1000000000.0d, hash(md, verifyHash));
    }
}
And the "block" version:
private static void blockRead(boolean verifyHash) {
    Reader in = null;
    Writer out = null;
    long start = System.nanoTime();
    long wrote = 0;
    MessageDigest md = null;
    try {
        if (verifyHash) {
            md = MessageDigest.getInstance("SHA1");
        }
        in = new BufferedReader(new FileReader(IN_FILE));
        out = new BufferedWriter(new FileWriter(BLOCK_FILE));
        char[] buf = new char[COUNT + 1]; // leave a space for the newline
        int lastRead = in.read(buf, 0, COUNT); // read in 309 chars at a time
        while (lastRead != -1) { // -1 means end of file
            // technically fewer than 309 characters may have been read;
            // this is very unusual but possible, so we need to keep
            // reading until we get all the characters we want
            int totalRead = lastRead;
            while (totalRead < COUNT) {
                lastRead = in.read(buf, totalRead, COUNT - totalRead);
                if (lastRead == -1) {
                    break;
                } else {
                    totalRead += lastRead; // advance by the number of chars read
                }
            }
            // if we get -1, it'll eventually signal an exit but first
            // we must write any characters we have read
            // note: a trailing block of fewer than 309 characters does
            // not get a newline appended; this may not be what you want
            if (totalRead == COUNT) {
                buf[totalRead++] = '\n';
            }
            if (totalRead > 0) {
                out.write(buf, 0, totalRead);
                if (verifyHash) {
                    md.update(new String(buf, 0, totalRead).getBytes("UTF-8"));
                }
                wrote += totalRead;
            }
            // don't try and read again if we've already hit EOF
            if (lastRead != -1) {
                lastRead = in.read(buf, 0, COUNT);
            }
        }
    } catch (IOException e) {
        throw new RuntimeException(e);
    } catch (NoSuchAlgorithmException e) {
        throw new RuntimeException(e);
    } finally {
        safeClose(in);
        safeClose(out);
        long end = System.nanoTime();
        System.out.printf("Created %s size %,d in %,.3f seconds. Hash: %s%n",
                BLOCK_FILE, wrote, (end - start) / 1000000000.0d, hash(md, verifyHash));
    }
}
And a method to create the test file:
private static void createFile() {
    Writer out = null;
    long start = System.nanoTime();
    try {
        out = new BufferedWriter(new FileWriter(IN_FILE));
        Random r = new Random();
        for (int i = 0; i < SIZE; i++) {
            out.write(CHARS[r.nextInt(CHARS.length)]);
        }
    } catch (IOException e) {
        throw new RuntimeException(e);
    } finally {
        safeClose(out);
        long end = System.nanoTime();
        System.out.printf("Created %s size %,d in %,.3f seconds%n",
                IN_FILE, SIZE, (end - start) / 1000000000.0d);
    }
}
These all assume:
private static final int SIZE = 200000000;
private static final int COUNT = 309;
private static final char[] CHARS;
private static final char[] BYTES = new char[]{'0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 'b', 'c', 'd', 'e', 'f'};
private static final String IN_FILE = "E:\\temp\\in.dat";
private static final String CHAR_FILE = "E:\\temp\\char.dat";
private static final String BLOCK_FILE = "E:\\temp\\block.dat";

static {
    // build the alphabet for the test file: a-z, A-Z, 0-9 and space
    char[] chars = new char[1000];
    int nchars = 0;
    for (char c = 'a'; c <= 'z'; c++) {
        chars[nchars++] = c;
        chars[nchars++] = Character.toUpperCase(c);
    }
    for (char c = '0'; c <= '9'; c++) {
        chars[nchars++] = c;
    }
    chars[nchars++] = ' ';
    CHARS = new char[nchars];
    System.arraycopy(chars, 0, CHARS, 0, nchars);
}
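The methods above call safeClose and hash, which are not shown here. A minimal sketch of what they might look like, assuming safeClose simply swallows close failures and hash formats the digest as 0x followed by lowercase hex using the BYTES table (which would match the hash format in the results); the class name Helpers and the emptySha1 method are hypothetical and exist only to make the sketch runnable:

```java
import java.io.Closeable;
import java.io.IOException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class Helpers {

    // same hex lookup table as in the constants above
    private static final char[] BYTES = new char[]{'0', '1', '2', '3', '4', '5', '6', '7',
            '8', '9', 'a', 'b', 'c', 'd', 'e', 'f'};

    /** Closes a resource, tolerating null and ignoring close failures. */
    static void safeClose(Closeable c) {
        if (c != null) {
            try {
                c.close();
            } catch (IOException e) {
                // assumption: nothing useful can be done at this point
            }
        }
    }

    /** Returns "0x" + hex digest, or a placeholder when hashing was off. */
    static String hash(MessageDigest md, boolean verifyHash) {
        if (!verifyHash || md == null) {
            return "(not calculated)";
        }
        StringBuilder sb = new StringBuilder("0x");
        for (byte b : md.digest()) {
            sb.append(BYTES[(b >> 4) & 0xf]); // high nibble
            sb.append(BYTES[b & 0xf]);        // low nibble
        }
        return sb.toString();
    }

    /** Hypothetical demo: SHA-1 of empty input, formatted by hash(). */
    static String emptySha1() {
        try {
            return hash(MessageDigest.getInstance("SHA1"), true);
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(emptySha1());
        System.out.println(hash(null, false));
    }
}
```

Both Reader and Writer implement Closeable, so a single safeClose covers the input and output streams.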
Running this test:
public static void main(String[] args) {
    if (!new File(IN_FILE).exists()) {
        createFile();
    }
    charRead(true);
    charRead(true);
    charRead(false);
    charRead(false);
    blockRead(true);
    blockRead(true);
    blockRead(false);
    blockRead(false);
}
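gives this result (Intel Q9450, Windows 7 64bit, 8GB RAM, test run on a 7200rpm 1.5TB drive). The first four lines are the charRead runs and the last four are the blockRead runs; the run that produced this output printed CHAR_FILE in both methods, which is why every line says char.dat: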
Created E:\temp\char.dat size 200,647,249 in 29.690 seconds. Hash: 0x22ce9e17e17a67e5ea6f8fe929d2ce4780e8ffa4
Created E:\temp\char.dat size 200,647,249 in 18.177 seconds. Hash: 0x22ce9e17e17a67e5ea6f8fe929d2ce4780e8ffa4
Created E:\temp\char.dat size 200,647,249 in 7.911 seconds. Hash: (not calculated)
Created E:\temp\char.dat size 200,647,249 in 7.867 seconds. Hash: (not calculated)
Created E:\temp\char.dat size 200,647,249 in 8.018 seconds. Hash: 0x22ce9e17e17a67e5ea6f8fe929d2ce4780e8ffa4
Created E:\temp\char.dat size 200,647,249 in 7.949 seconds. Hash: 0x22ce9e17e17a67e5ea6f8fe929d2ce4780e8ffa4
Created E:\temp\char.dat size 200,647,249 in 3.958 seconds. Hash: (not calculated)
Created E:\temp\char.dat size 200,647,249 in 3.909 seconds. Hash: (not calculated)
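Conclusion: the SHA1 hash verification is really expensive, which is why I ran the versions with and without it. Basically, after warm-up the "efficient" version is only about 2x as fast (roughly 3.9 seconds versus 7.9 seconds without hashing). I guess by this time the file is effectively in memory.
If I reverse the order of the block and char reads, the result is: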
Created E:\temp\char.dat size 200,647,249 in 8.071 seconds. Hash: 0x22ce9e17e17a67e5ea6f8fe929d2ce4780e8ffa4
Created E:\temp\char.dat size 200,647,249 in 8.087 seconds. Hash: 0x22ce9e17e17a67e5ea6f8fe929d2ce4780e8ffa4
Created E:\temp\char.dat size 200,647,249 in 4.128 seconds. Hash: (not calculated)
Created E:\temp\char.dat size 200,647,249 in 3.918 seconds. Hash: (not calculated)
Created E:\temp\char.dat size 200,647,249 in 18.020 seconds. Hash: 0x22ce9e17e17a67e5ea6f8fe929d2ce4780e8ffa4
Created E:\temp\char.dat size 200,647,249 in 17.953 seconds. Hash: 0x22ce9e17e17a67e5ea6f8fe929d2ce4780e8ffa4
Created E:\temp\char.dat size 200,647,249 in 7.879 seconds. Hash: (not calculated)
Created E:\temp\char.dat size 200,647,249 in 8.016 seconds. Hash: (not calculated)
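Interestingly, the character-by-character version takes a much bigger initial hit on the first read of the file (29.7 seconds versus about 18 seconds on later hashed runs), whereas the block version shows no such penalty when it runs first.
So, as usual, it's a choice between efficiency and simplicity.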