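I would split the file into multiple files according to the hash code of each line: from 1x 50GB file make 1000x 50MB files. Identical lines always have the same hash code, so they always end up in the same file. Then process each file separately; it will fit into memory without problems.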
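    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;

    // Splits the input into `partitions` files; equal lines always land in the same file.
    protected static string[] Partition(string inputFileName, string outPath, int partitions)
    {
        string[] fileNames = Enumerable.Range(0, partitions)
            .Select(i => Path.Combine(outPath, "part" + i))
            .ToArray();
        StreamWriter[] writers = fileNames
            .Select(fn => new StreamWriter(fn))
            .ToArray();
        using (StreamReader file = new StreamReader(inputFileName))
        {
            string line;
            while ((line = file.ReadLine()) != null)
            {
                // The remainder lies strictly between -partitions and partitions,
                // so Math.Abs cannot overflow here.
                int partition = Math.Abs(line.GetHashCode() % partitions);
                writers[partition].WriteLine(line);
            }
        }
        // Closing flushes each writer's buffer; do it in parallel.
        writers.AsParallel().ForAll(c => c.Close());
        return fileNames;
    }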
    // Counts the occurrences of each line in one partition
    // and appends "line: count" records to writer.
    protected static void CountFile(string inputFileName, StreamWriter writer)
    {
        Dictionary<string, int> dict = new Dictionary<string, int>();
        using (StreamReader file = new StreamReader(inputFileName))
        {
            string line;
            while ((line = file.ReadLine()) != null)
            {
                // count stays 0 when the line has not been seen yet
                int count;
                dict.TryGetValue(line, out count);
                dict[line] = count + 1;
            }
        }
        foreach (var kv in dict)
        {
            writer.WriteLine(kv.Key + ": " + kv.Value);
        }
    }
    // Processes the partitions one at a time, so only one dictionary is in memory at once.
    protected static void CountFiles(string[] fileNames, string outFile)
    {
        using (StreamWriter writer = new StreamWriter(outFile))
        {
            foreach (var fileName in fileNames)
            {
                CountFile(fileName, writer);
            }
        }
    }
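    static void Main(string[] args)
    {
        // The output directory ./data/out must already exist.
        // 211 partitions, as in the hashing benchmark below.
        var fileNames = Partition("./data/random2g.txt", "./data/out", 211);
        CountFiles(fileNames, "./data/random2g.out");
    }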
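A side note on the partition index, since the order of operations matters there. The sketch below isolates that one expression; the class and method names (`PartitionIndex`, `Safe`, `Risky`) are made up for illustration and are not part of the code above.

```csharp
using System;

public static class PartitionIndex
{
    // Order used in Partition: reduce first, then take the absolute value.
    // The remainder lies strictly between -partitions and partitions,
    // so Math.Abs never sees int.MinValue.
    public static int Safe(int hash, int partitions)
    {
        return Math.Abs(hash % partitions);
    }

    // Reversed order, shown only for contrast: Math.Abs(int.MinValue)
    // throws OverflowException because +2147483648 does not fit in an int.
    public static int Risky(int hash, int partitions)
    {
        return Math.Abs(hash) % partitions;
    }
}
```

`GetHashCode` may return any `int`, including `int.MinValue`, so the reversed order would fail rarely and unpredictably on a large input.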
Benchmark
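I decided to compare the sorting approach (Leon's) with hashing. Sorting is quite a lot of work if you do not really need the sorted order. I made a file containing 2 billion numbers. The distribution (long)Math.Exp(rnd.NextDouble() * 30) generates numbers of all lengths (up to 14 digits) with roughly the same probability. It produces many unique values, but also values that repeat many times. Even the probabilities of the individual characters differ. Not bad for artificial data.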
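For anyone who wants to reproduce a similar input, here is a minimal sketch of a generator built around that expression. Only the expression itself comes from the benchmark; the class name `TestData`, the seed parameter, and the output path are assumptions for illustration.

```csharp
using System;
using System.IO;

public static class TestData
{
    // Writes `lines` numbers, one per line, drawn from the distribution above.
    public static void Generate(string path, long lines, int seed)
    {
        var rnd = new Random(seed);
        using (var writer = new StreamWriter(path))
        {
            for (long i = 0; i < lines; i++)
            {
                writer.WriteLine(NextValue(rnd));
            }
        }
    }

    public static long NextValue(Random rnd)
    {
        // NextDouble() is in [0, 1), so the exponent is in [0, 30)
        // and the result is in [1, e^30), i.e. 1 to 14 digits.
        return (long)Math.Exp(rnd.NextDouble() * 30);
    }
}
```

Something like `TestData.Generate("./data/random2g.txt", 2000000000L, 1)` would produce a file of the same shape, though not the same contents, as the one measured here.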
File size: 16.8GiB
Number of lines: 2G (=2000000000)
Number of distinct lines: 576M
Line occurrences: 1..46M, average: 3.5
Line length: 1..14, average: 7
Used characters: '0', '1',...,'9'
Character frequency: 8.8%..13%, average: 10%
Disk: SSD
Sorting results
10M lines in partition
10M distinct lines in partition
114 partitions
Partition size: 131MiB
Sum of partitions size: 14.6GiB
Partitioning time: 105min
Merging time: 180min
Total time: 285min (=4h 45min)
This approach saves space, because the partitions contain partially merged data.
Hashing results
7M..54M lines in partition, average: 9.5M
2723766..2732318 distinct lines in partition, average: 2.73M
211 partitions
Partition size: 73MiB..207MiB, average: 81MiB
Sum of partitions size: 16.8GiB
Partitioning time: 6min
Merging time: 15min
Total time: 21min
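Although the partitions differ in size, the number of distinct lines is almost the same in all of them. That means the hash function works as expected, and the memory needed to process each partition is the same. This is not guaranteed, though, so if high reliability is required, some fallback strategy has to be added for those cases (rehash the file into smaller ones, switch to sorting for that file, etc.). Most likely it would never actually be used, so it is not a problem from a performance point of view.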
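The "rehash into smaller files" fallback could look roughly like the sketch below. It was not part of the benchmark: the names `Fallback`, `Repartition`, and `SubIndex`, and the choice of `hash / partitions` as the second-level key, are my assumptions. The idea is that lines in one partition share the remainder of the hash, so the quotient is used to spread them again.

```csharp
using System;
using System.IO;
using System.Linq;

public static class Fallback
{
    // Splits one oversized partition into `subPartitions` smaller files
    // named <partitionFile>.0, <partitionFile>.1, ...
    // Equal lines still end up together, because the index depends only on the hash.
    public static string[] Repartition(string partitionFile, int partitions, int subPartitions)
    {
        string[] fileNames = Enumerable.Range(0, subPartitions)
            .Select(i => partitionFile + "." + i)
            .ToArray();
        StreamWriter[] writers = fileNames.Select(fn => new StreamWriter(fn)).ToArray();
        try
        {
            using (var reader = new StreamReader(partitionFile))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    int sub = SubIndex(line.GetHashCode(), partitions, subPartitions);
                    writers[sub].WriteLine(line);
                }
            }
        }
        finally
        {
            foreach (var w in writers) w.Dispose();
        }
        return fileNames;
    }

    // Uses the quotient rather than the remainder of the hash;
    // the inner remainder is in (-subPartitions, subPartitions), so Math.Abs is safe.
    public static int SubIndex(int hash, int partitions, int subPartitions)
    {
        return Math.Abs((hash / partitions) % subPartitions);
    }
}
```

The resulting files can then go through `CountFile` like any other partition. Note that this only helps when the partition has too many distinct lines; a partition that is large because one line repeats millions of times needs little memory anyway.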
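Hashing is faster than sorting by a factor of more than 10 (21min vs. 285min). On the other hand, some of that difference may stem from the inefficiency of Python itself.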