我会为这份工作打破 Perl
#!/usr/bin/env perl
use strict;
use warnings;
my %count_of;
my @lines;
open ( my $input, '<', 'your_file.csv' ) or die $!;
#read the whole file
while ( <$input> ) {
my ( $c1, $c2 ) = split /,/;
$count_of{$c1}++;
push ( @lines, [ $c1 , $c2 ] );
}
close ( $input );
print "File 1:\n";
#filter any single elements
foreach my $pair ( grep { $count_of{$_ -> [0]} < 2 } @lines ) {
print join (",", @$pair );
}
print "File 2:\n";
#filter any repeats.
foreach my $pair ( grep { $count_of{$_ -> [0]} > 1 } @lines ) {
print join (",", @$pair );
}
这会将整个文件保存在内存中,但考虑到您的数据 - 通过双重处理和维护计数不会节省太多空间。
你可以做什么:
#!/usr/bin/env perl
use strict;
use warnings;
my %count_of;
open( my $input, '<', 'your_file.csv' ) or die $!;
#read the whole file counting "c1"
while (<$input>) {
my ( $c1, $c2 ) = split /,/;
$count_of{$c1}++;
}
open( my $output_single, '>', "output_uniques.csv" ) or die $!;
open( my $output_dupe, '>', "output_dupes.csv" ) or die $!;
seek( $input, 0, 0 );
while ( my $line = <$input> ) {
my ($c1) = split( ",", $line );
if ( $count_of{$c1} > 1 ) {
print {$output_dupe} $line;
}
else {
print {$output_single} $line;
}
}
close($input);
close($output_single);
close($output_dupe);
这将通过仅保留计数来最小化内存占用 - 它首先读取文件以计算 c1 值,然后再次处理它并将行打印到不同的输出。