Split large CSV file based on column value cardinality: answers

【Question title】: Split large CSV file based on column value cardinality
【Posted】: 2015-09-03 16:52:20
【Question description】:

I have a large CSV file whose lines have the following format:

c1,c2

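I want to split the original file into two files, as follows:

One file will contain the lines whose c1 value appears exactly once in the file.
The other file will contain the lines whose c1 value appears two or more times in the file.

Any idea how to do this?

For example, if the original file is: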

1,foo
2,bar
3,foo
4,bar
2,foo
1,bar

I want to produce the following files:

3,foo
4,bar

and

1,foo
2,bar
2,foo
1,bar

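【Comments】:

How big is "large"? The only way you can really do this is to go through your file twice, because you won't know the count for a particular value until you have read the whole file. So you either need to read it twice or hold the whole thing in memory.
Also: how important is it to preserve the line order?

Tags: awk split large-files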

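【Solution 1】:

This one-liner produces two files, o1.csv and o2.csv. The output file name is parenthesized because an unparenthesized concatenation after > is not parsed the same way by every awk implementation:

awk -F, 'NR==FNR{a[$1]++;next}{print > ("o"(a[$1]==1?"1":"2")".csv")}' file file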

Test:

kent$  cat f
1,foo
2,bar
3,foo
4,bar
2,foo
1,bar

kent$  awk -F, 'NR==FNR{a[$1]++;next}{print > ("o"(a[$1]==1?"1":"2")".csv")}' f f

kent$  head o*
==> o1.csv <==
3,foo
4,bar

==> o2.csv <==
1,foo
2,bar
2,foo
1,bar

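Notes:

awk reads your file twice instead of holding the whole file in memory; it keeps only one counter per distinct c1 value.
The order of the lines is preserved.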
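
For readers unfamiliar with the NR==FNR idiom, the same two-pass approach can be spelled out more verbosely. This is only a sketch: the file names sample.csv, uniques.csv and dupes.csv are placeholders chosen for illustration and do not come from the answer above.

```shell
#!/bin/sh
# Build the sample input from the question (sample.csv is a placeholder name).
printf '%s\n' 1,foo 2,bar 3,foo 4,bar 2,foo 1,bar > sample.csv

# Start from empty outputs so that both files exist even if one gets no rows.
: > uniques.csv
: > dupes.csv

# The input file is named twice, so awk reads it twice.
awk -F, '
    # Pass 1: NR == FNR holds only while the first copy is being read.
    NR == FNR { count[$1]++; next }

    # Pass 2: route each row according to the final count of its c1 value.
    count[$1] == 1 { print > "uniques.csv"; next }
    { print > "dupes.csv" }
' sample.csv sample.csv
```

With the sample data, uniques.csv should end up holding the rows for c1 values 3 and 4, and dupes.csv the other four rows in their original order.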

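【Comments】:

【Solution 2】:

Depending on what you mean by large, this may work for you. It has to hold each line in an associative array, done, until it sees a second use of the same c1 value or reaches the end of the file. When a second use is seen, the remembered data is changed to "!" to avoid printing it again on the third and later matches. Lines whose c1 value repeats go to file1 and the remaining lines go to file2. Because the first line of a repeated value is printed only once its second occurrence turns up, file1 is not guaranteed to be in the original input order.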

>file2
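awk -F, '
{ if(done[$1]!=""){          # c1 has been seen before
    if(done[$1]!="!"){       # second occurrence: the first line is still held
     print done[$1]          # release the held first line to file1 (stdout)
     done[$1] = "!"          # mark c1 as already flushed
    }
    print                    # the repeated line itself goes to file1
  }else{
   done[$1] = $0             # first occurrence: hold the line
   order[++n] = $1           # remember the order in which keys first appeared
  }
}
END{
  # lines still held were seen only once; append them to file2 in input order
  for(i=1;i<=n;i++){
   out = done[order[i]]
   if(out!="!")print out >>"file2"
  }
}
' <csvfile >file1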
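
The same single-pass buffering idea can also be sketched with awk's in operator and delete, which avoids the "!" sentinel and frees a held line as soon as it has been written. This is a variant for illustration, not the answer's own code, and the names sample.csv, once.csv and repeated.csv are placeholders.

```shell
#!/bin/sh
# Sample input from the question; all file names here are placeholders.
printf '%s\n' 1,foo 2,bar 3,foo 4,bar 2,foo 1,bar > sample.csv
: > once.csv
: > repeated.csv

awk -F, '
    !($1 in seen) {            # first sighting: buffer the row, remember key order
        seen[$1] = 1
        held[$1] = $0
        keys[++n] = $1
        next
    }
    seen[$1] == 1 {            # second sighting: release the buffered row first
        print held[$1] > "repeated.csv"
        delete held[$1]
        seen[$1] = 2
    }
    { print > "repeated.csv" } # every repeat goes straight to the output
    END {
        # rows still buffered at end of input had a c1 value seen exactly once
        for (i = 1; i <= n; i++)
            if (keys[i] in held) print held[keys[i]] > "once.csv"
    }
' sample.csv
```

On the sample data, repeated.csv comes out as 2,bar, 2,foo, 1,foo, 1,bar, which shows the reordering that buffering causes; once.csv keeps input order. Memory use still grows with the number of distinct c1 values, since one row is held per value seen only once so far.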

【Comments】:

【Solution 3】:

I'd break out Perl for this job:

#!/usr/bin/env perl

use strict; 
use warnings;

my %count_of;
my @lines; 

open ( my $input, '<', 'your_file.csv' ) or die $!; 

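#read the whole file, counting each c1 value and keeping every line
while ( <$input> ) {
   #limit of 2 keeps the rest of the line (and its newline) intact in $c2
   my ( $c1, $c2 ) = split /,/, $_, 2;
   $count_of{$c1}++; 
   push ( @lines, [ $c1 , $c2 ] ); 
}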
close ( $input ); 

print "File 1:\n";
#filter any single elements
foreach my $pair ( grep { $count_of{$_ -> [0]} < 2 } @lines ) {
    print join (",", @$pair );
}

print "File 2:\n"; 
#filter any repeats. 
foreach my $pair ( grep { $count_of{$_ -> [0]} > 1 } @lines ) {
    print join (",", @$pair );
}

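This holds the whole file in memory, but given your data you would not save much space by processing it twice and only maintaining the counts.

That said, you could do this instead: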

#!/usr/bin/env perl

use strict;
use warnings;

my %count_of;

open( my $input, '<', 'your_file.csv' ) or die $!;

#read the whole file counting "c1"
while (<$input>) {
    my ( $c1, $c2 ) = split /,/;
    $count_of{$c1}++;
}

open( my $output_single, '>', "output_uniques.csv" ) or die $!;
open( my $output_dupe,   '>', "output_dupes.csv" )   or die $!;

seek( $input, 0, 0 );
while ( my $line = <$input> ) {
    my ($c1) = split( ",", $line );
    if ( $count_of{$c1} > 1 ) {
        print {$output_dupe} $line;
    }
    else {
        print {$output_single} $line;
    }
}

close($input);
close($output_single);
close($output_dupe);

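This minimises the memory footprint by retaining only the counts: it first reads the file to count the c1 values, then processes it a second time and prints each line to the appropriate output.

【Comments】: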
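
Whichever approach is used, the result can be sanity-checked from the shell. The sketch below assumes the output names output_uniques.csv and output_dupes.csv from the script above. The input name sample.csv is a placeholder, and the three files are written by hand here as stand-ins for a real run.

```shell
#!/bin/sh
# Stand-ins for the real files: the sample from the question and the two
# outputs the script would be expected to write for it.
printf '%s\n' 1,foo 2,bar 3,foo 4,bar 2,foo 1,bar > sample.csv
printf '%s\n' 3,foo 4,bar > output_uniques.csv
printf '%s\n' 1,foo 2,bar 2,foo 1,bar > output_dupes.csv

# Every input line should land in exactly one output file.
total=$(awk 'END { print NR }' sample.csv)
split_total=$(cat output_uniques.csv output_dupes.csv | awk 'END { print NR }')

# No c1 value should appear in both outputs.
overlap=$(awk -F, 'NR == FNR { seen[$1]; next } $1 in seen' \
    output_uniques.csv output_dupes.csv | awk 'END { print NR }')

if [ "$total" -eq "$split_total" ] && [ "$overlap" -eq 0 ]; then
    echo "split looks consistent"
else
    echo "split looks inconsistent" >&2
fi
```

The check only compares line counts and key overlap; it does not confirm that each line went to the correct file.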