【Question Title】: UNIX - Compare 2 files based on primary column
【Posted】: 2019-08-09 17:20:57
【Question】:

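I need to compare 2 files column by column based on a primary key (the primary key can be a single column or several columns). It should generate 3 csv files as output: differences, extra records in file1, and extra records in file2.

Note: I tried sdiff, but it does not give the required output.

Example:

Here the first column is the primary key.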

file1 :
abc 234 123
bcd 567 890
cde 678 789

file2 :
abc 234 012
bcd 532 890
cdf 678 789

Output files

differences file :
abc,234,123::012
bcd,567::532,890

extra records in file1 :
cde,678,789

extra records in file2
cdf,678,789   

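【Comments】:

  • Check out the comm binary, a standard Unix command-line tool: comm --help
  • comm will give the unmatched records, but it will not highlight which column the difference is in.
  • You could first convert the data to "long format"; see the R tidyverse documentation for details.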
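The limitation raised in the comments can be seen with a short experiment. comm compares whole lines, while join pairs lines by key, so join is the closer fit for the "extra records" part. What follows is a minimal sketch, assuming whitespace-separated input with the key in column 1; the file names (file1.sorted, only_in_file1.csv, and so on) are made up for illustration.

```shell
#!/bin/sh
# Use one collation for both sort and join so they agree on ordering.
export LC_ALL=C

# Sample data from the question; join requires both inputs sorted on the key.
printf '%s\n' 'abc 234 123' 'bcd 567 890' 'cde 678 789' | sort -k1,1 > file1.sorted
printf '%s\n' 'abc 234 012' 'bcd 532 890' 'cdf 678 789' | sort -k1,1 > file2.sorted

# -v N prints the lines of file N whose key has no match in the other file.
join -v 1 file1.sorted file2.sorted | tr ' ' ',' > only_in_file1.csv
join -v 2 file1.sorted file2.sorted | tr ' ' ',' > only_in_file2.csv

# Keys present in both files: the two versions are printed side by side,
# but nothing marks which column actually differs.
join file1.sorted file2.sorted
```

This covers two of the three requested outputs. The per-column differences still need a script such as the ones in the answers.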

Tags: perl unix awk nawk


【Solution 1】:

If the files fit comfortably in memory, this is easy to do with hashes in Perl. For example:

#!/bin/bash

# create test data files
>cmp.d1 cat <<'EOD'
abc 234 123
bcd 567 890
cde 678 789
EOD
>cmp.d2 cat <<'EOD'
abc 234 012
bcd 532 890
cdf 678 789
EOD

# create script
>dif.pl cat <<'EOD'
#!/usr/bin/perl -w

if ( $#ARGV!=0 or ! -f "$ARGV[0]" ) {
    die "Usage: <file2 filter file1\n";
}

@KEYS = ( 0 ); # list of columns to use for primary key

# read file1 from filename given on commandline
# (the <<>> operator needs Perl 5.22 or later; plain <> works on older versions)
while (<<>>) {
    chomp;
    @a1 = (split); # split line into individual fields
    $k = join "\0", @a1[ @KEYS ];

    # if $k is not unique, only final line is kept
    warn "duplicate key: $k\n" if exists $h1{$k};

    # store line in %h1 for later use
    $h1{$k} = [ @a1 ];
}

# now read file2 from stdin (explicit <STDIN>, independent of the file1 loop above)
# process each line as we read it
while (<STDIN>) {
    chomp;
    @a2 = (split); # split line into individual fields
    $k = join "\0", @a2[ @KEYS ];

    if ( exists $h1{$k} ) {
        # record exists in both files
        # calculate differences 

        @a1 = @{ $h1{$k} }; # retrieve file1 version

        # overwrite any difference fields in @a2
        map {
            $a1 = shift @a1;
            $_ = "${a1}::$_" if $a1 ne $_;
        } @a2;

        # save difference records in %hd
        $hd{$k} = [ @a2 ];

        # this will not be an extra file1 record
        delete $h1{$k};
    }
    else {
        # this record only exists in file2
        $h2{$k} = [ @a2 ];
    }
}

# format record as csv line
sub print_csv {
    print join(",", @{ $_ }), "\n";
}

print "differences file :\n";
print_csv for values %hd;
print "\n";

print "extra records in file1 :\n";
print_csv for values %h1;
print "\n";

print "extra records in file2\n";
print_csv for values %h2;

EOD

# try it out
perl dif.pl cmp.d1 <cmp.d2

Output:

differences file :
bcd,567::532,890
abc,234,123::012

extra records in file1 :
cde,678,789

extra records in file2
cdf,678,789

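Note: csv output typically does not need to be sorted, so this code does no sorting. Because Perl hashes are unordered, the records within each section may appear in any order.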
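Since the question asks for three separate csv files and is also tagged awk, the same hash-based idea can be sketched in awk. This is a minimal sketch rather than the answer's own method. The output file names (differences.csv, extra_file1.csv, extra_file2.csv) are assumptions, and the key is taken to be column 1. Unlike the Perl script, it leaves records that are identical in both files out of the differences file.

```shell
#!/bin/sh
# Sample data from the question (first column is the primary key).
printf '%s\n' 'abc 234 123' 'bcd 567 890' 'cde 678 789' > file1
printf '%s\n' 'abc 234 012' 'bcd 532 890' 'cdf 678 789' > file2

awk -v OFS=',' '
    # First file: remember each line under its key.
    # (The NR == FNR idiom assumes file1 is not empty.)
    NR == FNR { rec[$1] = $0; next }

    # Second file: compare against the stored file1 record, if any.
    {
        if ($1 in rec) {
            n = split(rec[$1], old)          # file1 fields for this key
            line = ""; changed = 0
            for (i = 1; i <= NF || i <= n; i++) {
                a = (i <= n)  ? old[i] : ""
                b = (i <= NF) ? $i     : ""
                # Appending "" forces a string comparison, so 012 and 12 differ.
                if ((a "") != (b "")) { cell = a "::" b; changed = 1 }
                else                  { cell = a }
                line = line (i > 1 ? "," : "") cell
            }
            if (changed) print line > "differences.csv"
            delete rec[$1]                   # not an extra file1 record
        } else {
            $1 = $1                          # rebuild the record using OFS
            print > "extra_file2.csv"
        }
    }

    # Whatever is left in rec exists only in file1.
    END {
        for (k in rec) {
            n = split(rec[k], f)
            line = f[1]
            for (i = 2; i <= n; i++) line = line "," f[i]
            print line > "extra_file1.csv"
        }
    }
' file1 file2
```

An output file is only created when at least one record is written to it. For a multi-column key, the array index `$1` could be replaced by a compound subscript such as `$1,$2` wherever it is used.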

【Comments】:

    【Solution 2】:

    Try this command-line Perl:

    perl -lane ' @t=@{$kv1{$F[0]}}; push(@t,$_); $kv1{$F[0]} = [@t];
    if( defined($kv2{$F[0]}) ) {  $kv2{$F[0]} = "Both" } else { $kv2{$F[0]} =$ARGV; $kv3{$F[0]}=$_; }
    END { 
    
     for my $c (keys %kv2) 
     { 
       if($kv2{$c} eq "Both") { $d1++ or print "differences file :";  
    
       @t=@{$kv1{$c}}; @s1=split(" ",$t[0]); @s2=split(" ",$t[1]);
       $a2= $s1[1] eq $s2[1] ? $s1[1] : $s1[1]. "::". $s2[1];
       $a3= $s1[2] eq $s2[2] ? $s1[2] : $s1[2]. "::". $s2[2];
       print $s1[0],",",$a2,",",$a3;
    
       }
     }
    
     for my $c (keys %kv2) 
     { 
       if($kv2{$c} eq "file1") { $d2++ or print "\nextra records in file1 :";  print $kv3{$c} }
     }
    
     for my $c (keys %kv2) 
     { 
       if($kv2{$c} eq "file2") { $d3++ or print "\nextra records in file2 :";  print $kv3{$c} }
     }
    
    }
    ' file1 file2
    

    Result:

    differences file :
    bcd,567::532,890
    abc,234,123::012
    
    extra records in file1 :
    cde 678 789
    
    extra records in file2 :
    cdf 678 789
    

    【Comments】:
