【Question title】: Compare 4 files line by line to see if they match or don't match
【Posted】: 2019-11-04 16:23:07
【Question】:

I am trying to compare the per-line counts in 4 text files:

file1.txt:
32
44
75
22
88

file2.txt
32
44
75
22
88

file3.txt
11
44
75
22
77

file4.txt
    32
    44
    75
    22
    88

Each line corresponds to a label:

line1 = customerID count
line2 = employeeID count
line3 = active_users
line4 = inactive_users
line5 = deleted_users

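I am trying to compare file2.txt, file3.txt, and file4.txt against file1.txt; file1.txt will always have the correct counts.

Example: since file2.txt matches file1.txt exactly in the example above, I want to output "file2.txt is good". But since lines 1 and 5 of file3.txt do not match file1.txt, I want to output "customerID for file3.txt does not match by 21 records" (i.e. 32 - 11 = 21) and "deleted_users in file3.txt does not match by 11 records" (88 - 77 = 11).

If this is easier in shell, that is fine too.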

【Question comments】:

  • Can you provide a sample of the code you have tried?
  • Are the leading spaces in file4.txt intentional or relevant?

Tags: bash shell file perl


【Solution 1】:

One way to process the files in parallel, line by line:

use warnings;
use strict;
use feature 'say';

my @files = @ARGV;
#my @files = map { $_ . '.txt' } qw(f1 f2 f3 f4);  # my test files' names

# Open all files, filehandles in @fhs
my @fhs = map { open my $fh, '<', $_  or die "Can't open $_: $!"; $fh } @files;

# For reporting, enumerate file names
my %files = map { $_ => $files[$_] } 0..$#files;

# Process (compare) the same line from all files       
my $line_cnt;
LINE: while ( my @line = map { my $line = <$_>; $line } @fhs )
{
    defined || last LINE for @line;
    ++$line_cnt;
    s/(?:^\s+|\s+$)//g for @line;
    for my $i (1..$#line) {
        if ($line[0] != $line[$i]) { 
            say "File $files[$i] differs at line $line_cnt"; 
        }
    }
}

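This compares each whole line numerically, via != (after stripping leading and trailing whitespace), since it is given that every line carries a single number to compare.

With my test files named f1.txt, f2.txt, ..., it prints

File f3.txt differs at line 1
File f3.txt differs at line 5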

【Comments】:

    【Solution 2】:

    A mash-up of bash and (mostly GNU versions of) standard utilities such as diff, sdiff, and sed, plus the ifne utility from moreutils, and even an eval:

    f=("" "customerID count" "employeeID count" \
       "active_users" "inactive_users" "deleted_users")
    for n in file{2..4}.txt ; do 
        diff -qws file1.txt $n || 
        $(sdiff file1.txt $n | ifne -n exit | nl | 
          sed -n '/|/{s/[1-5]/${f[&]}/;s/\s*|\s*/-/;s/\([0-9-]*\)$/$((&))/;p}' | 
          xargs printf 'eval echo "%s for '"$n"' does not match by %s records.";\n') ; 
    done
    

    Output:

    Files file1.txt and file2.txt are identical
    Files file1.txt and file3.txt differ
    customerID count for file3.txt does not match by 21 records.
    deleted_users for file3.txt does not match by 11 records.
    Files file1.txt and file4.txt are identical
    

    The same code, tweaked for prettier output:

    f=("" "customerID count" "employeeID count" \
       "active_users" "inactive_users" "deleted_users")
    for n in file{2..4}.txt ; do 
        diff -qws file1.txt $n || 
        $(sdiff file1.txt $n | ifne -n exit | nl | 
          sed -n '/|/{s/[1-5]/${f[&]}/;s/\s*|\s*/-/;s/\([0-9-]*\)$/$((&))/;p}' | 
          xargs printf 'eval echo "%s does not match by %s records.";\n') ; 
    done  | 
    sed '/^Files/!s/^/\t/;/^Files/{s/.* and //;s/ are .*/ is good/;s/ differ$/:/}'
    

    Output:

    file2.txt is good
    file3.txt:
        customerID count does not match by 21 records.
        deleted_users does not match by 11 records.
    file4.txt is good
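
A similar report can probably be produced without ifne, sdiff, or eval. The following is only a minimal sketch in plain bash, assuming one integer per line and the label order given in the question; the names compare_counts and labels are made up for illustration and are not part of the answer above.

```shell
#!/usr/bin/env bash
# Sketch only: compare_counts and labels are illustrative names.
# Assumes every file holds one integer per line, in the question's order.
labels=("customerID count" "employeeID count" "active_users" "inactive_users" "deleted_users")

# Usage: compare_counts REFERENCE FILE...
compare_counts() {
    local ref_file=$1; shift
    local -a ref=() cur=()
    local file value i delta ok

    # read with the default IFS trims leading/trailing whitespace,
    # which handles the indented numbers in file4.txt.
    while read -r value || [[ -n $value ]]; do ref+=("$value"); done < "$ref_file"

    for file in "$@"; do
        cur=()
        while read -r value || [[ -n $value ]]; do cur+=("$value"); done < "$file"
        ok=1
        for i in "${!labels[@]}"; do
            # Arithmetic context reads the array elements as integers.
            delta=$(( ref[i] - cur[i] ))
            if (( delta < 0 )); then delta=$(( -delta )); fi
            if (( delta != 0 )); then
                echo "${labels[i]} for $file does not match by $delta records"
                ok=0
            fi
        done
        if (( ok )); then echo "$file is good"; fi
    done
}

# Example call:
#   compare_counts file1.txt file2.txt file3.txt file4.txt
```

With the sample files from the question, this should report file2.txt and file4.txt as good and print the two mismatch lines for file3.txt (21 and 11 records). Counts written with leading zeros would be read as octal by bash arithmetic, so this sketch is not suitable for such input.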
    

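    【Comments】:

      【Solution 3】:

      Read the first file into an array, then loop over the other files, reading each one into an array with the same function. Inside that loop, go through every line, compute the difference, and print a message using the text from @names if the difference is non-zero.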

      #!/usr/bin/perl
      
      use strict;
      use warnings;
      
      my @names = qw(customerID_count employeeID_count active_users inactive_users deleted_users);
      my @files = qw(file1.txt file2.txt file3.txt file4.txt);
      
      my @first = readfile($files[0]);
      
      for (my $i = 1; $i <= $#files; $i++) {
          print "\n$files[0] <=> $files[$i]:\n";
          my @second = readfile($files[$i]);
          for (my $j = 0; $j <= $#names; $j++) {
              my $diff = $first[$j] - $second[$j];
              $diff = -$diff if $diff < 0;
              if ($diff > 0) {
                  print "$names[$j] does not match by $diff records\n";
              }
          }
      }
      
      sub readfile {
          my ($file) = @_;
          open my $handle, '<', $file or die "Can't open $file: $!";
          chomp(my @lines = <$handle>);
          close $handle;
          s/\s+//g for @lines;   # strip all whitespace, e.g. the indentation in file4.txt
          return @lines;
      }
      

      The output is:

      file1.txt <=> file2.txt:
      
      file1.txt <=> file3.txt:
      customerID_count does not match by 21 records
      deleted_users does not match by 11 records
      
      file1.txt <=> file4.txt:
      

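      【Comments】:

        【Solution 4】:

        Store the line names in one array and the correct values in another. Then loop over the files, and for each file read its lines and compare them with the stored correct values. You can use the special variable $., which holds the current line number of the last accessed filehandle, as the array index. Line numbers start at 1 while array indices start at 0, so we need to subtract 1 to get the right index.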

        #!/usr/bin/perl
        use warnings;
        use strict;
        use feature qw{ say };
        
        my @line_names = ('customerID count',
                          'employeeID count',
                          'active_users',
                          'inactive_users',
                          'deleted_users');
        
        my @correct;
        open my $in, '<', shift or die $!;
        while (<$in>) {
            chomp;
            push @correct, $_;
        }
        
        while (my $file = shift) {
            open my $in, '<', $file or die $!;
            while (<$in>) {
                chomp;
                if ($_ != $correct[$. - 1]) {
                    say "$line_names[$. - 1] in $file does not match by ",
                        $correct[$. - 1] - $_, ' records';
                }
            }
        }
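
Since the question says shell would be fine too: awk keeps a per-file line counter, FNR, that can play a role similar to Perl's $. here. The following is only a sketch under the same assumptions (one integer per line, first argument is the reference file, which must not be empty); report_mismatches is a hypothetical name chosen for illustration.

```shell
#!/bin/sh
# Sketch only: report_mismatches is a hypothetical name.
# Usage: report_mismatches REFERENCE FILE...
report_mismatches() {
    awk '
        BEGIN {
            # Labels in the line order given in the question.
            split("customerID count,employeeID count,active_users,inactive_users,deleted_users", name, ",")
        }
        # While reading the first file, the global line number NR
        # equals the per-file line number FNR: store the correct values.
        NR == FNR { correct[FNR] = $1 + 0; next }
        # For later files, $1 skips leading blanks and "+ 0" forces a
        # numeric comparison against the stored value for the same line.
        ($1 + 0) != correct[FNR] {
            printf "%s in %s does not match by %d records\n", name[FNR], FILENAME, correct[FNR] - $1
        }
    ' "$@"
}

# Example call:
#   report_mismatches file1.txt file2.txt file3.txt file4.txt
```

As in the Perl code above, the difference is printed as correct value minus actual value, so it comes out negative when a file's count exceeds the reference. awk arrays are 1-based after split, so FNR can be used as the index directly, with no subtraction.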
        

        【Comments】:

          【Solution 5】:

          Here is an example in Perl:

          use feature qw(say);
          use strict;
          use warnings;
          
          {
              my $ref = read_file('file1.txt');
              my $N = 3;
              my @value_info;
              for my $i (1..$N) {
                  my $fn = 'file'.($i+1).'.txt';
                  my $values = read_file( $fn );
                  push @value_info, [ $fn, $values];
              }
              my @labels = qw(customerID employeeID active_users inactive_users deleted_users);
              for my $info (@value_info) {
                  my ( $fn, $values ) = @$info;
                  my $all_ok = 1;
                  my $j = 0;
                  for my $value (@$values) {
                      if ( $value != $ref->[$j] ) {
                          printf "%s: %s does not match by %d records\n",
                            $fn, $labels[$j], abs( $value - $ref->[$j] );
                          $all_ok = 0;
                      }
                      $j++;
                  }
                  say "$fn: is good" if $all_ok;
              }
          }
          
          sub read_file {
              my ( $fn ) = @_;
          
              my @values;
              open ( my $fh, '<', $fn ) or die "Could not open file '$fn': $!";
              while( my $line = <$fh>) {
                  if ( $line =~ /(\d+)/) {
                      push @values, $1;
                  }
              }
              close $fh;
              return \@values;
          }
          

          Output:

          file2.txt: is good
          file3.txt: customerID does not match by 21 records
          file3.txt: deleted_users does not match by 11 records
          file4.txt: is good
          

          【Comments】:
