Compare 4 files line by line to see if they match

【Question title】: Compare 4 files line by line to see if they match or don't match
【Posted】: 2019-11-04 16:23:07
【Question】:

I am trying to compare the per-line counts in 4 text files:

file1.txt:
32
44
75
22
88

file2.txt
32
44
75
22
88

file3.txt
11
44
75
22
77

file4.txt
    32
    44
    75
    22
    88

Each line represents a heading:

line1 = customerID count
line2 = employeeID count
line3 = active_users
line4 = inactive_users
line5 = deleted_users

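I am trying to compare file2.txt, file3.txt and file4.txt against file1.txt; file1.txt will always have the correct counts.

Example: since file2.txt matches file1.txt exactly in the example above, I want to output "file2.txt is good". Since lines 1 and 5 of file3.txt do not match file1.txt, I want to output "customerID for file3.txt does not match by 21 records" (i.e. 32 - 11 = 21) and "deleted_users in file3.txt does not match by 11 records" (88 - 77 = 11).

If shell is simpler, that is fine too.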

【Question comments】:

Can you provide a sample of the code you have tried?
Are the leading spaces in file4.txt intentional or relevant?

Tags: bash shell file perl

【Solution 1】:

One way is to process the files in parallel, line by line:

use warnings;
use strict;
use feature 'say';

my @files = @ARGV;
#my @files = map { $_ . '.txt' } qw(f1 f2 f3 f4);  # my test files' names

# Open all files, filehandles in @fhs
my @fhs = map { open my $fh, '<', $_  or die "Can't open $_: $!"; $fh } @files;

# For reporting, enumerate file names
my %files = map { $_ => $files[$_] } 0..$#files;

# Process (compare) the same line from all files       
my $line_cnt;
LINE: while ( my @line = map { my $line = <$_>; $line } @fhs )
{
    defined || last LINE for @line;
    ++$line_cnt;
    s/(?:^\s+|\s+$)//g for @line;
    for my $i (1..$#line) {
        if ($line[0] != $line[$i]) { 
            say "File $files[$i] differs at line $line_cnt"; 
        }
    }
}

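This compares whole lines numerically, using != (after leading and trailing whitespace has been stripped), since it is given that each line holds a single number that needs to be compared.

With my test files named f1.txt, f2.txt, ..., it prints:

File f3.txt differs at line 1
File f3.txt differs at line 5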
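
Since the asker said shell would be fine too, the same lockstep idea can be sketched in plain shell. This is only an illustration, not a translation of the Perl above: the function name compare_lockstep is made up, it compares one file against the reference per call rather than all files at once, and it assumes every line holds a single decimal integer with no leading zeros.

```bash
#!/usr/bin/env bash
# compare_lockstep REF OTHER (hypothetical helper name):
# read one line from each file per iteration and report numeric mismatches.
compare_lockstep() {
    local ref=$1 other=$2 a b n=0
    # File descriptors 3 and 4 keep an independent read position in each file.
    while read -r a <&3 && read -r b <&4; do
        n=$((n + 1))
        # read with the default IFS already trims leading and trailing whitespace,
        # so the indented numbers in file4.txt compare equal to the plain ones.
        if [ "$a" -ne "$b" ]; then
            echo "File $other differs at line $n"
        fi
    done 3<"$ref" 4<"$other"
}
```

It could be called in a loop, for example: for f in f2.txt f3.txt f4.txt; do compare_lockstep f1.txt "$f"; done. The loop stops at the end of the shorter file, so extra trailing lines are not reported.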

【Comments】:

【Solution 2】:

A bash mash-up, mostly of the GNU versions of standard utilities such as diff, sdiff, sed, etc., plus the ifne util (from moreutils), and even an eval:

f=("" "customerID count" "employeeID count" \
   "active_users" "inactive_users" "deleted_users")
for n in file{2..4}.txt ; do 
    diff -qws file1.txt $n || 
    $(sdiff file1.txt $n | ifne -n exit | nl | 
      sed -n '/|/{s/[1-5]/${f[&]}/;s/\s*|\s*/-/;s/\([0-9-]*\)$/$((&))/;p}' | 
      xargs printf 'eval echo "%s for '"$n"' does not match by %s records.";\n') ; 
done

Output:

Files file1.txt and file2.txt are identical
Files file1.txt and file3.txt differ
customerID count for file3.txt does not match by 21 records.
deleted_users for file3.txt does not match by 11 records.
Files file1.txt and file4.txt are identical

The same code, tweaked for prettier output:

f=("" "customerID count" "employeeID count" \
   "active_users" "inactive_users" "deleted_users")
for n in file{2..4}.txt ; do 
    diff -qws file1.txt $n || 
    $(sdiff file1.txt $n | ifne -n exit | nl | 
      sed -n '/|/{s/[1-5]/${f[&]}/;s/\s*|\s*/-/;s/\([0-9-]*\)$/$((&))/;p}' | 
      xargs printf 'eval echo "%s does not match by %s records.";\n') ; 
done  | 
sed '/^Files/!s/^/\t/;/^Files/{s/.* and //;s/ are .*/ is good/;s/ differ$/:/}'

Output:

file2.txt is good
file3.txt:
    customerID count does not match by 21 records.
    deleted_users does not match by 11 records.
file4.txt is good
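
If moreutils and eval are not an option, roughly the same report can probably be produced with paste and awk alone. The sketch below is not the answerer's code: the function name report is made up for illustration, it assumes every line holds a single non-negative integer, and it prints the absolute difference rather than the signed one.

```bash
#!/usr/bin/env bash
# report REF FILE... (hypothetical helper name):
# pair each line of REF with the same line of FILE and let awk compare them.
report() {
    local ref=$1 f
    shift
    for f in "$@"; do
        paste "$ref" "$f" | awk -v file="$f" '
            BEGIN {
                # Labels taken from the list of headings in the question.
                split("customerID count|employeeID count|active_users|inactive_users|deleted_users", name, "|")
            }
            # Default field splitting ignores the indentation in file4.txt:
            # $1 is the reference count, $2 the count from the other file.
            $1 != $2 {
                if (!bad++) print file ":"
                d = $1 - $2
                printf "    %s does not match by %d records.\n", name[NR], (d < 0 ? -d : d)
            }
            END { if (!bad) print file " is good" }
        '
    done
}
```

Called as report file1.txt file2.txt file3.txt file4.txt, it should give one "is good" line per matching file and an indented list of mismatches otherwise. A blank line in the reference file would shift the fields, so this is only safe for input shaped like the question's.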

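【Comments】:

【Solution 3】:

Read the first file into an array, then loop over the other files, reading each one into an array with the same function. Inside this loop, go through each line, compute the difference, and print a message with the text from @names if the difference is not zero.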

#!/usr/bin/perl

use strict;
use warnings;

my @names = qw(customerID_count employeeID_count active_users inactive_users deleted_users);
my @files = qw(file1.txt file2.txt file3.txt file4.txt);

my @first = readfile($files[0]);

for (my $i = 1; $i <= $#files; $i++) {
    print "\n$files[0] <=> $files[$i]:\n";
    my @second = readfile($files[$i]);
    for (my $j = 0; $j <= $#names; $j++) {
        my $diff = $first[$j] - $second[$j];
        $diff = -$diff if $diff < 0;
        if ($diff > 0) {
            print "$names[$j] does not match by $diff records\n";
        }
    }
}

sub readfile {
    my ($file) = @_;
    open my $handle, '<', $file or die "Can't open $file: $!";
    chomp(my @lines = <$handle>);
    close $handle;
    s/\s+//g for @lines;    # strip all whitespace, e.g. the indentation in file4.txt
    return @lines;
}

The output is:

file1.txt <=> file2.txt:

file1.txt <=> file3.txt:
customerID_count does not match by 21 records
deleted_users does not match by 11 records

file1.txt <=> file4.txt:

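【Comments】:

【Solution 4】:

Store the line names in one array and the correct values in another. Then loop over the files, read each file's lines, and compare them with the stored correct values. You can use the special variable $., which holds the line number of the last accessed filehandle, as the array index. Lines are numbered from 1 while arrays are indexed from 0, so we need to subtract 1 to get the correct index.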

#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };

my @line_names = ('customerID count',
                  'employeeID count',
                  'active_users',
                  'inactive_users',
                  'deleted_users');

my @correct;
open my $in, '<', shift or die $!;
while (<$in>) {
    chomp;
    push @correct, $_;
}

while (my $file = shift) {
    open my $in, '<', $file or die $!;
    while (<$in>) {
        chomp;
        if ($_ != $correct[$. - 1]) {
            say "$line_names[$. - 1] in $file does not match by ",
                $correct[$. - 1] - $_, ' records';
        }
    }
}
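
For anyone who would rather stay in the shell, the same two-array approach might look like the bash sketch below. It is an illustration under assumptions, not a port that has been checked against the Perl line by line: the function name check_against is made up, bash arrays are required (plain sh will not do), and every line is assumed to hold a single decimal integer with no leading zeros.

```bash
#!/usr/bin/env bash
# Labels copied from the question's list of headings.
line_names=("customerID count" "employeeID count" "active_users" "inactive_users" "deleted_users")

# check_against REF FILE... (hypothetical helper name):
# report every line of each FILE whose count differs from the one in REF.
check_against() {
    local ref=$1 file value lineno idx
    local -a correct=()
    shift
    # Load the reference counts; read trims surrounding whitespace.
    while read -r value; do
        correct+=("$value")
    done < "$ref"
    for file in "$@"; do
        lineno=0
        while read -r value; do
            lineno=$((lineno + 1))   # plays the role of Perl's $.
            idx=$((lineno - 1))      # lines are 1-based, array indices 0-based
            if [[ $value -ne ${correct[idx]} ]]; then
                echo "${line_names[idx]} in $file does not match by $(( ${correct[idx]} - value )) records"
            fi
        done < "$file"
    done
}
```

As in the Perl version, the reported difference is signed (reference minus actual), so a file with a count higher than the reference would show a negative number.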

【Comments】:

【Solution 5】:

Here is an example in Perl:

use feature qw(say);
use strict;
use warnings;

{
    my $ref = read_file('file1.txt');
    my $N = 3;
    my @value_info;
    for my $i (1..$N) {
        my $fn = 'file'.($i+1).'.txt';
        my $values = read_file( $fn );
        push @value_info, [ $fn, $values];
    }
    my @labels = qw(customerID employeeID active_users inactive_users deleted_users);
    for my $info (@value_info) {
        my ( $fn, $values ) = @$info;
        my $all_ok = 1;
        my $j = 0;
        for my $value (@$values) {
            if ( $value != $ref->[$j] ) {
                printf "%s: %s does not match by %d records\n",
                  $fn, $labels[$j], abs( $value - $ref->[$j] );
                $all_ok = 0;
            }
            $j++;
        }
        say "$fn: is good" if $all_ok;
    }
}

sub read_file {
    my ( $fn ) = @_;

    my @values;
    open ( my $fh, '<', $fn ) or die "Could not open file '$fn': $!";
    while( my $line = <$fh>) {
        if ( $line =~ /(\d+)/) {
            push @values, $1;
        }
    }
    close $fh;
    return \@values;
}

Output:

file2.txt: is good
file3.txt: customerID does not match by 21 records
file3.txt: deleted_users does not match by 11 records
file4.txt: is good

【Comments】: