How to get the two files with the maximum difference among a series of files

【Question title】: How to get two files having max difference among a series of files
【Posted】: 2017-02-25 09:11:14
【Question】:

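I have a series of .csv files containing columnar data (5 columns) separated by spaces. The file names follow the format "yyyymmdd.csv". For example, the files look like this: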

Contents of 20161201.csv

key value more columns (this line (header) is absent)
123456 10000 some value
123457 20000 some value
123458 30000 some value

Contents of 20161202.csv

key value more columns (this line (header) is absent)
123456 10000 some value
123457 80000 some value
123458 30000 some value

Contents of 20161203.csv

key value more columns (this line (header) is absent)
123456 50000 some value
123457 70000 some value
123458 30000 some value

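I want to compare the file dated "D" with the file dated "D+1", based on the value column. I am then interested in the two consecutive files that differ in the largest number of rows. For example, if I compare 20161201.csv with 20161202.csv, only the second row does not match: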

(123457 20000 some value and 123457 80000 some value, mismatched because of 20000 != 80000)

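If I compare 20161202.csv with 20161203.csv, I get 2 mismatched rows (the first and the second).

So 20161202.csv and 20161203.csv are my target files.

I am looking for a series of bash commands that can do this.

PS: The number of lines in a file is large (about 3000 lines), and you can assume that all files have the same year and month (number of files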

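【Comments】:

It is not clear what output is requested. Do you need output for every compared pair, or one big output of all mismatches (= unique lines) across, say, 30 files (one per day)?
Also, should the code you are looking for check that the two compared files are 1 day apart, or can that be taken for granted?
I want the one pair that corresponds to the maximum difference. My output will be the pair that corresponds to the maximum difference.
OK. And the file names? Should they be checked, or can we assume we have 30 consecutive files, so no file name check is needed?

Tags: linux bash csv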

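【Solution 1】:

If you do not need to check that the file names follow the date rule (the file for day D against the file for day D+1), you can do the following: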

while IFS= read -r -d '' fn;do files+=("$fn");done < <(find . -name '201612*.csv' -print0 | sort -z)
#Load all filenames in an array, sorted by name so that consecutive dates are adjacent (find alone does not guarantee any order).
#Using null separation we ensure that filenames will be handled correctly no matter if they do contain spaces or other special chars.

max=0
for ((i=0;i<"${#files[@]}"-1;i++));do #iterate through the filenames array
  a="${files[i]}";b="${files[i+1]}" #compare file1 with file2, file2 with file3, etc - in series
  differences=$(grep -v -Fw -f <(cut -d' ' -f2 "$a") <(cut -d' ' -f2 "$b") |wc -l)
  echo "comparing $a vs $b - non matching lines=$differences" #Just for testing - can be removed .
  [[ "$max" -lt "$differences" ]] && max="$differences" && ahold="$a" && bhold="$b" #When we have the max differences we keep the names of the files
done

echo "max differences found=$max between $ahold and $bhold" #reporting max differences and in which files found

The core of getting the non-matching lines between two files is grep. You can try grep by hand to check that the results are correct:

grep -v -F -w -f <(cut -d' ' -f2 file1) <(cut -d' ' -f2 file2)

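grep options:
-v: return the lines that do not match (inverts grep's normal behaviour)
-F: fixed-string matching, not regex
-w: word match, so that 5000 does not match 50000
-f: load the patterns from a file, here field 2 of file1. With these patterns we grep/search field 2 of file2.
wc -l: counts the lines grep prints, i.e. the non-matching lines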
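To see these options at work, here is a small sketch that recreates the three sample files from the question in a scratch directory and runs the same grep pipeline on them. The helper name count_new_values and the use of mktemp are my own choices for illustration, not part of the original answer.

```bash
#!/usr/bin/env bash
# Recreate the sample files from the question in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' '123456 10000 some value' '123457 20000 some value' '123458 30000 some value' > "$tmp/20161201.csv"
printf '%s\n' '123456 10000 some value' '123457 80000 some value' '123458 30000 some value' > "$tmp/20161202.csv"
printf '%s\n' '123456 50000 some value' '123457 70000 some value' '123458 30000 some value' > "$tmp/20161203.csv"

# count_new_values (hypothetical name) counts the field-2 values of file $2
# that do not occur anywhere in field 2 of file $1.
count_new_values() {
  grep -v -F -w -f <(cut -d' ' -f2 "$1") <(cut -d' ' -f2 "$2") | wc -l
}

d12=$(count_new_values "$tmp/20161201.csv" "$tmp/20161202.csv")
d23=$(count_new_values "$tmp/20161202.csv" "$tmp/20161203.csv")
echo "20161201 vs 20161202: $d12"
echo "20161202 vs 20161203: $d23"
rm -rf "$tmp"
```

On the sample data this should report 1 and 2, the same mismatch counts the question describes.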

Alternative solution using awk

You can use awk instead of grep, like this:

awk 'NR==FNR{a[$2];next}!($2 in a)' file1 file2

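This selects the same lines as the grep -v command above (awk prints the whole line rather than only field 2, so the line count is the same).

Field 2 ($2) of file1 is loaded into array a.
Lines of file2 whose field 2 ($2) is not in this array (the non-matching fields) are printed.

You can also pipe the output to |wc -l to count the non-matching lines, as with grep.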
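One thing to be aware of: both the grep and the awk one-liner treat field 2 of file1 as a plain set of values. A row whose value changes to a value that some other key already had in file1 is therefore not counted. If that matters for your data, a possible variant is to pair rows by key (field 1) instead. The sketch below assumes that each key appears at most once per file and that file1 is not empty, neither of which the question states; count_changed_rows is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# count_changed_rows (hypothetical name) counts rows of file $2 whose key
# (field 1) is missing from file $1 or whose value (field 2) differs.
# Assumes each key appears at most once per file and file $1 is not empty.
count_changed_rows() {
  awk 'NR==FNR { val[$1] = $2; next }          # first file: remember the value per key
       !($1 in val) || val[$1] != $2 { n++ }   # second file: new key or changed value
       END { print n + 0 }' "$1" "$2"
}

# Small demo: k1 changes to a value that k2 already had on day 1.
tmp=$(mktemp -d)
printf '%s\n' 'k1 100 x' 'k2 200 x' > "$tmp/day1.csv"
printf '%s\n' 'k1 200 x' 'k2 200 x' > "$tmp/day2.csv"

by_value=$(awk 'NR==FNR{a[$2];next}!($2 in a)' "$tmp/day1.csv" "$tmp/day2.csv" | wc -l)
by_key=$(count_changed_rows "$tmp/day1.csv" "$tmp/day2.csv")
echo "value-set comparison: $by_value, key-based comparison: $by_key"
rm -rf "$tmp"
```

In the demo the value-set comparison finds no difference, while the key-based one counts the changed k1 row. On the sample files from the question both approaches give the same counts.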

So if you prefer to use awk, this line:

differences=$(grep -v -Fw -f <(cut -d' ' -f2 "$a") <(cut -d' ' -f2 "$b") |wc -l)

must be changed to:

differences=$(awk 'NR==FNR{a[$2];next}!($2 in a)' "$a" "$b" |wc -l)

Either way, you need an array to hold the filenames and a loop that walks through the files and compares them in pairs.
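Putting the array and the loop together, a compact version could look like the sketch below. It is only one way to arrange it: max_diff_pair is a hypothetical function name, and instead of find it relies on the shell glob, whose results are sorted by name (which is date order for yyyymmdd.csv names). As above, no check is made that neighbouring files are exactly one day apart.

```bash
#!/usr/bin/env bash
# max_diff_pair (hypothetical name) prints "<count> <fileA> <fileB>" for the
# pair of neighbouring files in directory $1 with the most differing rows.
max_diff_pair() {
  local dir=$1 max=-1 best_a='' best_b='' i n
  local -a files
  # Glob results come back sorted by name, which is date order for yyyymmdd.csv.
  files=("$dir"/[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9].csv)
  for ((i = 0; i < ${#files[@]} - 1; i++)); do
    n=$(awk 'NR==FNR{a[$2];next}!($2 in a)' "${files[i]}" "${files[i+1]}" | wc -l)
    if (( n > max )); then          # strict ">" keeps the earliest pair on a tie
      max=$((n)); best_a=${files[i]}; best_b=${files[i+1]}
    fi
  done
  if (( max >= 0 )); then echo "$max $best_a $best_b"; fi
}

# Demo on the three sample files from the question, in a scratch directory.
demo=$(mktemp -d)
printf '%s\n' '123456 10000 some value' '123457 20000 some value' '123458 30000 some value' > "$demo/20161201.csv"
printf '%s\n' '123456 10000 some value' '123457 80000 some value' '123458 30000 some value' > "$demo/20161202.csv"
printf '%s\n' '123456 50000 some value' '123457 70000 some value' '123458 30000 some value' > "$demo/20161203.csv"
result=$(cd "$demo" && max_diff_pair .)
echo "$result"
rm -rf "$demo"
```

For the sample files this should name 20161202.csv and 20161203.csv with a count of 2, the pair the question asks for.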

【Comments】:

@HrishikeshGoyal Answer explained, and updated with an awk solution.

【Solution 2】:

Well, this was something of a challenge to implement.

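With the code below, based purely on awk (GNU awk, in fact), all we need is a starting point, i.e. a starting file1. awk then derives the next file2 automatically (by adding 1 day) and compares the two files for differing lines.

If a file is missing from the chain, the script readjusts the file names of file1 and file2 so that it keeps respecting the +1 day rule and only checks adjacent files for differing lines.

You should normally be able to run the script even by copy-pasting it (it runs in my bash even with the comments included), or you can save the code in a separate file (e.g. test.awk) and have awk load it with the -f switch (awk -f test.awk).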

awk -v file1="20161201.csv" \
'function incfile(file,days)                                        #function receives two arguments: file and days
    {
    match(file,/(....)(..)(..)/,fn);                                #splits the string of file to format fn[1]=YYYY,fn[2]=MM and fn[3]=DD
    newfile=sprintf("%s%s%02d%s",fn[1],fn[2],fn[3]+days,".csv");    #this function increase the filename by days variable
    return (newfile)                                                #i.e file 20161201.csv returns 20161201+days
    };
BEGIN \
{
    chkdays=1; 
    while (chkdays<=15)
    {
        {
        file2=incfile(file1,1);                                     #Built filename of file2 by increasing file1 +1 day
        if ((getline line < file2) < 0)                             #Check if file2 exists (getline returns -1 when the file cannot be opened)
            {
            print file1,"vs",file2,"skipped:",file2 "  not found";  #Print a help message - can be removed
            chkdays=chkdays+2;                                      #increase days counter for the while loop by 2
            file1=incfile(file1,2);                                 #Increase filename of file1 by 2 days (20161201 will be 20161203)
            file2=incfile(file2,2);                                 #The same for filename of file2 (20161202 will be 20161204)
            }
        else                                                        #if file2 exists
            {
            close(file2);                                           
            print "comparing",file1,"vs",file2; 
            while ((getline var < file1) > 0)                       #read from file1 a line and assign it to var; stop at EOF or if file1 cannot be read
                {split(var,ff1,OFS);a[ff1[2]]};                     #split line from file1 (var) to fields, and keep the field2 in an array as index
            while ((getline var2 < file2) > 0)
                {
                split(var2,ff2,OFS);                                #same for file2.split the line read (var2) 
                if (!(ff2[2] in a)) {print ">",var2;l=l+1};         #check if ff2[2] (file2-field2) is not found on the array created by file1-field2
                }
            if (l>maxd) {maxd=l;maxp=file1 " vs " file2};           #hold/save max different lines found and hold also the files that maxd was found
            delete a;l=0;close(file1);close(file2);                 #unset all necessary vars and close both files before file1 is reassigned
            file1=file2;                                            #Assign file2 to be file1 in order to repeat the loop
            chkdays=chkdays+1                                       #Increase check days counter by 1
            }
        }
    };                                                              #End of BEGIN section
    print "max different lines=",maxd,"found at pair:",maxp         #Print the results
}'                                                                  #Finished

Output:

comparing 20161201.csv vs 20161202.csv
> 123457 80000 some value
comparing 20161202.csv vs 20161203.csv
> 123456 50000 some value
> 123457 70000 some value
20161203.csv vs 20161204.csv skipped: 20161204.csv  not found
20161205.csv vs 20161206.csv skipped: 20161206.csv  not found
20161207.csv vs 20161208.csv skipped: 20161208.csv  not found
20161209.csv vs 20161210.csv skipped: 20161210.csv  not found
comparing 20161211.csv vs 20161212.csv
> 123457 80000 some value
> 123458 15000 some value
> 123458 16000 some value
> 123458 17000 some value
comparing 20161212.csv vs 20161213.csv
> 123456 50000 some value
> 123457 70000 some value
> 123458 20000 some value
> 123458 25000 some value
> 123458 35000 some value
20161213.csv vs 20161214.csv skipped: 20161214.csv  not found
comparing 20161215.csv vs 20161216.csv
max different lines= 5 found at pair: 20161212.csv vs 20161213.csv

$ cat 20161212.csv
123456 10000 some value
123457 80000 some value
123458 30000 some value
123458 15000 some value
123458 16000 some value
123458 17000 some value

$ cat 20161213.csv
123456 50000 some value
123457 70000 some value
123458 20000 some value
123458 15000 some value
123458 25000 some value
123458 35000 some value

# csv files 01,02,03 are copy paste from your OP. file 11 is a copy of file 01.

PS: You can remove all of the print statements from the awk code and keep only the final summary command.

I hope this code is useful and works well.
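A note on the incfile function above: it adds the number of days to the DD part only, so 20161231.csv plus one day becomes 20161232.csv. That is fine as long as all files share one month, as the question assumes. If the series ever crosses a month boundary, one option is to let GNU date do the calendar arithmetic. The sketch below assumes GNU date (its -d option) is available; next_day_file is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# next_day_file (hypothetical name) maps yyyymmdd.csv to the name for the next
# calendar day. Relies on GNU date's -d option; -u avoids DST surprises.
next_day_file() {
  local stamp=${1%.csv}                      # strip the .csv suffix
  date -u -d "$stamp +1 day" +%Y%m%d.csv
}

next_day_file 20161201.csv
next_day_file 20161231.csv
```

The first call stays within the month, the second rolls over into January of the next year.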

【Comments】: