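Well, this was a challenge to implement.
With the code below, which is pure awk (GNU awk, to be precise), all we need is a starting point: the first file, file1. awk then works out the next file, file2, by adding one day, and compares the two files for differing lines.
If a file is missing from the chain, the script readjusts the names of file1 and file2 so that it keeps checking adjacent files under the +1 day rule.
You should be able to run the script with a plain copy-paste (it runs in my bash even with the comments included), or you can save the code in a separate file (e.g. test.awk) and load it with awk's -f switch (awk -f test.awk).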
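The missing-file detection relies on the return value of getline: 1 when a record was read, 0 at end of file, and -1 when the file cannot be opened. As a minimal sketch of that check in isolation (the helper name file_status is made up for this illustration and is not part of the script):

```shell
# file_status is a hypothetical helper, used here only to show the getline check
file_status() {
  awk -v f="$1" 'BEGIN {
      # getline returns 1 for a record, 0 at end of file, -1 if f cannot be opened
      if ((getline line < f) < 0)
          print f, "not found"
      else
          print f, "found"
      close(f)
  }'
}

file_status /nonexistent/20161204.csv   # prints: /nonexistent/20161204.csv not found
```

Note that an existing but empty file returns 0, so this check treats it as found.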
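To illustrate the -f variant: the file holds only the awk program itself (without the surrounding awk -v ... ' and the closing quote), and the starting file is still passed with -v. A minimal sketch, where the one-line program is just a stand-in for the real script body:

```shell
script=$(mktemp)
cat > "$script" <<'EOF'
# stand-in for the real program body, for illustration only
BEGIN { print "starting from", file1 }
EOF

# -f loads the program from the file, -v sets file1 before BEGIN runs
awk -v file1="20161201.csv" -f "$script"   # prints: starting from 20161201.csv
```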
awk -v file1="20161201.csv" \
'function incfile(file,days)   # receives two arguments: file and days
{
    match(file,/(....)(..)(..)/,fn);   # gawk-only 3-argument match: fn[1]=YYYY, fn[2]=MM, fn[3]=DD
    newfile=sprintf("%s%s%02d%s",fn[1],fn[2],fn[3]+days,".csv");   # adds days to the DD part only (no month rollover)
    return (newfile)   # e.g. 20161201.csv with days=1 returns 20161202.csv
}
BEGIN \
{
    chkdays=1;
    while (chkdays<=15)
    {
        file2=incfile(file1,1);   # build the name of file2 as file1 + 1 day
        if ((getline < file2) < 0)   # getline returns -1 if file2 cannot be opened, i.e. it does not exist
        {
            print file1,"vs",file2,"skipped:",file2 " not found";   # help message - can be removed
            chkdays=chkdays+2;   # advance the day counter of the while loop by 2
            file1=incfile(file1,2);   # move file1 forward by 2 days (20161201 becomes 20161203)
            file2=incfile(file2,2);   # same for file2 (20161202 becomes 20161204)
        }
        else   # file2 exists
        {
            close(file2);   # rewind file2 after the existence check
            print "comparing",file1,"vs",file2;
            while ((getline var < file1) > 0)   # read file1 line by line into var; >0 also stops if file1 cannot be read
                {split(var,ff1,FS);a[ff1[2]]};   # split the file1 line into fields and keep field 2 as an array index
            while ((getline var2 < file2) > 0)
            {
                split(var2,ff2,FS);   # same for file2: split the line read (var2)
                if (!(ff2[2] in a)) {print ">",var2;l=l+1};   # field 2 of file2 not found among field 2 of file1
            }
            if (l>maxd) {maxd=l;maxp=file1 " vs " file2};   # remember the highest count of different lines and its pair of files
            close(file1);close(file2);   # close both files so that file2 can be re-read as the next file1
            file1=file2;   # file2 becomes file1 for the next iteration
            chkdays=chkdays+1;   # increase the day counter by 1
            delete a;l=0   # reset the array and the counter
        }
    };   # end of the while loop
    print "max different lines=",maxd,"found at pair:",maxp   # print the results
}'   # finished
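One limitation worth knowing: incfile adds the offset to the DD part only, so 20161231.csv would become 20161232.csv. If your files cross a month boundary, the increment needs calendar arithmetic. Here is a rough sketch in portable awk; the names next_file and dim are made up for this illustration, and it assumes names of the form YYYYMMDD.csv and a positive number of days:

```shell
# next_file and dim are hypothetical names, not part of the script above
next_file() {
  awk -v file="$1" -v days="$2" '
  function dim(y, m) {   # number of days in month m of year y
      if (m == 2)
          return ((y % 4 == 0 && (y % 100 != 0 || y % 400 == 0)) ? 29 : 28)
      return ((m == 4 || m == 6 || m == 9 || m == 11) ? 30 : 31)
  }
  BEGIN {
      y = substr(file, 1, 4) + 0
      m = substr(file, 5, 2) + 0
      d = substr(file, 7, 2) + days
      while (d > dim(y, m)) {   # carry overflowing days into the next month or year
          d -= dim(y, m)
          if (++m > 12) { m = 1; y++ }
      }
      printf "%04d%02d%02d.csv\n", y, m, d
  }'
}

next_file 20161231.csv 1   # prints: 20170101.csv
```

The same loop could replace the sprintf line inside incfile; gawk's mktime and strftime would be another way to do it.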
Output of the full script on the test files:
comparing 20161201.csv vs 20161202.csv
> 123457 80000 some value
comparing 20161202.csv vs 20161203.csv
> 123456 50000 some value
> 123457 70000 some value
20161203.csv vs 20161204.csv skipped: 20161204.csv not found
20161205.csv vs 20161206.csv skipped: 20161206.csv not found
20161207.csv vs 20161208.csv skipped: 20161208.csv not found
20161209.csv vs 20161210.csv skipped: 20161210.csv not found
comparing 20161211.csv vs 20161212.csv
> 123457 80000 some value
> 123458 15000 some value
> 123458 16000 some value
> 123458 17000 some value
comparing 20161212.csv vs 20161213.csv
> 123456 50000 some value
> 123457 70000 some value
> 123458 20000 some value
> 123458 25000 some value
> 123458 35000 some value
20161213.csv vs 20161214.csv skipped: 20161214.csv not found
comparing 20161215.csv vs 20161216.csv
max different lines= 5 found at pair: 20161212.csv vs 20161213.csv
$ cat 20161212.csv
123456 10000 some value
123457 80000 some value
123458 30000 some value
123458 15000 some value
123458 16000 some value
123458 17000 some value
$ cat 20161213.csv
123456 50000 some value
123457 70000 some value
123458 20000 some value
123458 15000 some value
123458 25000 some value
123458 35000 some value
# The csv files 01, 02 and 03 are copied from your original post; file 11 is a copy of file 01.
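If you want to try the core comparison on a single pair without the date handling, something like the following should do. It prints the lines of the second file whose field 2 never appears as field 2 in the first file. The helper name diff_field2 and the temporary file names are made up for this sketch, the data is copied from the two listings above, and it uses the common NR==FNR two-file idiom instead of getline (that idiom assumes the first file is not empty):

```shell
# diff_field2 is a hypothetical helper for this illustration
diff_field2() {
  awk '
  NR == FNR { seen[$2]; next }            # first file: remember every field-2 value
  !($2 in seen) { print ">", $0; n++ }    # second file: field 2 not seen in the first file
  END { print "different lines=", n + 0 }
  ' "$1" "$2"
}

tmp=$(mktemp -d)
printf '%s\n' '123456 10000 some value' '123457 80000 some value' \
              '123458 30000 some value' '123458 15000 some value' \
              '123458 16000 some value' '123458 17000 some value' > "$tmp/day12.csv"
printf '%s\n' '123456 50000 some value' '123457 70000 some value' \
              '123458 20000 some value' '123458 15000 some value' \
              '123458 25000 some value' '123458 35000 some value' > "$tmp/day13.csv"

diff_field2 "$tmp/day12.csv" "$tmp/day13.csv"   # last line printed: different lines= 5
```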
PS: You can remove all the print statements from the awk code and keep only the final summary print.
Hope this code is useful and works well for you.