Compare two files and output the differences (including line number and content) from both files: answers

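【Question title】: Compare two files & output differences (including Line Number and content) from both files
【Posted】: 2021-09-28 17:35:50
【Question】:

I am trying to get the differences between two files, with line numbers and content, written to another file or to stdout. I tried the following, but could not get the exact output I need. See below.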

File contents:

File1:

Col1,Col2,Col3
Text1,text1,text1
Text2,text2,Rubbish

File2:

Col1,Col2,Col3
Text1,text1,text1
Text2,text2,text2
Text3,text3,text3

I tried the following code. It does not give the desired output: it shows only the differing line from the first file, not the extra line in file2.

sort file1 file2 | uniq | awk 'FNR==NR{ a[$1]; next } !($1 in a) {print FNR": "$0}' file2 file1

Output

3: Text2,text2,Rubbish

Desired output

3: Text2,text2,Rubbish (File1)
3: Text2,text2,text2 (File2)
4: Text3,text3,text3 (File2)

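Because of their output format, I would rather not use diff/sdiff/comm: I cannot add line numbers or arrange the data side by side so that it is easy to read. Typical files run to more than 1000 lines, which makes diff/sdiff output even harder to read.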

【Comments】:

Tags: file awk compare

【Solution 1】:

With your shown samples, please try the following awk code. Written and tested in GNU awk.

awk '
BEGIN { OFS=": " }
FNR==1{ next     }
FNR==NR{
  arr[$0]=FNR
  next
}
!($0 in arr){
  print FNR,$0" ("FILENAME")"
  next
}
{
  arr1[$0]
}
END{
  for(key in arr){
    if(!(key in arr1)){
      print arr[key],key" ("ARGV[1]")"
    }
  }
}
' file1 file2

Explanation: the same program with a detailed comment on each line.

awk '                                   ##Starting awk program from here.
BEGIN { OFS=": " }                      ##Setting OFS to colon space in BEGIN section of this program.
FNR==1{ next     }                      ##Skipping the header line (FNR==1) of each file.
FNR==NR{                                ##Checking condition if FNR==NR (true while reading the first file) then do following.
  arr[$0]=FNR                           ##Creating arr indexed by the current line, with its line number (FNR) as value.
  next                                  ##Will skip all further statements from here.
}
!($0 in arr){                           ##If current line is NOT in arr(to get lines which are in file2 but not in file1)
  print FNR,$0" ("FILENAME")"           ##Printing as per OP request number with file name, line.
  next                                  ##Will skip all further statements from here.
}
{
  arr1[$0]                              ##Creating arr1 which has index as current line in it.
}
END{                                    ##Starting END section of this program from here.
  for(key in arr){                      ##Traversing through arr here.
    if(!(key in arr1)){                 ##If key is NOT present in arr1.
      print arr[key],key" ("ARGV[1]")"   ##Printing values of arr and first file name, basically getting lines which are present in file1 and NOT in file2.
    }
  }
}
' file1 file2                           ##Mentioning Input_file names here.
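The program above matches lines by content. It prints the lines found only in file2 as it reads that file and the lines found only in file1 from the END block, so the file1 lines come last and in no guaranteed order. If the files are meant to correspond line by line, a positional comparison gives the ordering shown in the question. The following is a minimal sketch of that different technique, not the answer's method. The function name `linediff` and the array `line1` are hypothetical, and the sketch assumes the first file is not empty (otherwise `FNR == NR` would also hold for the second file).

```shell
# Hypothetical helper (not from the original answer): compare two files
# at the same line number and label every differing line with its file.
linediff() {
    awk '
        # First file: store each line under its line number.
        # Appending "" forces a string value, so "1.0" and "1" count as different.
        FNR == NR { line1[FNR] = $0 ""; n1 = FNR; next }

        # Second file: compare against the first file at the same line number.
        {
            n2 = FNR
            if (FNR > n1) {
                # The line exists only in the second file.
                print FNR ": " $0 " (" FILENAME ")"
            } else if (($0 "") != line1[FNR]) {
                print FNR ": " line1[FNR] " (" ARGV[1] ")"
                print FNR ": " $0 " (" FILENAME ")"
            }
        }

        # Lines past the end of the second file exist only in the first file.
        END {
            for (i = n2 + 1; i <= n1; i++)
                print i ": " line1[i] " (" ARGV[1] ")"
        }
    ' "$1" "$2"
}
```

With the question's sample data saved as File1 and File2, `linediff File1 File2` prints the three lines of the desired output in the order given there. Unlike the content-based program, it also reports a header line that differs, and a single inserted line makes every later line count as changed.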

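【Discussion】:

Thank you very much for your help and the detailed explanation. I like both solutions, so it was hard to choose, but since yours is very detailed, clear, and uses only awk, I have accepted your answer.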

【Solution 2】:

You can get the desired output with GNU diff + awk:

funkydiff(){
   # configure diff to output:
   #    - 1 or 2 as flag character to identify file
   #    - space
   #    - line number formatted as integer
   #    - colon, space
   #    - the line content including terminating newline
   diff -d \
       --old-line-format='1 %dn: %L' \
       --new-line-format='2 %dn: %L' \
       --unchanged-line-format='' \
       -- "$1" "$2" \
   | awk '
      # p starts off null
      # initialise filename lookup table
      !p { f[1]=a; f[2]=b }

      {
         # discard first 2 characters of input line
         # append bracketed filename if flag character changed
         # print the result
         print substr($0,3) ($1!=p?" ("f[$1]")":"")

         # update p ready for next line
         p=$1
      }
   ' a="$1" b="$2"
}

funkydiff File1 File2

GNU diff does most of the hard work.

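awk inspects (and removes) the file-identifier prefix that diff adds to the start of each line and, where needed, appends the appropriate filename in parentheses.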
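To see what awk receives, it can help to run the diff stage on its own. The sketch below wraps the same diff options in a function; the name `rawdiff` is hypothetical, and the `--*-line-format` options assume GNU diff.

```shell
# Hypothetical helper: only the diff stage of funkydiff, without awk.
# Each output line starts with a flag (1 = old file, 2 = new file) and a
# space, then the line number, a colon, a space, and the line itself.
rawdiff() {
    diff -d \
        --old-line-format='1 %dn: %L' \
        --new-line-format='2 %dn: %L' \
        --unchanged-line-format='' \
        -- "$1" "$2"
}
```

For the question's sample files, `rawdiff File1 File2` prints `1 3: Text2,text2,Rubbish`, `2 3: Text2,text2,text2` and `2 4: Text3,text3,text3`, one per line. The awk stage reads the flag from `$1`, drops the first two characters with `substr($0,3)`, and looks the flag up in `f` to get the filename. Note that diff exits with status 1 when the files differ, which matters in scripts run with `set -e`.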

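Given your revised (simpler) requirement, which is to print the filename on every line rather than only on the first line of each group, the awk can be simplified: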

funkydiff2(){
   diff -d \
      --old-line-format='1 %dn: %L' \
      --new-line-format='2 %dn: %L' \
      --unchanged-line-format='' \
      -- "$1" "$2" \
   | awk '
      !p { f[1]=a; f[2]=b }

      # always append bracketed filename
      # update p as a side-effect (just for brevity)
      { print substr($0,3) " (" f[p=$1] ")" }
   ' a="$1" b="$2"
}

In fact, as long as the filenames contain no special % characters, awk is not needed at all:

funkydiff3(){
    case "$1$2" in
        *%*)
            echo 1>&2 "ERROR: funky filename. Aborting"
            return 1
            ;;
    esac

    # now $1 and $2 cannot contain the % metacharacter
    # %c'\012' produces newline
    # alternatively you could embed literal newlines
    diff -d \
        --old-line-format="%dn: %l ($1)%c'\012'" \
        --new-line-format="%dn: %l ($2)%c'\012'" \
        --unchanged-line-format="" \
        -- "$1" "$2"
}

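Because the filenames are embedded directly in the format strings passed to diff, any % they contain would change the format string into something unintended, make it malformed, or cause the filename to be mangled on output.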
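Instead of aborting, the filenames could be escaped before they are embedded. GNU diff's line formats treat `%%` as a literal `%`, so doubling every `%` in a name should be enough. The following is a sketch of that idea, not part of the original answer: the function name `funkydiff4` and the variables `old_name` and `new_name` are hypothetical, and filenames containing newlines are not handled.

```shell
# Hypothetical variant of funkydiff3: escape % in the filenames
# rather than rejecting them.
funkydiff4() {
    # In GNU diff line formats, %% stands for a literal %.
    old_name=$(printf '%s' "$1" | sed 's/%/%%/g')
    new_name=$(printf '%s' "$2" | sed 's/%/%%/g')

    # %c'\012' produces a newline, as in funkydiff3.
    diff -d \
        --old-line-format="%dn: %l ($old_name)%c'\012'" \
        --new-line-format="%dn: %l ($new_name)%c'\012'" \
        --unchanged-line-format="" \
        -- "$1" "$2"
}
```

The original, unescaped names are still passed to diff as the files to read; only the copies placed inside the format strings are escaped.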

The GNU diff manual page details the % sequences it allows in format strings.

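【Discussion】:

Thanks. I revised the desired output because I had forgotten to add the filename for line 3 of file2. Could you break down your awk code further? I removed the awk part and got similar output. I am new to awk but trying to use it more. 1 3: Text2,text2,Rubbish 2 3: Text2,text2,text2 2 4: Text3,text3,text3
Thank you very much for your help. I did not know diff was that powerful and could do what I need without awk. I used gnu.org/software/diffutils/manual/html_node/Line-Formats.html to learn more about line formats. Thanks again for your support. I can confirm that your solution works too.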