How to count differences between two files on Linux?

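【Question title】: How to count differences between two files on linux?
【Posted】: 2010-12-06 16:35:59
【Question】:

I need to work with large files and must find the differences between two of them. I don't need the differing bits themselves, only the number of differences.

To find the number of differing lines I came up with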

diff --suppress-common-lines --speed-large-files -y File1 File2 | wc -l

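It works, but is there a better way?

And how can I count the exact number of differences (with standard tools such as bash, diff, awk, sed, or some old version of perl)?

【Question comments】:

Where in the question does it say he wants to count line differences rather than character differences? I see "bits" and "exact number of differences", but "lines" is only what he tried.

Tags: shell count diff

【Solution 1】:

Here is a way to count any kind of difference between two files, with a regex that specifies what counts as a unit of difference. Here . stands for any character except newline: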

git diff --patience --word-diff=porcelain --word-diff-regex=. file1 file2 | pcre2grep -M "^@[\s\S]*" | pcre2grep -M --file-offsets "(^-.*\n)(^\+.*\n)?|(^\+.*\n)" | wc -l

From man git-diff:

--patience
           Generate a diff using the "patience diff" algorithm.
--word-diff[=<mode>]
           Show a word diff, using the <mode> to delimit changed words. By default, words are delimited by whitespace; see --word-diff-regex below.
           porcelain
               Use a special line-based format intended for script consumption. Added/removed/unchanged runs are printed in the usual unified diff
               format, starting with a +/-/` ` character at the beginning of the line and extending to the end of the line. Newlines in the input
               are represented by a tilde ~ on a line of its own.
--word-diff-regex=<regex>
           Use <regex> to decide what a word is, instead of considering runs of non-whitespace to be a word. Also implies --word-diff unless it
           was already enabled.
           Every non-overlapping match of the <regex> is considered a word. Anything between these matches is considered whitespace and ignored(!)
           for the purposes of finding differences. You may want to append |[^[:space:]] to your regular expression to make sure that it matches
           all non-whitespace characters. A match that contains a newline is silently truncated(!) at the newline.
           For example, --word-diff-regex=.  will treat each character as a word and, correspondingly, show differences character by character.

pcre2grep is part of the pcre2-utils package on Ubuntu 20.04.
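
For readers without git or pcre2grep, the same idea (counting runs of changed characters) can be sketched in Python. This is only an illustration, not the pipeline above: it uses difflib rather than the patience algorithm, it compares the whole text including newlines, and the function name count_char_differences is made up for this example. difflib is also slow on large inputs, so it is unlikely to suit the big files from the question.

```python
from difflib import SequenceMatcher


def count_char_differences(text_a: str, text_b: str) -> int:
    """Count contiguous runs of changed characters between two strings."""
    # autojunk=False disables difflib's heuristic that ignores very
    # frequent elements in long sequences, which would skew the count.
    matcher = SequenceMatcher(None, text_a, text_b, autojunk=False)
    # Every opcode that is not "equal" is one replace/delete/insert run.
    return sum(1 for tag, *_ in matcher.get_opcodes() if tag != "equal")


if __name__ == "__main__":
    print(count_char_differences("abcdef", "abXdeY"))  # two changed runs
```

A run of several adjacent changed characters counts as one difference here, which roughly mirrors how the regex above treats a "-" line followed by a "+" line as a single match.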

【Comments】:

【Solution 2】:

I believe the correct solution is in this answer, i.e.:

$ diff -y --suppress-common-lines a b | grep '^' | wc -l
1

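【Comments】:

【Solution 3】:

If you are dealing with files of similar content whose lines should correspond one to one (for example CSV files describing similar things), and you want to find, say, the 2 differences in the following files: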

File a:    File b:
min,max    min,max
1,5        2,5
3,4        3,4
-2,10      -1,1

you could implement it in Python like this:

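from itertools import zip_longest

file1, file2 = "a.csv", "b.csv"  # placeholder paths, adjust as needed

different_lines = 0
with open(file1) as a, open(file2) as b:
    # zip_longest also counts extra trailing lines in the longer file
    for line, other_line in zip_longest(a, b):
        if line != other_line:
            different_lines += 1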

【Comments】:

【Solution 4】:

Since every differing output line starts with a < or > character, I would suggest this:

diff file1 file2 | grep ^[\>\<] | wc -l

By using only \< or only \> in the command, you can count the differences for just one of the files.
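
What this count means depends on the definition of "difference": a changed line produces one "<" line and one ">" line. The following Python sketch makes the three possible counts explicit. It is an assumption-laden illustration: it uses difflib, which may group changes differently from GNU diff, the function name line_difference_counts is invented here, and the sample data is the CSV example from elsewhere in this thread.

```python
from difflib import SequenceMatcher


def line_difference_counts(lines_a, lines_b):
    """Return (removed, added, changed_rows) for two lists of lines."""
    matcher = SequenceMatcher(None, lines_a, lines_b, autojunk=False)
    removed = added = changed_rows = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        removed += i2 - i1   # lines diff would mark with "<"
        added += j2 - j1     # lines diff would mark with ">"
        # A side-by-side view pairs old and new lines, so a replaced
        # line occupies one row rather than two.
        changed_rows += max(i2 - i1, j2 - j1)
    return removed, added, changed_rows


if __name__ == "__main__":
    file_a = ["min,max", "1,5", "3,4", "-2,10"]
    file_b = ["min,max", "2,5", "3,4", "-1,1"]
    print(line_difference_counts(file_a, file_b))  # (2, 2, 2)
```

Summing removed and added gives 4 for this sample, while the side-by-side style count gives 2.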

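【Comments】:

This double-counts lines, because "<" and ">" can both be printed for the same line.

【Solution 5】:

If you want to count the number of lines that differ, use this: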

diff -U 0 file1 file2 | grep ^@ | wc -l

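Doesn't John's answer double-count the differing lines?

【Comments】:

Yes, it double-counts. See my comment on the accepted answer. The command in this answer is correct.
This also seems to miscount lines for me, on both MacOSX and Ubuntu. Batches of consecutive lines can be combined into one hunk, so it depends on your task whether that should count as one difference or several.
As @khedron pointed out, batches of consecutive lines can be combined into a single hunk. By my estimation, that means this approach is prone to undercounting.
You can write grep -c ^@ instead of grep ^@ | wc -l
"Prone to undercounting" is putting it mildly - run this command on two completely different files and it will give you a result of 1.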
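
The undercounting described in these comments can be reproduced with a small Python sketch. It assumes difflib's unified_diff with n=0 groups changes in a way comparable to diff -U 0, which may not hold exactly for every input; count_hunks is a name chosen for this example.

```python
from difflib import unified_diff


def count_hunks(lines_a, lines_b):
    """Count "@@" hunk headers in a zero-context unified diff."""
    diff = unified_diff(lines_a, lines_b, n=0)
    # Content lines are prefixed with "-" or "+", so only hunk headers
    # can start with "@@".
    return sum(1 for line in diff if line.startswith("@@"))


if __name__ == "__main__":
    # Three lines, all different: still a single hunk.
    print(count_hunks(["a\n", "b\n", "c\n"], ["x\n", "y\n", "z\n"]))  # 1
```

The count is the number of contiguous changed regions, not the number of changed lines.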

【Solution 6】:

If you are on Linux/Unix, what about comm -23 file1 file2 to print the lines of file1 that are not in file2, comm -23 file1 file2 | wc -l to count them, and comm -13 ... similarly for the other file?

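【Comments】:

sureshw points out in another answer that comm expects its arguments to be sorted files, so this suggestion can only be relied on in special cases. (I think it would be easy to write your own version of comm in awk that also works on unsorted input, but I doubt that is in the spirit of the original question.)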
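
A comm-like count that does not need sorted input can be sketched as follows. The comment above suggests awk; this version uses Python instead, treats each file as a multiset of lines, and ignores line order entirely, which is an assumption about what "difference" should mean. The name unique_line_counts is hypothetical.

```python
from collections import Counter


def unique_line_counts(lines_a, lines_b):
    """Return (only_in_a, only_in_b) line counts, ignoring line order."""
    count_a, count_b = Counter(lines_a), Counter(lines_b)
    # Counter subtraction keeps only positive counts, so duplicates are
    # matched one-for-one, like comm does on sorted input.
    only_in_a = sum((count_a - count_b).values())
    only_in_b = sum((count_b - count_a).values())
    return only_in_a, only_in_b


if __name__ == "__main__":
    print(unique_line_counts(["x", "y", "y", "z"], ["y", "z", "w"]))  # (2, 1)
```

Because order is ignored, two files containing the same lines in a different order report no differences.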

【Solution 7】:

diff -U 0 file1 file2 | grep -v ^@ | wc -l

Subtract 2 for the two file names at the top of the diff listing. Unified format is probably a bit faster than side-by-side format.

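【Comments】:

This doesn't work, as I define "work": pastie.org/pastes/3179433/text Each file has only one character, so what does the number "4" refer to?
This works. For your example you get four lines: the first two are the names of the files (as stated in the answer), and the other two are the two differences, 1 line for the removed 'a' and 1 line for the added 'b'.
It depends on how you count differences. In this example, pastie.org/5553254, I would say that 2 lines differ, i.e. I agree with sequoia mcdowell. Having to subtract 2 from the result (because the names of the 2 diff:ed files are printed) is also inconvenient. I therefore think Josh's answer is the correct one. It can be shortened slightly by using the -c (count) option of grep instead of piping to wc -l, like this: diff -U 0 file1 file2 | grep -c ^@
diff -U 0 file1 file2 | grep -v ^@ | tail -n +3 | wc -l should give the correct count. It excludes the file names at the top of the diff output.
The correct solution is here unix.stackexchange.com/questions/53719/… as the accepted answer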
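
The arithmetic discussed in these comments (4 lines including the two file-name headers, 2 without them) can be checked with a short Python sketch. It assumes difflib's unified_diff is a fair stand-in for diff -U 0; count_changed_lines is a name made up for this illustration.

```python
from difflib import unified_diff
from itertools import islice


def count_changed_lines(lines_a, lines_b):
    """Count "-" and "+" lines of a zero-context unified diff."""
    diff = unified_diff(lines_a, lines_b, n=0)
    # Skip the "---" and "+++" file-name headers (like tail -n +3),
    # then drop the "@@" hunk headers (like grep -v ^@).
    body = islice(diff, 2, None)
    return sum(1 for line in body if not line.startswith("@"))


if __name__ == "__main__":
    # One-character files "a" and "b": one removed line, one added line.
    print(count_changed_lines(["a\n"], ["b\n"]))  # 2
```

A replaced line still counts twice here (once as removed, once as added), so this matches the "2 differences" reading from the comments rather than a count of changed rows.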