Check for multi-line content in a file: answers

【Question title】: Check for multi-line content in a file
【Posted】: 2019-11-22 19:32:49
【Question】:

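I am trying to check whether a multi-line string exists in a file, using common bash commands (grep, awk, ...).

I want a file containing a few lines (plain lines, not patterns) that should exist in another file, and a command (or sequence of commands) that checks whether they do. If grep could accept arbitrary multi-line patterns, I would use something like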

grep "`cat contentfile`" targetfile

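As with grep, I want to be able to check the exit code of the command. I am not really interested in the output. In fact, no output at all would be preferable, since then I would not have to pipe to /dev/null.

I have searched for hints, but could not come up with a search that gives any good hits. There is How can I search for a multiline pattern in a file?, but that is about pattern matching.

I found pcre2grep, but I need to use "standard" *nix tools.

Example:

Content file: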

line 3
line 4
line 5

Target file:

line 1
line 2
line 3
line 4
line 5
line 6

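This should match and return 0, because the sequence of lines from the content file is found (in exactly the same order) in the target file.

EDIT: Sorry for not being clear about "pattern" vs. "string" comparison and "output" vs. "exit code" in earlier versions of this question.

【Comments on the question】:

Are you on Linux, or do you need MacOS/BSD compatibility?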
perl -0777 -pe 'exit 0 if s/'"$(cat patternfile)"'//; exit 1' targetfile?
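@Cyrus Works, at least for the few simple tests I just ran. Please turn it into an answer.
It does not work if patternfile contains /. I am sure there are better solutions.
This might help: How to know if a text file is a subset of another

Tags: awk grep multilinestring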

【Solution 1】:

One-liner:

$ if [ $(diff --left-column -y patternfile targetfile | grep '(' -A1 -B1 | tail -n +2 | head -n -1 | wc -l) == $(cat patternfile | wc -l) ]; then echo "ok"; else echo "error"; fi

Explanation:

The first step is to compare the two files with diff:

diff --left-column -y patternfile targetfile
                                      > line 1
                                      > line 2
line 3                                (
line 4                                (
line 5                                (
                                      > line 6

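Then filter the output to keep only the lines of interest, i.e. the lines marked with "(", plus 1 extra line before and after the match, to check that the lines from patternfile match without a break.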

diff --left-column -y patternfile targetfile | grep '(' -A1 -B1 

                                      > line 2
line 3                                (
line 4                                (
line 5                                (
                                      > line 6

Then drop the first and last lines:

diff --left-column -y patternfile targetfile | grep '(' -A1 -B1 | tail -n +2 | head -n -1

line 3                                (
line 4                                (
line 5                                (

Add some code to check that the number of lines matches the number of lines in patternfile:

if [ $(diff --left-column -y patternfile targetfile | grep '(' -A1 -B1 | tail -n +2 | head -n -1 | grep '(' | wc -l) == $(cat patternfile | wc -l) ]; then echo "ok"; else echo "error"; fi

ok

To use this with a return code, a script can be created like this:

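#!/bin/bash
# Usage: comparepatterns PATTERNFILE TARGETFILE
patternfile=$1
targetfile=$2
# Count the "(" lines that form one unbroken block and compare with the pattern length.
if [ $(diff --left-column -y "$patternfile" "$targetfile" | grep '(' -A1 -B1 | tail -n +2 | head -n -1 | grep '(' | wc -l) -eq $(wc -l < "$patternfile") ]
then
   exit 0
else
   exit 1
fi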

Test (with the script above saved as comparepatterns):

$ comparepatterns patternfile targetfile
$ echo $?
0

【Comments】:

【Solution 2】:

Following up on Cyrus's comment, which points to How to know if a text file is a subset of another, the following Python one-liner does the trick:

python -c "content=open('content').read(); target=open('target').read(); exit(0 if content in target else 1);"
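
For readers who want the same substring test without Python, here is a minimal sketch in plain POSIX shell. The function name contains_text and the exit code 2 for unreadable files are assumptions made for illustration; they are not part of the answer. Like the one-liner, it loads both files into memory and does a literal substring test, so it assumes files of moderate size without NUL bytes.

```shell
#!/bin/sh
# contains_text CONTENT_FILE TARGET_FILE   (hypothetical helper name)
# Exit status: 0 if the content occurs literally in the target, 1 if not,
# 2 if a file cannot be read (an assumed convention, mirroring grep).
contains_text() (
    [ -r "$1" ] && [ -r "$2" ] || exit 2
    # Command substitution strips trailing newlines, so append a sentinel
    # character and remove it again afterwards.
    content=$(cat -- "$1"; printf x)
    target=$(cat -- "$2"; printf x)
    content=${content%x}
    target=${target%x}
    # The quoted "$content" is matched literally, not as a glob pattern.
    case $target in
        *"$content"*) exit 0 ;;
        *) exit 1 ;;
    esac
)
```

It can be called as contains_text content target, followed by a check of $?. Nothing is written to stdout, so no redirection to /dev/null is needed.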

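【Comments】:

【Solution 3】:

You did not say whether you want a regexp match or a string match, and we cannot tell. You named the search file "patternfile", and "pattern" could mean anything. At one point you imply that you want a string match (check if a multi-line _string_ exists), but then you use grep and pcregrep without the arguments that would request a string match rather than a regexp match.

In any case, these will do whatever it is you want, using any awk (including POSIX standard awk, and you said you want to use standard UNIX tools) in any shell on every UNIX box:

For a regexp match: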

$ cat tst.awk
NR==FNR { pat = pat $0 ORS; next }
{ tgt = tgt $0 ORS }
END {
    while ( match(tgt,pat) ) {
        printf "%s", substr(tgt,RSTART,RLENGTH)
        tgt = substr(tgt,RSTART+RLENGTH)
    }
}

$ awk -f tst.awk patternfile targetfile
line 3
line 4
line 5

For a string match:

$ cat tst.awk
NR==FNR { pat = pat $0 ORS; next }
{ tgt = tgt $0 ORS }
END {
    lgth = length(pat)
    while ( beg = index(tgt,pat) ) {
        printf "%s", substr(tgt,beg,lgth)
        tgt = substr(tgt,beg+lgth)
    }
}

$ awk -f tst.awk patternfile targetfile
line 3
line 4
line 5

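Having said that, with GNU awk, if you are fine with a regexp match and with backslash interpretation of the patternfile contents (so \t is treated as a literal tab), you could do: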

$ awk -v RS="$(cat patternfile)" 'RT!=""{print RT}' targetfile
line 3
line 4
line 5

Or with GNU grep:

$ grep -zo "$(cat patternfile)" targetfile | tr '\0' '\n'
line 3
line 4
line 5

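There are many other options, depending on what kind of match you really want and which versions of the tools are available.

【Comments】:

Both of your suggestions return exit status 0 whether or not there is a match. I need to be able to check the result in bash or a makefile.
Right. You should of course have stated that in your question, but it is an absolutely trivial tweak. If you have any difficulty implementing it and do need help, make sure to edit your question to include all relevant information, including whether you are trying to do a string or a regexp match, what should be printed to stdout/stderr, what the exit status should be, and so on.
It must be mentioned that the first case will fail if you have an EByte data file ;-). Very nice RS solution.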
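
One way the tweak discussed in these comments might look for the string-match case is sketched below. The wrapper name contains_block is hypothetical, and prefixing both strings with ORS (so that a match can only begin at the start of a line) is an added assumption rather than something the answer does. The sketch prints nothing and assumes a non-empty pattern file.

```shell
#!/bin/sh
# contains_block PATTERNFILE TARGETFILE   (hypothetical helper name)
# Exit status 0 if the lines of PATTERNFILE occur as one consecutive block
# of whole lines in TARGETFILE, 1 otherwise. Nothing is printed.
contains_block() {
    awk '
        NR == FNR { pat = pat $0 ORS; next }   # first file: build the search string
        { tgt = tgt $0 ORS }                   # second file: build the text to search
        END {
            # index() is a literal search, so regexp metacharacters are harmless.
            # The leading ORS on both sides anchors the match at a line start.
            exit !index(ORS tgt, ORS pat)
        }
    ' "$1" "$2"
}
```

With the example files from the question, contains_block patternfile targetfile followed by echo $? prints 0.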

【Solution 4】:

The simplest approach is a sliding window. Read the pattern file first, then the file to be searched.

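(FNR==NR) { a[FNR]=$0; n=FNR; next }   # first file: store the pattern lines
{ b[FNR]=$0 }                          # second file: remember the current line
(FNR < n) { next }                     # fewer than n lines read so far: no full window yet
{ for(i=1; i<=n; ++i) if (a[i] != b[FNR-n+i]) { delete b[FNR-n+1]; next } }
{ print "match at", FNR-n+1; r=1 }     # the last n lines equal the pattern
END { exit !r }                        # exit status 0 only if a match was found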

You call it as

awk -f script.awk patternFile searchFile
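
A variation on the sliding-window idea, sketched under a few assumptions: the function name window_match is hypothetical, the window is kept in a ring buffer so that only n lines are held at a time, and awk stops at the first match without printing anything. It assumes a non-empty pattern file.

```shell
#!/bin/sh
# window_match PATTERNFILE SEARCHFILE   (hypothetical helper name)
# Exit status 0 if the pattern lines appear consecutively in SEARCHFILE, else 1.
window_match() {
    awk '
        NR == FNR { pat[FNR] = $0; n = FNR; next }   # first file: store the pattern
        {
            buf[FNR % n] = $0          # ring buffer holding the last n lines
            if (FNR < n) next          # window not full yet
            for (i = 1; i <= n; i++)   # compare window with pattern, oldest line first
                if (pat[i] != buf[(FNR - n + i) % n]) next
            found = 1
            exit                       # stop reading; the END block still runs
        }
        END { exit !found }
    ' "$1" "$2"
}
```

The early exit means a large file is only read up to the first match, and the result can be used directly, for example in if window_match patternFile searchFile; then ...; fi.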

【Comments】:

【Solution 5】:

Another solution in awk:

echo $(awk 'FNR==NR{ a[$0]; next}{ x=($0 in a)?x+1:0 }x==length(a){ print "OK" }' patternfile targetfile )

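This prints "OK" if there is a match.

【Comments】:

【Solution 6】:

EDIT: Since the OP needs the result of the command as true or false (yes or no), the command has been edited accordingly (written and tested in GNU awk).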

awk -v message="yes" 'FNR==NR{a[$0];next} ($0 in a){if((FNR-1)==prev){b[++k]=$0} else {delete b;k=""}} {prev=FNR}; END{if(length(b)>0){print message}}'  patternfile  targetfile

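Could you please try the following. Tested with the given samples, it should print all consecutive lines from the pattern file if they appear in the target file in the same order (in this code, the count of consecutive lines should be at least 2).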

awk '
FNR==NR{
  a[$0]
  next
}
($0 in a){
  if((FNR-1)==prev){
      b[++k]=$0
  }
  else{
      delete b
      k=""
  }
}
{
  prev=FNR
}
END{
  for(j=1;j<=k;j++){
      print b[j]
  }
}'  patternfile  targetfile

Explanation: a commented version of the code above.

awk '                                     ##Starting awk program here.
FNR==NR{                                  ##FNR==NR will be TRUE when first Input_file is being read.
  a[$0]                                   ##Creating an array a with index $0.
  next                                    ##next will skip all further statements from here.
}
($0 in a){                                ##Statements from here will be executed when the 2nd Input_file is being read, checking if the current line is present in array a.
  if((FNR-1)==prev){                      ##Checking condition if prev variable is equal to FNR-1 value then do following.
      b[++k]=$0                           ##Creating an array named b whose index is variable k whose value is increment by 1 each time it comes here.
  }
  else{                                   ##Mentioning else condition here.
      delete b                            ##Deleting array b here.
      k=""                                ##Nullifying k here.
  }
}
{
  prev=FNR                                ##Setting prev value as FNR value here.
}
END{                                      ##Starting END section of this awk program here.
  for(j=1;j<=k;j++){                      ##Starting a for loop here.
      print b[j]                          ##Printing value of array b whose index is variable j here.
  }
}'  patternfile  targetfile               ##mentioning Input_file names here.

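【Comments】:

It would be great if you could explain how this solution works.
@codeforester, an explanation of the solution has been added now, cheers.
When the last line of the pattern file is changed to line 6, this script outputs line 3; line 4; line 6. Surely that is not the desired output?
@Luuk, IMHO that should be the desired output, since all of the lines appear in the same order; the OP can confirm.
When there is no match, the return code of your awk statement is still 0 (or OK). It should normally be possible to check the return code after the statement with echo $?; just returning "yes" is not enough. My one-liner has the same problem, I will edit it later...?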
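
The exit-status concern raised in this last comment can also be handled outside awk. The following is only a sketch: the helper name succeeds_if_output is hypothetical, and it assumes that the wrapped command prints something to stdout exactly when there is a match, as the "yes" and "OK" one-liners above are meant to do.

```shell
#!/bin/sh
# succeeds_if_output COMMAND [ARGS...]   (hypothetical helper name)
# Runs the command and returns 0 only if it wrote something to stdout.
# Command substitution drops trailing newlines, so output that consists
# of newlines alone counts as "no output".
succeeds_if_output() {
    [ -n "$("$@")" ]
}
```

Placing succeeds_if_output in front of one of the printing awk commands, with the same arguments, turns "printed something" into exit status 0 and "printed nothing" into 1. The captured output itself is discarded.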