Match expression across multiple lines in shell script: answers

【Question title】: Match expression across multiple lines in shell script
【Posted】: 2017-06-06 20:38:14
【Question】:

I want to match a pattern that spans multiple lines in a shell script. My input is:

START <some data including white spaces>
<some data including white spaces, can span across multiple lines, number of lines are variable>
ID: n1 <some data including white spaces>
<some data including white spaces, can span across multiple lines, number of lines are variable>
END

START <some data including white spaces>
<some data including white spaces, can span across multiple lines, number of lines are variable>
ID: n2 <some data including white spaces>
<some data including white spaces, can span across multiple lines, number of lines are variable>
END

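I am trying to use a regular expression to show output only for a specific ID (for example n1 or n2). I tried the regex START(.|\n)*ID: n1(.|\n)*END, but it also picks up the data for ID: n2. What should I change in the regex to get the data for one specific ID only?

The command I am using is cat inputfile | grep 'pattern' > outputfile.

The number of lines in each block, and the number of lines between START and ID: n1 and between ID: n1 and END, can vary, so head/tail is not a viable option. Also, when the ID matches, I want to print the whole block from START to END.

Edit: I tried Online Regex Creator, which successfully matches the regex

START[\s\S][^END]*ID: n1[\s\S][^END]*END

against my input file.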

【Comments】:

Is Perl acceptable? It's easy in Perl...

Tags: regex shell scripting grep

【Solution 1】:

A GNU awk or Mawk solution that handles paired START and END occurrences:

awk -v id='n2' -v RS='(^|\n)START |\nEND' '
  $0 ~ ("\nID: " id " ") { print "START " $0 "\nEND" }
' file

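Note: this solution uses a multi-character RS value (which is then also treated as a regular expression). The POSIX spec does not support this: it leaves the behavior unspecified when RS contains more than one character. GNU awk and Mawk (the default awk on Ubuntu) both support such values, whereas BSD/macOS awk did not at the time of writing.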

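-v id='n2' passes the ID value n2 to Awk as the variable id.
RS='(^|\n)START |\nEND' splits the input into records consisting of the (line-spanning) text between the token START at the start of the input or of a line and the token END following a newline.
$0 ~ ("\nID: " id " ") matches each input record ($0) against (~) a regular expression that identifies the specified ID: a newline followed by ID: , then the ID value of interest (stored in the variable id) and a space.
Note how string concatenation in Awk works by simply placing strings and variable references next to each other.
On a match, print "START " $0 "\nEND" prints the input record at hand, bracketed by the START and END tokens (which, being the input record separators, are not reported as part of $0).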

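If the lines between each paired START and END are all non-empty (that is, each contains at least 1 character, even if that character is only a space or a tab), here is a POSIX-compliant awk solution: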

awk -v id='n2' -v RS= '$0 ~ ("\nID: " id " ")' file

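Note that -v RS=, that is, setting the input record separator (RS) to the empty string, is an awk idiom that splits the input into records by paragraph (runs of non-empty lines).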
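
The paragraph-mode idiom can be tried out with a small script. This is only a minimal sketch: the function name print_block_by_id and the sample data are hypothetical, and it assumes blocks are separated by blank lines and contain no blank lines themselves.

```shell
#!/bin/sh
# Hypothetical helper (not from the original answer): print every
# blank-line-separated block that contains a line starting with "ID: <id> ".
# Note: the ID is spliced into a regex, so regex metacharacters in it are
# interpreted as such.
print_block_by_id() {
  # $1 = ID value (e.g. n2), $2 = input file
  awk -v id="$1" -v RS= '$0 ~ ("\nID: " id " ")' "$2"
}

# Made-up sample input, for illustration only.
sample=$(mktemp)
cat > "$sample" <<'EOF'
START first block
some data
ID: n1 more data
trailing data
END

START second block
ID: n2 more data
END
EOF

print_block_by_id n2 "$sample"   # prints the three lines of the second block
rm -f "$sample"
```

If a block may contain blank lines, paragraph mode splits it into several records, so this sketch no longer applies.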

【Comments】:

【Solution 2】:

awk in paragraph mode, using two consecutive newlines as the record separator:

awk -v RS='\n\n' '/ID: n1/' file.txt

Replace n1 with n2, n3, ... for the other IDs.

Example:

$ cat file.txt
START <some data including white spaces>
<some data including white spaces>
ID: n1 <some data including white spaces>
<some data including white spaces>
END

START <some data including white spaces>
<some data including white spaces>
ID: n2 <some data including white spaces>
<some data including white spaces>
END

START <some data including white spaces>
<some data including white spaces>
ID: n3 <some data including white spaces>
<some data including white spaces>
END


$ awk -v RS='\n\n' '/ID: n1/' file.txt
START <some data including white spaces>
<some data including white spaces>
ID: n1 <some data including white spaces>
<some data including white spaces>
END


$ awk -v RS='\n\n' '/ID: n2/' file.txt
START <some data including white spaces>
<some data including white spaces>
ID: n2 <some data including white spaces>
<some data including white spaces>
END


$ awk -v RS='\n\n' '/ID: n3/' file.txt
START <some data including white spaces>
<some data including white spaces>
ID: n3 <some data including white spaces>
<some data including white spaces>
END

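【Comments】:

The number of lines between START and ID: n1, and between ID: n1 and END, is variable. So using \n\n will not produce the desired result.
@ChintanParikh Text processing depends entirely on the input. Please post the input accurately.
Edited the question. To be clear, the number of lines before and after the ID is variable. So a data block is START <any characters spanning variable number of lines including white spaces> ID: n1<any characters spanning variable number of lines including white spaces> END. I hope this clears up any doubt.

【Solution 3】:

In awk, you can accumulate the text between the start pattern and the end pattern, then test that buffer for a match: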

cat inputfile | awk  '/^START/        { buf=$0 "\n"; flag=1; next } 
                      flag            { buf=buf $0 "\n" } 
                      /^END/ && flag  { flag=0; if (buf ~ /ID: n1 |ID: n2 /) print buf }'

In Perl you can do this:

cat inputfile | perl -0777 -lne 'while (/(^START.*?^ID: (n\d+) .*?^END)/gms){
    if ($2 eq "n1" || $2 eq "n2"){
        print "$1\n\n";
    }
}'

In either case, you probably want to use awk '{script}' inputfile or perl '{script}' inputfile rather than cat.
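
The accumulate-then-test idea can also be written with the ID passed in as a variable and the file read directly by awk. The following is only a sketch under stated assumptions, not the answerer's code: the function name print_start_end_block and the sample data are hypothetical, and the ID line is assumed to have the form "ID: <id> ..." with the ID as its second whitespace-separated field.

```shell
#!/bin/sh
# Hypothetical helper: print each START..END block whose ID line names the
# given ID. Uses only POSIX awk features; the ID is compared for equality
# rather than being used as a regex.
print_start_end_block() {
  # $1 = ID value (e.g. n1), $2 = input file
  awk -v id="$1" '
    /^START/ { buf = ""; inblock = 1; found = 0 }     # a new block begins
    inblock  { buf = buf $0 "\n" }                    # collect every line of the block
    inblock && $1 == "ID:" && $2 == id { found = 1 }  # remember a matching ID line
    /^END/ && inblock {
      if (found) printf "%s", buf                     # emit the whole block
      inblock = 0
    }
  ' "$2"
}

# Made-up sample input; the first block contains a blank line on purpose.
sample=$(mktemp)
cat > "$sample" <<'EOF'
START first block

ID: n1 more data
END
START second block
ID: n2 more data
END
EOF

print_start_end_block n1 "$sample"   # prints the four lines of the first block
rm -f "$sample"
```

Because blocks are delimited by the START and END lines rather than by blank lines, this variant does not depend on how the blocks are separated.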

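【Comments】:

I'm new to Perl, and I get the error String found where operator expected when I run the code.
Does it work with the example? Please update your post with a more relevant example of your actual data.
The only change I made was to use cat inputfile instead of echo "$txt".
This may be a quoting issue in your shell. Are you using Bash? You could try perl -0777 -e '{perl one liner}' inputfile instead of using cat.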