Printing n number of lines between string matches in a separate field: answers

【Question title】: Printing n number of lines between string matches in a separate field
【Posted】: 2020-03-30 13:13:54
【Question description】:

I have two files: File_1 contains IDs, and File_2 contains the data I want to match them against. The files look like this:

File_1
Cluster 43
Cluster 51
Cluster 145
Cluster 160

File_2
>Cluster 43
0       5249nt, >CL276.Contig2_All... at +/98.55%
1       6413nt, >CL276.Contig3_All... *
2       5375nt, >CL276.Contig5_All... at +/95.91%
3       5405nt, >CL276.Contig6_All... at +/98.33%
>Cluster 51
0       6298nt, >CL5173.Contig2_All... *
1       3421nt, >CL5173.Contig3_All... at +/99.50%
2       1017nt, >CL5173.Contig4_All... at +/98.13%
3       503nt, >Unigene10077_All... at +/98.01%
>Cluster 145
0       4772nt, >CL1798.Contig5_All... at +/98.49%
1       4782nt, >CL1798.Contig8_All... *
2       4781nt, >CL1798.Contig10_All... at +/99.27%
3       4773nt, >CL1798.Contig11_All... at +/99.25%
>Cluster 160
0       2883nt, >CL4790.Contig2_All... at +/95.87%
1       4699nt, >CL4790.Contig3_All... *
2       1274nt, >CL4790.Contig7_All... at +/99.37%
3       4616nt, >CL4790.Contig14_All... at -/95.65%

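I need to find every line in File_2 that matches an ID from File_1, print the matching line (for example "Cluster 43"), and print all the lines between one match and the next in a separate field. The desired output should look like this: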

Cluster 43 5249nt CL276.Contig2_All
           6413nt CL276.Contig3_All
           5375nt CL276.Contig5_All
           5405nt CL276.Contig6_All
Cluster 51 6298nt CL5173.Contig2_All
           3421nt CL5173.Contig3_All
           1017nt CL5173.Contig4_All
           503nt Unigene10077_All

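With my command line (see below) I can process the files to get the lines in File_2 that match the IDs from File_1, print each matching line together with all the lines between matches, and strip the unwanted information from every line. However, I am having trouble finding a way to print the lines between matches in a separate field, as in the desired output shown above.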

My command line

$ grep -A4 -Fwf File_1 File_2 | sed 's/All.*//g;/Contig/s/.$/_All/g;/Unigene/s/.$/_All/g;s/-.*//;/^$/d;s/.//;s/[,>]//g' | awk '{print $1, $2}' > my_wanted_file

Resulting output

$ head my_wanted_file

Cluster 43
5249nt CL276.Contig2_All
6413nt CL276.Contig3_All
5375nt CL276.Contig5_All
5405nt CL276.Contig6_All
Cluster 51
6298nt CL5173.Contig2_All
3421nt CL5173.Contig3_All
1017nt CL5173.Contig4_All
503nt Unigene10077_All

To achieve my goal, I wrote the following command line:

$ awk '/^Cluster/ {if ('\n') {printf NR==4}}' my_wanted_file | head

But it printed nothing.

Then I tried:

$ awk '/Cluster/ {for(i=1; i<=4; i++) {getline; print}}' my_wanted_file | head

But it only prints the lines between the matches (clusters), one after another, like this:

5249nt CL276.Contig2_All
6413nt CL276.Contig3_All
5375nt CL276.Contig5_All
5405nt CL276.Contig6_All
6298nt CL5173.Contig2_All
3421nt CL5173.Contig3_All
1017nt CL5173.Contig4_All
503nt Unigene10077_All
4772nt CL1798.Contig5_All
4782nt CL1798.Contig8_All

I can't find a way to get from this:

Cluster 43
5249nt CL276.Contig2_All
6413nt CL276.Contig3_All
5375nt CL276.Contig5_All
5405nt CL276.Contig6_All
Cluster 51
6298nt CL5173.Contig2_All
3421nt CL5173.Contig3_All
1017nt CL5173.Contig4_All
503nt Unigene10077_All

to this:

Cluster 43 5249nt CL276.Contig2_All
           6413nt CL276.Contig3_All
           5375nt CL276.Contig5_All
           5405nt CL276.Contig6_All
Cluster 51 6298nt CL5173.Contig2_All
           3421nt CL5173.Contig3_All
           1017nt CL5173.Contig4_All
           503nt Unigene10077_All

I would greatly appreciate your help with this.

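【Question comments】:

I can't see Cluster 43 in File_2. Could you please correct your example so that we can understand it better? By the way, thank you for the effort you put into the question; please keep it up, and let us know once you have edited your post.
Because File_2 is a large file, I only showed the first few lines; the strings from File_1 appear further down in File_2.
Yes, showing only a few sample lines is fine. My request is to post lines that are common to both files, so that we can work out how to match the two files and produce the desired output. The input and output shown should be in sync.
@RavinderSingh13 I have changed the lines shown for File_2. The information in both files now matches. Thank you for the request.
You really should take another look at my answer to your previous question; it would also make what you're doing now easier, because it simply treats each cluster block as one record.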

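Tags: string awk sed grep

【Solution 1】:

Could you please try the following? I am not sure whether you need the value up to All or up to the string at, or whether All really appears in your Input_file, so the regex can be adjusted accordingly.

This also takes care of equal spacing on all lines.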

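awk '
FNR==NR{                              # first file: the IDs
  $0=">"$0                            # add the ">" prefix used in the second file
  a[$0]                               # remember the ID
  max=max>length($0)?max:length($0)   # track the longest ID for alignment
  next
}
FNR==1 && FNR!=NR{                    # first line of the second file
  spaces=sprintf("%-"max+1"s",OFS)    # blank padding as wide as the longest ID plus one
}
/^>/{                                 # any header line resets the state
  found=val=count=""
}
/^>/ && $0 in a{                      # header line that is listed in the first file
  found=1
  val= $0
  remain_spaces=sprintf("%-"max-length($0)+1"s",OFS)   # padding after this header
  next
}
found{                                # member lines of a wanted cluster
  gsub(/^>|at.*/,"",$3)               # drop a leading ">" and anything from "at" on in field 3
  sub(/,/,"",$2)                      # drop the comma after the length
  printf("%s\n",++count==1?val remain_spaces $2 OFS $3:spaces $2 OFS $3)
}
'  Input_file1  Input_file2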

The output is as follows.

>Cluster 43  5249nt CL276.Contig2_All...
             6413nt CL276.Contig3_All...
             5375nt CL276.Contig5_All...
             5405nt CL276.Contig6_All...
>Cluster 51  6298nt CL5173.Contig2_All...
             3421nt CL5173.Contig3_All...
             1017nt CL5173.Contig4_All...
             503nt Unigene10077_All...
>Cluster 145 4772nt CL1798.Contig5_All...
             4782nt CL1798.Contig8_All...
             4781nt CL1798.Contig10_All...
             4773nt CL1798.Contig11_All...
>Cluster 160 2883nt CL4790.Contig2_All...
             4699nt CL4790.Contig3_All...
             1274nt CL4790.Contig7_All...
             4616nt CL4790.Contig14_All...
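
The alignment in the script above comes from building a printf format string whose width is the length of the longest ID. Here is a minimal sketch of just that step, assuming a POSIX awk. The function name pad_labels and the trailing | marker are hypothetical and only there to make the padding visible; they are not part of the answer.

```shell
# Hypothetical helper: pad every input line to the width of the longest one.
pad_labels() {
    awk '
    {
        label[NR] = $0                            # keep every line for the END pass
        if (length($0) > max) max = length($0)    # track the widest line
    }
    END {
        fmt = "%-" max "s|\n"                     # for example "%-12s|\n"
        for (i = 1; i <= NR; i++) printf fmt, label[i]
    }'
}

# Demo with two headers of different length.
printf '>Cluster 43\n>Cluster 145\n' | pad_labels
```

With these two sample headers the shorter one gets a single trailing blank before the |, so both markers line up in the same column.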

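【Comments】:

Dear @RavinderSingh13, the output looks like this: Cluster 43 5249nt CL276.Contig2_All... 6413nt CL276.Contig3_All... 5375nt CL276.Contig5_All... 5405nt CL276.Contig6_All... Cluster 43 44 6380nt CL1653.Contig4_All... 1074nt Unigene864_All... 1069nt Unigene42819_All... Cluster 43 45 6380nt CL4699.Contig1_All...
@LeonardoMartin, sorry, but that is impossible to follow without code tags. Please check my edit now: I fixed the spacing part, if that was your problem. Let me know?
Dear @RavinderSingh13, it works perfectly now. Yes, I need the values up to _All, exactly as in your sample output. Thank you very much for your help.
@LeonardoMartin, you're welcome, glad it helped. Cheers.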

【Solution 2】:

Is this what you are trying to do (using GNU awk for multi-char RS, ENDFILE, and \s as shorthand for [[:space:]])?

$ cat tst.awk
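NR==FNR {                           # first file: remember each wanted ID
    tgts[$0]
    next
}
ENDFILE {                           # after a file ends: one record per cluster block
    RS = "(^|\n)(>|$)"
    FS = "\n"
}
(FNR > 1) && ($1 in tgts) {         # skip the first record, keep wanted clusters
    gsub(/\n[0-9]+\s+/,"\n")        # strip the leading index from each member line
    gsub(/[,>]|[.]{3}[^\n]*/,"")    # strip commas, ">", and everything from "..." on
    for (i=2; i<=NF; i++) {
        print $1, $i
        gsub(/./," ",$1)            # blank the header so later lines are indented
    }
}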

$ awk -f tst.awk File_1 File_2
Cluster 43 5249nt CL276.Contig2_All
           6413nt CL276.Contig3_All
           5375nt CL276.Contig5_All
           5405nt CL276.Contig6_All
Cluster 51 6298nt CL5173.Contig2_All
           3421nt CL5173.Contig3_All
           1017nt CL5173.Contig4_All
           503nt Unigene10077_All
Cluster 145 4772nt CL1798.Contig5_All
            4782nt CL1798.Contig8_All
            4781nt CL1798.Contig10_All
            4773nt CL1798.Contig11_All
Cluster 160 2883nt CL4790.Contig2_All
            4699nt CL4790.Contig3_All
            1274nt CL4790.Contig7_All
            4616nt CL4790.Contig14_All
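
For awk implementations without the GNU extensions, roughly the same result can be sketched line by line instead of record by record. This is a variant under stated assumptions, not Ed Morton's script: cluster_table is a hypothetical name, member lines are assumed to carry the length in field 2 and the name in field 3 as in the sample, and File_1 is assumed to be non-empty (with an empty first file the FNR == NR test would also match the second file).

```shell
# Hypothetical helper; arguments mirror File_1 (IDs) and File_2 (cluster listing).
cluster_table() {
    awk '
    FNR == NR { want[$0]; next }        # first file: remember every wanted ID
    /^>/ {                              # header line such as ">Cluster 43"
        label = substr($0, 2)           # drop the leading ">"
        keep  = (label in want)         # is this cluster listed in the first file?
        next
    }
    keep {
        len = $2; sub(/,$/, "", len)    # "5249nt," becomes "5249nt"
        id  = $3; sub(/^>/, "", id)     # drop the leading ">" from the name
        sub(/[.][.][.].*$/, "", id)     # drop the trailing "..."
        print label, len, id
        gsub(/./, " ", label)           # blank the label for the remaining lines
    }
    ' "$1" "$2"
}

# Usage, with the file names from the question:
#   cluster_table File_1 File_2
```

Because each header is compared as a whole string, an ID such as Cluster 5 would not pick up Cluster 51.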

【Comments】:

Dear Ed Morton, your solution also works very well. Thanks to you and @RavinderSingh13 for your time. I am upvoting your answer as well.

【Solution 3】:

Assuming the input is:

Cluster 43
5249nt CL276.Contig2_All
6413nt CL276.Contig3_All
5375nt CL276.Contig5_All
5405nt CL276.Contig6_All
Cluster 51
6298nt CL5173.Contig2_All
3421nt CL5173.Contig3_All
1017nt CL5173.Contig4_All
503nt Unigene10077_All

and assuming each cluster has exactly 5 lines, you can run:

sed 'N;s/[[:blank:]]*\n[[:blank:]]*/|/; n;s/^/|/; n;s/^/|/; n;s/^/|/'

Or like this, which looks shorter; I don't know whether it's better:

sed 's/^/|/' | sed 's/^|//;N;s/\n//;N;N;N'

to get:

Cluster 43|5249nt CL276.Contig2_All
|6413nt CL276.Contig3_All
|5375nt CL276.Contig5_All
|5405nt CL276.Contig6_All
Cluster 51|6298nt CL5173.Contig2_All
|3421nt CL5173.Contig3_All
|1017nt CL5173.Contig4_All
|503nt Unigene10077_All

You can choose a separator other than |. Now you can columnize it by running it through column:

column -t -s '|' -o ' '

which outputs:

Cluster 43 5249nt CL276.Contig2_All
           6413nt CL276.Contig3_All
           5375nt CL276.Contig5_All
           5405nt CL276.Contig6_All
Cluster 51 6298nt CL5173.Contig2_All
           3421nt CL5173.Contig3_All
           1017nt CL5173.Contig4_All
           503nt Unigene10077_All

The whole command I tested it with looks like this:

cat <<EOF |
Cluster 43
5249nt CL276.Contig2_All
6413nt CL276.Contig3_All
5375nt CL276.Contig5_All
5405nt CL276.Contig6_All
Cluster 51
6298nt CL5173.Contig2_All
3421nt CL5173.Contig3_All
1017nt CL5173.Contig4_All
503nt Unigene10077_All
EOF
sed 'N;s/[[:blank:]]*\n[[:blank:]]*/|/; n;s/^/|/; n;s/^/|/; n;s/^/|/' | column -t -s '|' -o ' '
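
If clusters do not always have exactly 5 lines, or if column is not available with -o, the same layout can be sketched in awk alone. This is only an illustration under assumptions: align_clusters is a hypothetical name, and header lines are assumed to start with Cluster as in the input above.

```shell
# Hypothetical helper: reads the intermediate format on stdin.
align_clusters() {
    awk '
    /^Cluster/ {                  # header line: remember it, print nothing yet
        label = $0
        width = length(label)
        first = 1
        next
    }
    {
        # Left-justify the label (or an empty string) in a field as wide as the label.
        printf "%-" width "s %s\n", (first ? label : ""), $0
        first = 0
    }'
}

# Demo on a shortened sample with clusters of different sizes.
printf 'Cluster 43\n5249nt CL276.Contig2_All\n6413nt CL276.Contig3_All\nCluster 51\n6298nt CL5173.Contig2_All\n' | align_clusters
```

The indentation follows the width of each header, so a longer header such as Cluster 145 indents its own block by one more column.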

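【Comments】:

Dear KamilCuk, it did not work as expected. I get the following warning: column: invalid option -- 'o'. But thank you for your time anyway.
I guess you're on BSD. Install GNU column, or you can remove -o ' '.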