Extracting certain patterns by searching all columns for each line, and writing them in specified columns in the output file: answers

【Question title】: Extracting certain patterns by searching all columns for each line, and writing them in specified columns in output file
【Posted】: 2020-12-08 01:09:19
【Question】:

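I am a beginner in programming, and I was given a task in which certain strings in the "INFO" column of a text file should be extracted using awk. The code is below: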

awk -F '\t'  '/^[^#]/ {n=split($8,a,/[;]/); for(i=1;i<=n;i++) {if(a[i] ~ /^CLNDN=/) printf("%s\t",a[i]); else if(a[i] ~ /^CLNREVSTAT=/) printf("%s\t",a[i]); else if(a[i] ~ /^CLNSIG=/) printf("%s\t",a[i]);else if(a[i] ~ /^CLNSIGCONF=/) printf("%s\t",a[i]); else if(a[i] ~ /^ORIGIN=/) printf("%s\t",a[i]); } printf("\n");}' test.vcf > trial.vcf

Here is the input file:

1   879375  950448  C   T   .   .   ALLELEID=929884;CLNDISDB=MedGen:CN517202;CLNDN=not_provided;CLNHGVS=NC_000001.10:g.879375C>T;CLNREVSTAT=criteria_provided,_single_submitter;CLNSIG=Pathogenic;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;GENEINFO=SAMD11:148398;MC=SO:0001587|nonsense;ORIGIN=1

1   955619  210112  G   C   .   .   AF_EXAC=0.03475;AF_TGP=0.00879;ALLELEID=206690;CLNDISDB=MONDO:MONDO:0014052,MedGen:C3808739,OMIM:615120|MedGen:CN169374|MedGen:CN517202;CLNDN=Myasthenic_syndrome,_congenital,_8|not_specified|not_provided;CLNHGVS=NC_000001.10:g.955619G>C;CLNREVSTAT=criteria_provided,_conflicting_interpretations;CLNSIG=Conflicting_interpretations_of_pathogenicity;CLNSIGCONF=Benign(1),Likely_benign(2),Uncertain_significance(1);CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;GENEINFO=AGRN:375790;MC=SO:0001583|missense_variant;ORIGIN=1;RS=201073369

Below, you can see a sample of the output I would like to get.

CLNDN=not_provided      CLNREVSTAT=criteria_provided,_single_submitter  CLNSIG=Pathogenic            ORIGIN=1           
CLNDN=Myasthenic_syndrome,_congenital,_8|not_specified|not_provided     CLNREVSTAT=criteria_provided,_conflicting_interpretations       CLNSIG=Conflicting_interpretations_of_pathogenicity         CLNSIGCONF=Benign(1),Likely_benign(2),Uncertain_significance(1) ORIGIN=1

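There is a gap between CLNSIG and ORIGIN in the first line because that line does not contain CLNSIGCONF= information. I want to extract the strings starting with CLNDN=, CLNREVSTAT=, CLNSIG=, CLNSIGCONF= and ORIGIN=, and print them to columns 1-5 of the output file, respectively. The code is able to extract the strings of interest, but I am stuck on printing them to the specified columns.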

I would be very grateful if you could help me (and I am very open to any suggestions).

Thank you very much.

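【Comments】:

Good that you have shown your efforts in your question. Could you please post a sample of your Input_file so the question is easier to understand?
Please update your question with the samples in CODE TAGS so it is easier to understand.
I updated the post.
Wouldn't a header line with the field names (CLNDN, CLNREVSTAT, etc.) followed by just the values below it be a better output format than having every line contain both the field name and its value?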

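Tags: awk

【Solution 1】:

EDIT: In case any of your elements could be missing from a line, try the following. It also prints a message when no match is found in a line (if you don't want that, remove the if(cldn=="" &&....) block from this solution).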

awk '
BEGIN{
  OFS="\t"
}
match($0,/CLNDN=[^;]*/){
  cldn=substr($0,RSTART,RLENGTH)
}
match($0,/CLNREVSTAT=[^;]*/){
  clnrevstat=substr($0,RSTART,RLENGTH)
}
match($0,/CLNSIG=[^;]*/){
  clnsig=substr($0,RSTART,RLENGTH)
}
match($0,/CLNSIGCONF=[^;]*/){
  clnsisconf=substr($0,RSTART,RLENGTH)
}
match($0,/ORIGIN=[^;]*/){
  origin=substr($0,RSTART,RLENGTH)
}
NF{
  if(cldn=="" && clnrevstat=="" && clnsig=="" && clnsisconf=="" && origin==""){
    print "NO matched value found in this line."
    next
  }
  print cldn,clnrevstat,clnsig,clnsisconf,origin
  cldn=clnrevstat=clnsig=clnsisconf=origin=""
  next
}
1
'  Input_file

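Could you please try the following, written and tested with the shown samples in GNU awk. Note that this version prints a line only when ORIGIN= is found in it.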

awk '
BEGIN{
  OFS="\t"
}
{ cldn=clnrevstat=clnsig=clnsisconf="" }
match($0,/CLNDN=[^;]*/){
  cldn=substr($0,RSTART,RLENGTH)
}
match($0,/CLNREVSTAT=[^;]*/){
  clnrevstat=substr($0,RSTART,RLENGTH)
}
match($0,/CLNSIG=[^;]*/){
  clnsig=substr($0,RSTART,RLENGTH)
}
match($0,/CLNSIGCONF=[^;]*/){
  clnsisconf=substr($0,RSTART,RLENGTH)
}
match($0,/ORIGIN=[^;]*/){
  print cldn,clnrevstat,clnsig,clnsisconf,substr($0,RSTART,RLENGTH)
}
' Input_file

Explanation: a detailed, commented version of the above.

awk '                                    ##Starting awk program from here.
BEGIN{                                   ##Starting BEGIN section of this program from here.
  OFS="\t"                               ##Setting OFS as tab here.
}
{ cldn=clnrevstat=clnsig=clnsisconf="" } ##Resetting all variables to empty at the start of each line.
match($0,/CLNDN=[^;]*/){                 ##Using match function to match from string CLNDN= till semi colon here.
  cldn=substr($0,RSTART,RLENGTH)         ##Creating cldn which has matched regex sub string.
}
match($0,/CLNREVSTAT=[^;]*/){            ##Using match function to match from string CLNREVSTAT= till semi colon here.
  clnrevstat=substr($0,RSTART,RLENGTH)   ##Creating clnrevstat which has matched regex sub string here.
}
match($0,/CLNSIG=[^;]*/){                ##Using match function to match from string CLNSIG= till semi colon here.
  clnsig=substr($0,RSTART,RLENGTH)       ##Creating clnsig which has matched regex sub string here.
}
match($0,/CLNSIGCONF=[^;]*/){            ##Using match function to match from string CLNSIGCONF= till semi colon here.
  clnsisconf=substr($0,RSTART,RLENGTH)   ##Creating clnsisconf which has matched regex sub string here.
}
match($0,/ORIGIN=[^;]*/){                ##Using match function to match from string ORIGIN= till semi colon here.
  print cldn,clnrevstat,clnsig,clnsisconf,substr($0,RSTART,RLENGTH)
                                         ##Printing all variables value and sub string of matched regex.
}
' Input_file                             ##Mentioning Input_file name here.
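
To make the match() / RSTART / RLENGTH mechanics above easier to try in isolation, here is a minimal sketch that works on a single INFO string instead of a whole file. The helper name extract_tag and its two arguments are assumptions made for this illustration and are not part of the answer.

```shell
# extract_tag TAG INFO (hypothetical helper)
# Prints the "TAG=value" piece of a ';'-separated INFO string,
# or nothing when the tag is absent.
extract_tag() {
  awk -v tag="$1" -v info="$2" 'BEGIN {
    # Dynamic regex: TAG= followed by everything up to the next semicolon.
    # match() sets RSTART (start position) and RLENGTH (match length).
    if (match(info, tag "=[^;]*"))
      print substr(info, RSTART, RLENGTH)
  }'
}

extract_tag CLNSIG 'CLNREVSTAT=criteria_provided;CLNSIG=Pathogenic;CLNSIGCONF=Benign(1);ORIGIN=1'
```

Because the pattern includes the trailing "=", searching for CLNSIG does not pick up CLNSIGCONF. The pattern is otherwise unanchored, as in the answer, so a tag whose name is the tail of another tag name could still match inside it.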

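【Comments】:

【Solution 2】:

Whenever you have tag=value pairs in your data, it is best to first build an array of that mapping (f[] below) and then print the values you want by their tags (names):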

$ cat tst.awk
BEGIN { OFS="\t" }
NF {
    delete f
    split($NF,tagVals,/;/)
    for (i in tagVals) {
        tag = tagVals[i]
        sub(/=.*/,"",tag)
        f[tag] = tagVals[i]
    }
    print f["CLNDN"], f["CLNREVSTAT"], f["CLNSIG"], f["CLNSIGCONF"], f["ORIGIN"]
}

$ awk -f tst.awk file
CLNDN=not_provided      CLNREVSTAT=criteria_provided,_single_submitter  CLNSIG=Pathogenic               ORIGIN=1
CLNDN=Myasthenic_syndrome,_congenital,_8|not_specified|not_provided     CLNREVSTAT=criteria_provided,_conflicting_interpretations     CLNSIG=Conflicting_interpretations_of_pathogenicity     CLNSIGCONF=Benign(1),Likely_benign(2),Uncertain_significance(1)       ORIGIN=1
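
The same tag-to-value mapping can be adapted to the layout used in the question's own command, where the file is tab-separated, INFO is column 8 and VCF header lines start with "#". The following is only a sketch under those assumptions; the function name extract_info_columns is made up for this illustration.

```shell
# extract_info_columns [FILE...] (hypothetical helper)
# Assumes tab-separated input with the INFO string in column 8,
# as in the awk command shown in the question.
extract_info_columns() {
  awk -F '\t' '
    BEGIN { OFS = "\t" }
    /^#/ || !NF { next }          # skip VCF header lines and empty lines
    {
      split("", f)                # portable way to empty the array
      n = split($8, tagVals, /;/)
      for (i = 1; i <= n; i++) {
        tag = tagVals[i]
        sub(/=.*/, "", tag)       # keep only the name before "="
        f[tag] = tagVals[i]       # map tag name to the whole tag=value pair
      }
      # A missing tag yields an empty string, so columns stay aligned.
      print f["CLNDN"], f["CLNREVSTAT"], f["CLNSIG"], f["CLNSIGCONF"], f["ORIGIN"]
    }
  ' "$@"
}

# Demo on one made-up record read from standard input:
printf '1\t879375\t950448\tC\tT\t.\t.\tCLNDN=not_provided;CLNSIG=Pathogenic;ORIGIN=1\n' |
  extract_info_columns
```

With the file names from the question, the call would be extract_info_columns test.vcf > trial.vcf.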

FWIW I would do this instead, so that you don't have every field in every line containing the tag as well as the value:

$ cat tst.awk
BEGIN {
    OFS = "\t"
    n = split("CLNDN CLNREVSTAT CLNSIG CLNSIGCONF ORIGIN",tags)
    for (i=1; i<=n; i++) {
        tag = tags[i]
        printf "%s%s", tag, (i<n ? OFS : ORS)
    }
}
{
    delete tag2val

    split($NF,tagVals,/;/)
    for (i in tagVals) {
        tag = val = tagVals[i]
        sub(/=.*/,"",tag)
        sub(/[^=]+=/,"",val)
        tag2val[tag] = val
    }

    for (i=1; i<=n; i++) {
        tag = tags[i]
        val = tag2val[tag]
        printf "%s%s", val, (i<n ? OFS : ORS)
    }
}

$ awk -f tst.awk file
CLNDN   CLNREVSTAT      CLNSIG  CLNSIGCONF      ORIGIN
not_provided    criteria_provided,_single_submitter     Pathogenic              1
Myasthenic_syndrome,_congenital,_8|not_specified|not_provided   criteria_provided,_conflicting_interpretations    Conflicting_interpretations_of_pathogenicity    Benign(1),Likely_benign(2),Uncertain_significance(1)    1

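【Comments】:

Thank you very much for the solution, Ed! But there is one thing I didn't get: what does (i<n ? OFS : ORS) mean?
It is just a common ternary expression (see https://en.wikipedia.org/wiki/%3F:), used to print OFS after every field except the last one, and ORS after the last one.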
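
As a small standalone illustration of that ternary separator idiom, the sketch below joins its arguments with tabs and ends the line with a newline. The helper name join_fields is hypothetical and exists only for this example.

```shell
# join_fields ARG... (hypothetical helper)
# Prints all arguments on one line, separated by OFS (a tab here).
join_fields() {
  awk 'BEGIN {
    OFS = "\t"
    n = ARGC - 1                  # ARGV[0] is the program name
    for (i = 1; i <= n; i++) {
      # OFS after every field except the last, ORS after the last one
      printf "%s%s", ARGV[i], (i < n ? OFS : ORS)
    }
  }' "$@"
}

join_fields CLNDN CLNREVSTAT CLNSIG
```

Since the program consists only of a BEGIN block, awk never tries to open the arguments as input files.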