【Question title】: Include sequential numbering to matching text
【Posted】: 2017-10-18 15:52:00
【Question description】:

I have a file that currently looks like this, for example:

>ENSOFAS001369_p |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS001369,probes-probe:,probes-source:Anoplocnemis_curvipes_contig204
>ENSOFAS001369_p |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS001369,probes-probe:,probes-source:Anoplocnemis_curvipes_contig204
>ENSOFAS001264_p |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS001264,probes-probe:,probes-source:Anoplocnemis_curvipes_contig6359
>ENSOFAS001264_p |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS001264,probes-probe:,probes-source:Anoplocnemis_curvipes_contig6359
>ENSOFAS001369_p |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS001369,probes-probe:,probes-source:Alydus_pilosus_contig594
>ENSOFAS004569_p |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS004569,probes-probe:,probes-source:Alydus_pilosus_contig30341

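Focusing on the text between the > and | symbols, I would like to add sequential numbering based on matching ENSOFAS numeric IDs. That is, I want to turn it into this: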

>ENSOFAS001369_p1 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS001369,probes-probe:,probes-source:Anoplocnemis_curvipes_contig204
>ENSOFAS001369_p2 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS001369,probes-probe:,probes-source:Anoplocnemis_curvipes_contig204
>ENSOFAS001264_p1 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS001264,probes-probe:,probes-source:Anoplocnemis_curvipes_contig6359
>ENSOFAS001264_p2 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS001264,probes-probe:,probes-source:Anoplocnemis_curvipes_contig6359
>ENSOFAS001369_p3 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS001369,probes-probe:,probes-source:Alydus_pilosus_contig594
>ENSOFAS004569_p1 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS004569,probes-probe:,probes-source:Alydus_pilosus_contig30341

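I can search for this with grep in TextWrangler (>ENSOFAS(\d+)_p (.+)\r), but as far as I know a text editor cannot add the number after _p. I think the macOS/Linux version of the search part might be grep -E ">ENSOFAS[0-9]\{6\}_p\s|", but I don't know how to insert the numbering between _p and the whitespace before |. Matching ENSOFAS numbers are not grouped together in the text file, but I can sort the file in some way if needed.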

【Question discussion】:

    Tags: search replace grep


    【Solution 1】:

    An awk approach:

    awk '{ $1=$1""++a[$1] }1' file
    

    Output:

    >ENSOFAS001369_p1 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS001369,probes-probe:,probes-source:Anoplocnemis_curvipes_contig204
    >ENSOFAS001369_p2 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS001369,probes-probe:,probes-source:Anoplocnemis_curvipes_contig204
    >ENSOFAS001264_p1 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS001264,probes-probe:,probes-source:Anoplocnemis_curvipes_contig6359
    >ENSOFAS001264_p2 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS001264,probes-probe:,probes-source:Anoplocnemis_curvipes_contig6359
    >ENSOFAS001369_p3 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS001369,probes-probe:,probes-source:Alydus_pilosus_contig594
    >ENSOFAS004569_p1 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS004569,probes-probe:,probes-source:Alydus_pilosus_contig30341
    

    An alternative using awk's sub() function:

    awk '{ sub(/$/,++a[$1],$1) }1' file
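
    Both commands above number every line by its first field. If the file also contains non-header lines (for example a sequence line under each > header, which the question does not show and is only assumed here), a /^>/ pattern can restrict the numbering to header lines. A minimal sketch with a made-up sample file; the array name seen and the shortened headers are illustrative:

```shell
# Hypothetical sample: two headers sharing an ID, each followed by a sequence line.
sample=$(mktemp)
printf '%s\n' \
  '>ENSOFAS001369_p |probes-source:contigA' \
  'ACGT' \
  '>ENSOFAS001369_p |probes-source:contigB' \
  'ACGT' > "$sample"

# Only lines starting with ">" are renumbered; all other lines pass through unchanged.
numbered=$(awk '/^>/ { $1 = $1 (++seen[$1]) } 1' "$sample")
printf '%s\n' "$numbered"
rm -f "$sample"
```

    Without the /^>/ pattern, the two ACGT lines would also be counted by their first field and come out as ACGT1 and ACGT2.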
    

    【Discussion】:

      【Solution 2】:

      If awk is an option in your setup:

      $ awk '{cnt[$1]++; $1=$1""cnt[$1]; print}' file
      >ENSOFAS001369_p1 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS001369,probes-probe:,probes-source:Anoplocnemis_curvipes_contig204
      >ENSOFAS001369_p2 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS001369,probes-probe:,probes-source:Anoplocnemis_curvipes_contig204
      >ENSOFAS001264_p1 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS001264,probes-probe:,probes-source:Anoplocnemis_curvipes_contig6359
      >ENSOFAS001264_p2 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS001264,probes-probe:,probes-source:Anoplocnemis_curvipes_contig6359
      >ENSOFAS001369_p3 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS001369,probes-probe:,probes-source:Alydus_pilosus_contig594
      >ENSOFAS004569_p1 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS004569,probes-probe:,probes-source:Alydus_pilosus_contig30341
      

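      Explanation: $1 holds the first field of each line, e.g. >ENSOFAS001369_p. We use the associative array cnt to count the occurrences of each unique token found in $1, and modify the field $1 (before output) so that it includes the current count for the record/line being processed.

      The awk script can be shortened, but this form is probably easier to read and understand.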
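
      The counting described above can be traced on a small input. The following is a sketch with shortened, made-up header lines (the variable key is introduced only for readability); the IDs are deliberately interleaved to show that no prior sorting is needed, since each ID keeps its own counter in the array:

```shell
# Shortened, made-up headers; the IDs are deliberately interleaved.
result=$(printf '%s\n' \
  '>ENSOFAS001369_p |a' \
  '>ENSOFAS001264_p |b' \
  '>ENSOFAS001369_p |c' |
awk '{
  key = $1            # first field, e.g. ">ENSOFAS001369_p"
  cnt[key]++          # number of times this key has been seen so far
  $1 = key cnt[key]   # append the running count to the first field
  print
}')
printf '%s\n' "$result"
```

      The first and third lines share an ID and receive _p1 and _p2, while the line between them starts its own count at _p1.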

      【Discussion】:
