【Question title】: How to delete lines from file with structure criteria
【Posted】: 2019-01-15 19:21:51
【Question description】:

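I have a file with a fixed structure, and I want to delete the lines wherever that structure is not met. The structure should be: 1) a line starting with "Sequence", 2) a line starting with "Start", 3) a line starting with a number.

In my file, some records have the first two lines but no numeric line (the numeric lines were removed earlier with grep). I would like a way, using awk or sed, to delete the first two lines whenever the numeric line is missing. Is this possible?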

cat file.txt
Sequence: HM855457_IGHV1-8*02_Homosapiens_F_V-REGION_24..319_296nt_1_____296+0=296__rev-compl_     from: 1   to: 296
Start     End  Strand Pattern                 Mismatch Sequence
217     225       + pattern:AA[CT]NNN[AT]CN        . aacacctcc
Sequence: MG719312_IGHV1-8*03_Homosapiens_F_V-REGION_127..422_296nt_1_____296+0=296___     from: 1   to: 296
Start     End  Strand Pattern                 Mismatch Sequence
217     225       + pattern:AA[CT]NNN[AT]CN        . aacacctcc
Sequence: M99648_IGHV2-26*01_Homosapiens_F_V-REGION_164..464_301nt_1_____301+0=301___     from: 1   to: 301
Start     End  Strand Pattern                 Mismatch Sequence
Sequence: L21969_IGHV2-70*01_Homosapiens_F_V-REGION_144..444_301nt_1_____301+0=301___     from: 1   to: 301
Start     End  Strand Pattern                 Mismatch Sequence
176     184       + pattern:AA[CT]NNN[AT]CN        . aatactaca
Sequence: X92241_IGHV2-70*02_Homosapiens_F_V-REGION_144..433_290nt_1_____290+0=290_partialin3'__     from: 1   to: 290
Start     End  Strand Pattern                 Mismatch Sequence
176     184       + pattern:AA[CT]NNN[AT]CN        . aatactaca

Expected output:

cat file.txt
Sequence: HM855457_IGHV1-8*02_Homosapiens_F_V-REGION_24..319_296nt_1_____296+0=296__rev-compl_     from: 1   to: 296
Start     End  Strand Pattern                 Mismatch Sequence
217     225       + pattern:AA[CT]NNN[AT]CN        . aacacctcc
Sequence: MG719312_IGHV1-8*03_Homosapiens_F_V-REGION_127..422_296nt_1_____296+0=296___     from: 1   to: 296
Start     End  Strand Pattern                 Mismatch Sequence
217     225       + pattern:AA[CT]NNN[AT]CN        . aacacctcc
Sequence: L21969_IGHV2-70*01_Homosapiens_F_V-REGION_144..444_301nt_1_____301+0=301___     from: 1   to: 301
Start     End  Strand Pattern                 Mismatch Sequence
176     184       + pattern:AA[CT]NNN[AT]CN        . aatactaca
Sequence: X92241_IGHV2-70*02_Homosapiens_F_V-REGION_144..433_290nt_1_____290+0=290_partialin3'__     from: 1   to: 290
Start     End  Strand Pattern                 Mismatch Sequence
176     184       + pattern:AA[CT]NNN[AT]CN        . aatactaca

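【Comments】:

  • Can you show the expected output and what you have tried?
  • Sigh. It is absolutely trivial to adapt the awk script from my previous answer. This is what I was trying to warn you about wrt using sed for tasks like this: now you need a completely different solution.

Tags: bash awk sed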


【Solution 1】:

You can use this awk command:

awk '/^[0-9]+/ && NR==a["Sequence:"]+2 && NR==a["Start"]+1 {
   print r["Sequence:"] ORS r["Start"] ORS $0
}
/^(Sequence:|Start)/ {
   a[$1]=NR
   r[$1]=$0
}' file

Sequence: HM855457_IGHV1-8*02_Homosapiens_F_V-REGION_24..319_296nt_1_____296+0=296__rev-compl_     from: 1   to: 296
Start     End  Strand Pattern                 Mismatch Sequence
217     225       + pattern:AA[CT]NNN[AT]CN        . aacacctcc
Sequence: MG719312_IGHV1-8*03_Homosapiens_F_V-REGION_127..422_296nt_1_____296+0=296___     from: 1   to: 296
Start     End  Strand Pattern                 Mismatch Sequence
217     225       + pattern:AA[CT]NNN[AT]CN        . aacacctcc
Sequence: L21969_IGHV2-70*01_Homosapiens_F_V-REGION_144..444_301nt_1_____301+0=301___     from: 1   to: 301
Start     End  Strand Pattern                 Mismatch Sequence
176     184       + pattern:AA[CT]NNN[AT]CN        . aatactaca
Sequence: X92241_IGHV2-70*02_Homosapiens_F_V-REGION_144..433_290nt_1_____290+0=290_partialin3'__     from: 1   to: 290
Start     End  Strand Pattern                 Mismatch Sequence
176     184       + pattern:AA[CT]NNN[AT]CN        . aatactaca
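
One way to sanity-check a filtered result is to confirm that it consists purely of three-line records in the order stated in the question. The following is a minimal sketch and not part of the original answer; the file name filtered.txt, the function name check_structure, and the shortened sample lines are all assumptions made for illustration.

```shell
# Hypothetical sample of already-filtered output (shortened lines, illustration only).
cat > filtered.txt <<'EOF'
Sequence: A     from: 1   to: 296
Start     End  Strand Pattern                 Mismatch Sequence
217     225       + pattern:AA[CT]NNN[AT]CN        . aacacctcc
Sequence: B     from: 1   to: 301
Start     End  Strand Pattern                 Mismatch Sequence
176     184       + pattern:AA[CT]NNN[AT]CN        . aatactaca
EOF

# Lines 1, 4, 7, ... must start with "Sequence:", lines 2, 5, 8, ... with "Start",
# and lines 3, 6, 9, ... with a digit. Stop at the first violation.
check_structure() {
    awk '
        NR % 3 == 1 && !/^Sequence:/ { bad = NR; exit }
        NR % 3 == 2 && !/^Start/     { bad = NR; exit }
        NR % 3 == 0 && !/^[0-9]/     { bad = NR; exit }
        END {
            if (bad)         { print "line " bad " breaks the structure"; exit 1 }
            if (NR % 3 != 0) { print "last record is incomplete"; exit 1 }
            print "ok"
        }
    ' "$1"
}

check_structure filtered.txt
```

The function prints ok and returns 0 when every record is complete, and returns a non-zero status otherwise, so it can be chained after any of the filtering commands in this thread.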

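【Comments】:

  • This works great, thanks. What manual/reference do you use for awk? I would like to use awk more, but it is hard to find a good reference.
  • Start with grymoire.com/Unix/Awk.html and gnu.org/s/gawk/manual/gawk.pdf, then you can look at some of the awk answers on SO itself.
  • @b.nota You can find manuals and learning resources in the tag wikis: stackoverflow.com/tags/awk/info and stackoverflow.com/tags/sed/info
  • Get the book Effective Awk Programming, 4th Edition, by Arnold Robbins. Arnold puts a lot of time and effort into providing GNU awk and its documentation, he is kind enough to make that documentation available online for reference, and the only compensation he receives is from book sales. So by all means use the links above for reference, but please support Arnold's continued efforts and buy the book.

【Solution 2】:
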
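% awk '
  $1 == "Sequence:" {seq   = $0}   # remember the latest Sequence line
  $1 == "Start"     {start = $0}   # remember the latest Start line
  # l = first field of the previous line, L = first field of the line before that
  $1 ~ /^[0-9]+$/ && l == "Start" && L == "Sequence:" {print seq; print start; print}
  {L = l}
  {l = $1}' file.txt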

【Comments】:

    【Solution 3】:

    For files that fit in memory, you can slurp the whole file and process it

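    perl -0777 -pe 's/^Sequence.*\nStart.*\n(?!\d)//mg' ip.txt
    
    • -0777 slurps the entire file
    • the m flag lets the ^ and $ anchors match at line boundaries inside the multiline string, and the g flag repeats the substitution so every incomplete record is removed, not only the first
    • ^Sequence.*\nStart.*\n(?!\d) matches ^Sequence.*\nStart.*\n only if it is not followed by a digit. Note that . does not match a newline unless the s flag is used

    Alternatively, you can match and print only the complete groups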

    perl -0777 -ne 'print /^Sequence.*\nStart.*\n\d.*\n/mg' ip.txt
    

    【Comments】:

      【Solution 4】:

      To print only the 3-line records, all you need is:

      $ cat tst.awk
      /^Sequence:/ { lineNr=0; rec="" }
      { rec = (++lineNr > 1 ? rec ORS : "") $0 }
      lineNr == 3 { print rec }
      

      For example:

      $ awk -f tst.awk file
      Sequence: HM855457_IGHV1-8*02_Homosapiens_F_V-REGION_24..319_296nt_1_____296+0=296__rev-compl_     from: 1   to: 296
      Start     End  Strand Pattern                 Mismatch Sequence
      217     225       + pattern:AA[CT]NNN[AT]CN        . aacacctcc
      Sequence: MG719312_IGHV1-8*03_Homosapiens_F_V-REGION_127..422_296nt_1_____296+0=296___     from: 1   to: 296
      Start     End  Strand Pattern                 Mismatch Sequence
      217     225       + pattern:AA[CT]NNN[AT]CN        . aacacctcc
      Sequence: L21969_IGHV2-70*01_Homosapiens_F_V-REGION_144..444_301nt_1_____301+0=301___     from: 1   to: 301
      Start     End  Strand Pattern                 Mismatch Sequence
      176     184       + pattern:AA[CT]NNN[AT]CN        . aatactaca
      Sequence: X92241_IGHV2-70*02_Homosapiens_F_V-REGION_144..433_290nt_1_____290+0=290_partialin3'__     from: 1   to: 290
      Start     End  Strand Pattern                 Mismatch Sequence
      176     184       + pattern:AA[CT]NNN[AT]CN        . aatactaca
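
The script above prints a record as soon as it reaches its third line, which fits the sample input where each record has at most one numeric line. If a record could contain several numeric lines (an assumption, since the question does not show such a case), a variant that buffers the whole record and prints it when the next Sequence line or the end of input arrives might look like the sketch below. The helper name emit and the file names multi.txt and multi.out are hypothetical.

```shell
# Hypothetical shortened sample: A has two numeric lines, B has none, C has one.
cat > multi.txt <<'EOF'
Sequence: A
Start     End
217     225
230     238
Sequence: B
Start     End
Sequence: C
Start     End
176     184
EOF

awk '
    # Print the buffered record if it holds both header lines plus at least one more line.
    function emit() { if (lineNr >= 3) print rec }
    /^Sequence:/ { emit(); lineNr = 0; rec = "" }      # a new record starts: flush the old one
    { rec = (++lineNr > 1 ? rec ORS : "") $0 }         # append the current line to the buffer
    END { emit() }                                     # flush the final record
' multi.txt > multi.out
cat multi.out
```

Like the original, this counts lines and does not check that the extra lines really start with a digit; it relies on the input containing only Sequence, Start, and numeric lines.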
      

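      But for a more useful way to analyze your data, take another look at the script at the bottom of my answer to your previous question. To adapt it to discard records with fewer than 3 lines, you just move the lineNr=0 setting from inside the lineNr==3 block to a new /Sequence:/ block, and the script carries on working, giving you an array whose fields you can access by name: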

      $ cat tst.awk
      /^Sequence:/ { lineNr = 0 }
      
      ++lineNr == 1 {
          delete fldNr2tag
          delete tagNr2tag
          delete tag2val
          numTags = 0
      
          for (i=1; i<=NF; i+=2) {
              sub(/:.*/,"",$i)
              tag = $i (i>1 ? "" : 1) # to distinguish the 2 "Sequence" tags
              val = $(i+1)
              tagNr2tag[++numTags] = tag
              tag2val[tag] = val
          }
      }
      lineNr == 2 {
          for (i=1; i<=NF; i++) {
              tag = $i
              fldNr2tag[i] = tag
          }
      }
      lineNr == 3 {
          for (i=1; i<=NF; i++) {
              tag = fldNr2tag[i]
              val = $i
              tagNr2tag[++numTags] = tag
              tag2val[tag] = val
          }
      
          prt()
      }
      
      function prt(   tagNr, tag, val) {
          for (tagNr=1; tagNr<=numTags; tagNr++) {
              tag = tagNr2tag[tagNr]
              val = tag2val[tag]
              printf "tag2val[%s] = <%s>\n", tag, val
          }
          print "----"
      }
      

      .

      $ awk -f tst.awk file
      tag2val[Sequence1] = <HM855457_IGHV1-8*02_Homosapiens_F_V-REGION_24..319_296nt_1_____296+0=296__rev-compl_>
      tag2val[from] = <1>
      tag2val[to] = <296>
      tag2val[Start] = <217>
      tag2val[End] = <225>
      tag2val[Strand] = <+>
      tag2val[Pattern] = <pattern:AA[CT]NNN[AT]CN>
      tag2val[Mismatch] = <.>
      tag2val[Sequence] = <aacacctcc>
      ----
      tag2val[Sequence1] = <MG719312_IGHV1-8*03_Homosapiens_F_V-REGION_127..422_296nt_1_____296+0=296___>
      tag2val[from] = <1>
      tag2val[to] = <296>
      tag2val[Start] = <217>
      tag2val[End] = <225>
      tag2val[Strand] = <+>
      tag2val[Pattern] = <pattern:AA[CT]NNN[AT]CN>
      tag2val[Mismatch] = <.>
      tag2val[Sequence] = <aacacctcc>
      ----
      tag2val[Sequence1] = <L21969_IGHV2-70*01_Homosapiens_F_V-REGION_144..444_301nt_1_____301+0=301___>
      tag2val[from] = <1>
      tag2val[to] = <301>
      tag2val[Start] = <176>
      tag2val[End] = <184>
      tag2val[Strand] = <+>
      tag2val[Pattern] = <pattern:AA[CT]NNN[AT]CN>
      tag2val[Mismatch] = <.>
      tag2val[Sequence] = <aatactaca>
      ----
      tag2val[Sequence1] = <X92241_IGHV2-70*02_Homosapiens_F_V-REGION_144..433_290nt_1_____290+0=290_partialin3'__>
      tag2val[from] = <1>
      tag2val[to] = <290>
      tag2val[Start] = <176>
      tag2val[End] = <184>
      tag2val[Strand] = <+>
      tag2val[Pattern] = <pattern:AA[CT]NNN[AT]CN>
      tag2val[Mismatch] = <.>
      tag2val[Sequence] = <aatactaca>
      ----
      

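      If you only want to print the input lines as-is, it gets simpler still, but I really think the above is what you will want so that you can add all sorts of comparisons and output combinations.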
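
As an illustration of the kind of comparison this makes possible, here is a condensed sketch of the same tag2val idea that prints only hits whose Start value exceeds a threshold. The threshold of 200, the choice of output columns, the file names named.txt and named.out, and the shortened sequence names are assumptions for the example and not part of the original answer.

```shell
# Hypothetical shortened sample: B has no numeric line, C starts below the threshold.
cat > named.txt <<'EOF'
Sequence: A     from: 1   to: 296
Start     End  Strand Pattern                 Mismatch Sequence
217     225       + pattern:AA[CT]NNN[AT]CN        . aacacctcc
Sequence: B     from: 1   to: 301
Start     End  Strand Pattern                 Mismatch Sequence
Sequence: C     from: 1   to: 301
Start     End  Strand Pattern                 Mismatch Sequence
176     184       + pattern:AA[CT]NNN[AT]CN        . aatactaca
EOF

awk '
    /^Sequence:/ { lineNr = 0 }
    ++lineNr == 1 {
        split("", tag2val)                          # portable way to empty the array
        for (i = 1; i <= NF; i += 2) {
            sub(/:.*/, "", $i)                      # strip the trailing colon from the tag
            tag2val[$i (i > 1 ? "" : 1)] = $(i+1)   # the record name is stored as "Sequence1"
        }
    }
    lineNr == 2 { for (i = 1; i <= NF; i++) fldNr2tag[i] = $i }   # column headers
    lineNr == 3 {
        for (i = 1; i <= NF; i++) tag2val[fldNr2tag[i]] = $i      # values keyed by header
        # Example comparison (arbitrary threshold): keep hits that start after position 200.
        if (tag2val["Start"] + 0 > 200)
            print tag2val["Sequence1"], tag2val["Start"], tag2val["End"], tag2val["Sequence"]
    }
' named.txt > named.out
cat named.out
```

With this sample only record A is reported, since B never reaches a third line and C starts at 176.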

      【Comments】:
