【Question title】: Print lines below a header matching a condition
【Posted】: 2021-11-26 08:42:00
【Question description】:

I have a database like this:

>2654570298
MRNYSYKGKWEKLLTPEIVKKLTLINEFKGEQRLFIKAHKDELKELSELA
KIQSTEASNKIEGIFTSDDRFKSLAQAKTTPRNRNESEIAGYRDVLNTIH
DSYEYIPISASYFLQLHRDLYKFVAKNDVGKFKSSDNIIRETDEKGNERL
RFRPVPAWETPAAIDELCKAYADAKEEIDPLILNAMFILDFLCIHPFNDG
NGRMSRLLTLLLLYKTGFIVGKYISIEKIIEESKETYYEVLQDSLVGWHE
NENDYKPFVNYMLGVIVNAYKEFESRTELVTNPNLTKSDRIREIIKDHIG
TITKAELLEMNPDISDTTVQRTLAKLLKNNDIKKIGGGRYTKYTWNTEEQ

>2654570299|K03427
MITGELKNKIDGLWDVFAAGGLVNPLEVIEQITYLMFIKDLDDVDKRKEK
ESAMLGLPYKSIFAGEVKIGDRTIEGTQLKWSVFHDFSAGRMYAIMQEWV
FPFIKNLHSDKNSTYSKYMDDAIFKFPTPLLLSKVVDSLDEIYEIMNSTL
VLDVRGDVYEYLLNKIASAGRNGQFRTPRHIIRMMVEMVEPKADDVICDP
GDLLKVCKTKKTELLFLALFLRMLKVGGRCACIVPDGVLFGSSKAHKDIR
KQVVEENRLEAVISMPSGVFKPYAGVSTAILIFTKTGHGGTDNVWFYDMT
ADGYSLDDKRTPVSENDIPDIIERFKNLDKEIDRERTDKSFMVPKQDIAD
NDYDLSINKYKEVVYEKIEYPPTSEIMADIREIEMEIGKEMDELEKLLNI

>2654570301
MNESELYKELGILTKDKSKWAENIQYVSSLLNHESAKIQAKALWLLGEMG
LEYPDSIQDAVPMVASFCDSENALLRERAVNALGRIGRGNYNLIEPYWSD
LFRFASDDEPKVRLSFIWASENVATNTPDIYENHMSVFESLLHDIDDKVR
MESPEIFRVLGKRRPEFVIPYIEQLQKMAETDSNRVVRIHSLGAIKVTTS
K

>2654570302
MWNMIWPLVLIVGSNCFYNICTKSMPEGTNTFGALTVTYLVGAVLSAVLF
VVSVKPAGVLNEISKINWTSFVLGLVIVGLEAGYVFLYRAGWKVSNGALT
ANICLAIALIVIGFLLYKESISIKQVAGIVVCGFGLFLING

>2654570303|K01153
MKNKELLKRVGYVVLICLSFFVATWYFFENNKICTICWIAIGSKNVYDLV
HRIKNSKKED

I want to filter it and print only the sequences whose header contains "|K", using awk, grep or something similar. Desired output:

>2654570299|K03427
MITGELKNKIDGLWDVFAAGGLVNPLEVIEQITYLMFIKDLDDVDKRKEK
ESAMLGLPYKSIFAGEVKIGDRTIEGTQLKWSVFHDFSAGRMYAIMQEWV
FPFIKNLHSDKNSTYSKYMDDAIFKFPTPLLLSKVVDSLDEIYEIMNSTL
VLDVRGDVYEYLLNKIASAGRNGQFRTPRHIIRMMVEMVEPKADDVICDP
GDLLKVCKTKKTELLFLALFLRMLKVGGRCACIVPDGVLFGSSKAHKDIR
KQVVEENRLEAVISMPSGVFKPYAGVSTAILIFTKTGHGGTDNVWFYDMT
ADGYSLDDKRTPVSENDIPDIIERFKNLDKEIDRERTDKSFMVPKQDIAD
NDYDLSINKYKEVVYEKIEYPPTSEIMADIREIEMEIGKEMDELEKLLNI

>2654570303|K01153
MKNKELLKRVGYVVLICLSFFVATWYFFENNKICTICWIAIGSKNVYDLV
HRIKNSKKED

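Note that the number of lines between one header and the next is not always the same, and a blank line always separates a sequence from the following header.

Can anyone help?

【Question discussion】:

    Tags: bash awk grep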


    【Solution 1】:

    Using awk, this may help:

    awk '/^>/ {f=/\|K/} f' file
    >2654570299|K03427
    MITGELKNKIDGLWDVFAAGGLVNPLEVIEQITYLMFIKDLDDVDKRKEK
    ESAMLGLPYKSIFAGEVKIGDRTIEGTQLKWSVFHDFSAGRMYAIMQEWV
    FPFIKNLHSDKNSTYSKYMDDAIFKFPTPLLLSKVVDSLDEIYEIMNSTL
    VLDVRGDVYEYLLNKIASAGRNGQFRTPRHIIRMMVEMVEPKADDVICDP
    GDLLKVCKTKKTELLFLALFLRMLKVGGRCACIVPDGVLFGSSKAHKDIR
    KQVVEENRLEAVISMPSGVFKPYAGVSTAILIFTKTGHGGTDNVWFYDMT
    ADGYSLDDKRTPVSENDIPDIIERFKNLDKEIDRERTDKSFMVPKQDIAD
    NDYDLSINKYKEVVYEKIEYPPTSEIMADIREIEMEIGKEMDELEKLLNI
    
    >2654570303|K01153
    MKNKELLKRVGYVVLICLSFFVATWYFFENNKICTICWIAIGSKNVYDLV
    HRIKNSKKED
    
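    • On each header line, f is set to 1 if the header contains |K and to 0 otherwise. The bare pattern f then prints every line for which f is true, because awk's default action for a true condition is to print $0.
    • You can inspect the value of f with print.
    • To see which lines (records) are selected, together with the flag:
    awk '/^>/ {f=/\|K/} f {print NR, f}' file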
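
The flag logic can be tried on a small input. The following is a minimal sketch, assuming a made-up three-record sample (the identifiers id1, id2, id3 and the sequences are hypothetical, not taken from the question's database):

```shell
# Hypothetical three-record sample; only the second header contains "|K".
sample='>id1
AAAA

>id2|K00001
BBBB
CCCC

>id3
DDDD'

# On each header line, f becomes 1 if the header contains "|K", else 0.
# The bare pattern "f" prints every line while f is non-zero.
filtered=$(printf '%s\n' "$sample" | awk '/^>/ {f=/\|K/} f')
printf '%s\n' "$filtered"

# Same logic, but printing the record number and the flag of each selected line.
traced=$(printf '%s\n' "$sample" | awk '/^>/ {f=/\|K/} f {print NR, f}')
printf '%s\n' "$traced"
```

The flag stays set until the next header line resets it, so the blank line that follows the matching record is selected as well; the command substitution merely strips it from the end of the captured text.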
    


      【Solution 2】:

      Using awk or sed:

      sed -e '/|K/, /^$/ p; d' database.txt
      awk '/\|K/, /^$/' database.txt
      

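      Both do exactly the same thing: they look for a line containing |K and print from there up to and including the next blank line. In sed the printing is explicit, via p (the d that follows deletes the pattern space, which suppresses sed's automatic print and moves on to the next input line), whereas the awk version relies on awk's implicit default action of printing the record.

      The two tools use slightly different regular-expression dialects in their match syntax: sed uses basic regular expressions by default, where | is an ordinary character, while awk uses extended regular expressions, where | means alternation, so it has to be escaped in the awk example.

      For a deeper understanding of the syntax, both awk and sed are documented in their man pages; see those for more on how each language works.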
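
A quick way to check that the two range commands select the same lines is to run them side by side. This is a minimal sketch on a made-up sample (the identifiers and sequences are hypothetical); the sed script is split into two -e expressions, which should behave the same as the single 'p; d' script above:

```shell
# Hypothetical sample; only the second header contains "|K".
sample='>id1
AAAA

>id2|K00001
BBBB
CCCC

>id3
DDDD'

# sed, basic regex: "|" is literal. p prints lines inside the range,
# d then deletes every line so nothing is auto-printed a second time.
sed_out=$(printf '%s\n' "$sample" | sed -e '/|K/,/^$/p' -e 'd')

# awk, extended regex: "|" is alternation, so it is escaped as "\|".
# A range pattern with no action prints each line in the range.
awk_out=$(printf '%s\n' "$sample" | awk '/\|K/,/^$/')

printf '%s\n' "$sed_out"
[ "$sed_out" = "$awk_out" ] && echo "sed and awk agree"
```

Both ranges end at the first blank line after the match, so this approach depends on every sequence being followed by a blank line, as the question states.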

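      【Discussion】:

      • Works perfectly, thank you! Could you explain the commands so I can understand what they are doing?
      • OK, updated. You will have to check the man pages for more on the specific syntax of each language.
      【Solution 3】:

      If you set the record separator (RS) to the empty string, awk switches to paragraph mode and treats each blank-line-separated section as one record. For example, to find the records that contain |K: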

      awk '/\|K/' RS=
      

      Output:

      >2654570299|K03427
      MITGELKNKIDGLWDVFAAGGLVNPLEVIEQITYLMFIKDLDDVDKRKEK
      ESAMLGLPYKSIFAGEVKIGDRTIEGTQLKWSVFHDFSAGRMYAIMQEWV
      FPFIKNLHSDKNSTYSKYMDDAIFKFPTPLLLSKVVDSLDEIYEIMNSTL
      VLDVRGDVYEYLLNKIASAGRNGQFRTPRHIIRMMVEMVEPKADDVICDP
      GDLLKVCKTKKTELLFLALFLRMLKVGGRCACIVPDGVLFGSSKAHKDIR
      KQVVEENRLEAVISMPSGVFKPYAGVSTAILIFTKTGHGGTDNVWFYDMT
      ADGYSLDDKRTPVSENDIPDIIERFKNLDKEIDRERTDKSFMVPKQDIAD
      NDYDLSINKYKEVVYEKIEYPPTSEIMADIREIEMEIGKEMDELEKLLNI
      >2654570303|K01153
      MKNKELLKRVGYVVLICLSFFVATWYFFENNKICTICWIAIGSKNVYDLV
      HRIKNSKKED
      

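      Now, if you want the output records separated by a blank line and the match restricted to the header line, you can change the field separator (FS) and the output record separator (ORS), for example: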

      awk '$1 ~ /\|K/' RS= FS='\n' ORS='\n\n'
      

      Output:

      >2654570299|K03427
      MITGELKNKIDGLWDVFAAGGLVNPLEVIEQITYLMFIKDLDDVDKRKEK
      ESAMLGLPYKSIFAGEVKIGDRTIEGTQLKWSVFHDFSAGRMYAIMQEWV
      FPFIKNLHSDKNSTYSKYMDDAIFKFPTPLLLSKVVDSLDEIYEIMNSTL
      VLDVRGDVYEYLLNKIASAGRNGQFRTPRHIIRMMVEMVEPKADDVICDP
      GDLLKVCKTKKTELLFLALFLRMLKVGGRCACIVPDGVLFGSSKAHKDIR
      KQVVEENRLEAVISMPSGVFKPYAGVSTAILIFTKTGHGGTDNVWFYDMT
      ADGYSLDDKRTPVSENDIPDIIERFKNLDKEIDRERTDKSFMVPKQDIAD
      NDYDLSINKYKEVVYEKIEYPPTSEIMADIREIEMEIGKEMDELEKLLNI
      
      >2654570303|K01153
      MKNKELLKRVGYVVLICLSFFVATWYFFENNKICTICWIAIGSKNVYDLV
      HRIKNSKKED
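
To see how paragraph mode splits the input, it can help to print the header field and the field count instead of the whole record. The following is a minimal sketch, assuming a made-up sample written to a temporary file (the identifiers, sequences and the tmp variable are hypothetical):

```shell
# Write a hypothetical sample to a temporary file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
>id1
AAAA

>id2|K00001
BBBB
CCCC

>id3|K00002
DDDD
EOF

# RS= (empty) enables paragraph mode: records are separated by blank lines.
# FS='\n' makes each line of a record one field, so $1 is the header
# and NF-1 is the number of sequence lines in that record.
summary=$(awk '$1 ~ /\|K/ {print $1, NF-1}' RS= FS='\n' "$tmp")
printf '%s\n' "$summary"

rm -f "$tmp"
```

Because the test is on $1 only, a "|K" appearing anywhere other than the header line would not select the record.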
      
      


        【Solution 4】:

        I would use GNU AWK for this task. Let the content of file.txt be

        >2654570298
        MRNYSYKGKWEKLLTPEIVKKLTLINEFKGEQRLFIKAHKDELKELSELA
        KIQSTEASNKIEGIFTSDDRFKSLAQAKTTPRNRNESEIAGYRDVLNTIH
        DSYEYIPISASYFLQLHRDLYKFVAKNDVGKFKSSDNIIRETDEKGNERL
        RFRPVPAWETPAAIDELCKAYADAKEEIDPLILNAMFILDFLCIHPFNDG
        NGRMSRLLTLLLLYKTGFIVGKYISIEKIIEESKETYYEVLQDSLVGWHE
        NENDYKPFVNYMLGVIVNAYKEFESRTELVTNPNLTKSDRIREIIKDHIG
        TITKAELLEMNPDISDTTVQRTLAKLLKNNDIKKIGGGRYTKYTWNTEEQ
        
        >2654570299|K03427
        MITGELKNKIDGLWDVFAAGGLVNPLEVIEQITYLMFIKDLDDVDKRKEK
        ESAMLGLPYKSIFAGEVKIGDRTIEGTQLKWSVFHDFSAGRMYAIMQEWV
        FPFIKNLHSDKNSTYSKYMDDAIFKFPTPLLLSKVVDSLDEIYEIMNSTL
        VLDVRGDVYEYLLNKIASAGRNGQFRTPRHIIRMMVEMVEPKADDVICDP
        GDLLKVCKTKKTELLFLALFLRMLKVGGRCACIVPDGVLFGSSKAHKDIR
        KQVVEENRLEAVISMPSGVFKPYAGVSTAILIFTKTGHGGTDNVWFYDMT
        ADGYSLDDKRTPVSENDIPDIIERFKNLDKEIDRERTDKSFMVPKQDIAD
        NDYDLSINKYKEVVYEKIEYPPTSEIMADIREIEMEIGKEMDELEKLLNI
        
        >2654570301
        MNESELYKELGILTKDKSKWAENIQYVSSLLNHESAKIQAKALWLLGEMG
        LEYPDSIQDAVPMVASFCDSENALLRERAVNALGRIGRGNYNLIEPYWSD
        LFRFASDDEPKVRLSFIWASENVATNTPDIYENHMSVFESLLHDIDDKVR
        MESPEIFRVLGKRRPEFVIPYIEQLQKMAETDSNRVVRIHSLGAIKVTTS
        K
        
        >2654570302
        MWNMIWPLVLIVGSNCFYNICTKSMPEGTNTFGALTVTYLVGAVLSAVLF
        VVSVKPAGVLNEISKINWTSFVLGLVIVGLEAGYVFLYRAGWKVSNGALT
        ANICLAIALIVIGFLLYKESISIKQVAGIVVCGFGLFLING
        
        >2654570303|K01153
        MKNKELLKRVGYVVLICLSFFVATWYFFENNKICTICWIAIGSKNVYDLV
        HRIKNSKKED
        

        then

        awk 'BEGIN{RS=ORS="\n\n"}index($0,"|K"){print}' file.txt
        

        Output

        >2654570299|K03427
        MITGELKNKIDGLWDVFAAGGLVNPLEVIEQITYLMFIKDLDDVDKRKEK
        ESAMLGLPYKSIFAGEVKIGDRTIEGTQLKWSVFHDFSAGRMYAIMQEWV
        FPFIKNLHSDKNSTYSKYMDDAIFKFPTPLLLSKVVDSLDEIYEIMNSTL
        VLDVRGDVYEYLLNKIASAGRNGQFRTPRHIIRMMVEMVEPKADDVICDP
        GDLLKVCKTKKTELLFLALFLRMLKVGGRCACIVPDGVLFGSSKAHKDIR
        KQVVEENRLEAVISMPSGVFKPYAGVSTAILIFTKTGHGGTDNVWFYDMT
        ADGYSLDDKRTPVSENDIPDIIERFKNLDKEIDRERTDKSFMVPKQDIAD
        NDYDLSINKYKEVVYEKIEYPPTSEIMADIREIEMEIGKEMDELEKLLNI
        
        >2654570303|K01153
        MKNKELLKRVGYVVLICLSFFVATWYFFENNKICTICWIAIGSKNVYDLV
        HRIKNSKKED
        

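        Explanation: I set the record separator and the output record separator to a double newline, i.e. a blank line, so each section is treated as a single record. Then I use the index function to check whether the section contains |K. This function returns 0 if there is no match, and the position of the match if one is found; print runs only in the latter case. Note that this function takes a string ("|K") rather than a pattern (/\|K/), so I do not have to worry about characters with special meaning such as |. If you want to know more about RS, ORS or other built-in AWK variables, read 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR

        (tested in gawk 4.2.1)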
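
The return values of index can be checked directly on two of the headers from the question. This is a small sketch run in a BEGIN block, so no input file is needed; the variable names hit, miss and regex are only illustrative:

```shell
# index(s, t) returns the 1-based position of string t in s, or 0 if absent.
hit=$(awk 'BEGIN { print index(">2654570299|K03427", "|K") }')
miss=$(awk 'BEGIN { print index(">2654570298", "|K") }')

# The regex form needs the backslash, because a bare | means alternation.
regex=$(awk 'BEGIN { print (">2654570299|K03427" ~ /\|K/) }')

echo "index on matching header:     $hit"
echo "index on non-matching header: $miss"
echo "regex match result:           $regex"
```

Any non-zero position counts as true in an awk pattern, which is why index($0,"|K") can be used on its own as the condition.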

