【问题标题】:Process FASTA file to output header and all matching subsequences处理 FASTA 文件以输出标题和所有匹配的子序列
【发布时间】:2017-10-22 04:13:07
【问题描述】:

我需要在 FASTA 文件中搜索 DNA 序列的某些区域。对于每个匹配,我需要打印序列标题,然后是该序列中的所有匹配。我想打印标题一次,然后是匹配的部分。

以下代码的输出很接近,但匹配的 DNA 区域打印在其标题上方而不是下方。我无法翻转这两个代码块,因为这会切断第一个结果。

# First, I open my file and print a warning if it fails.
unless ( open FILE, "<", '/scratch/SampleDataFiles/test.fasta' ) {
    die "Sorry", $!;
}

$/ = ">";    # This changes the record separator from \n to >, so I can chomp it later.

my @file = <FILE>;
my $file = "@file";
chomp $file;

# To view the file I can--
# print $file;

my $count = 0;    # here I will count the matched regions

my $sequence_count = 0;    # here I will count the sequences
                           # that contain a matched region

foreach $file ( @file ) {

    # I look for each header and its following sequence
    # And count the total sequences in the file
    if ( $file =~ /(.*;.*;?\n)(\w+)/ ) {

        my $head     = $1;
        my $sequence = $2;

        $sequence_count = $sequence_count + 1;

        # Now, I use the sequences I matched and search for a
        # hydrophobic region

        while ( $sequence =~ /([VILMFWCA]{8,}?)/gi ) {

            # I want to know what the position of the match is
            my $pos = pos( $sequence ) - 7;

            print "\n", $1, " found at ", $pos;
        }

        # I use the count variable I made earlier to count up each
        # time I match a sequence that has one or more hydrophobic region

        if ( $sequence =~ /([VILMFWCA]{8,}?)/gi ) {

            print "\n",
                    "Hydrophobic region(s) found in ",
                    $head,
                    "\n",
                    "-------------------------------------",
                    "\n";

            $count = $count + 1;
        }

    }
}

print "Hydrophobic region(s) found in ",
        $count,
        " out of ",
        $sequence_count,
        " sequences.",
        "\n",
        "\n";

这是输出:

AVVAAVMW found at 325
Hydrophobic region(s) found in P30450 | Homo sapiens (Human). | 
NCBI_TaxID=9606; | 365 |    Name=HLA-A; Synonyms=HLAA;

-------------------------------------

VAVLMLCL found at 170
LLALVAIF found at 493
IWICWFAA found at 705
LALALAFA found at 970
Hydrophobic region(s) found in A7MBM2 | Homo sapiens (Human). | 
NCBI_TaxID=9606; | 1401 |    Name=DISP2; Synonyms=DISPB, KIAA1742;

-------------------------------------
Hydrophobic region(s) found in 2 out of 15 sequences.

这是我切换它们时得到的输出:

Hydrophobic region(s) found in P30450 | Homo sapiens (Human). | 
NCBI_TaxID=9606; | 365 |    Name=HLA-A; Synonyms=HLAA;

-------------------------------------

Hydrophobic region(s) found in A7MBM2 | Homo sapiens (Human). | 
NCBI_TaxID=9606; | 1401 |    Name=DISP2; Synonyms=DISPB, KIAA1742;


LLALVAIF found at 493
IWICWFAA found at 705
LALALAFA found at 970

Hydrophobic region(s) found in 2 out of 15 sequences.` 

根据老师的建议,我对代码进行了如下调整,以便将所有内容都包含在较大的 while 循环中,并使用计数器限制打印次数。这个新代码会打印每个新的标题一次,然后在它下面打印找到的 DNA 区域的每个实例(基本上是翻转我之前的)。

新代码:

    my $count      = 0;    # here I will count the matched regions
    my $temp_count = 0;    # this I will use temporarily to count

    my $sequence_count = 0;    # here I will count the sequences
                               # that contain a matched region

    if ( $file =~ /(.*;.*;?\n)(\w+)/ ) {

        my $head     = $1;
        my $sequence = $2;

        $sequence_count = $sequence_count + 1;

        # Now I use the sequences that I found, and
        # search them for a hydrophobic region
        while ( $sequence =~ /([VILMFWCA]{8,}?)/gi ) {

            # I use the count variables I made earlier
            # I count all times I match a sequence that has one or more hydrophobic region
            $temp_count = $temp_count + 1;

            # But I don't want the header repeated for the same sequence, so I limit the
            # times that it can print
            if ( $temp_count <= 2 ) {
                print "\n", "Hydrophobic region(s) found in ", $head, "\n";
                $count = $count + 1;
            }

            # I want to know what the position of the match is
            # within the sequence
            my $pos = pos( $sequence ) - 7;
            print $1, " found at ", $pos, "\n", "\n";
        }
    }
}

print "\n",
        "\n",
        "-------------------------",
        "\n",
        "Hydrophobic region(s) found in ",
        $count,
        " out of ",
        $sequence_count,
        " sequences.",
        "\n",
        "\n";

如果有用,文件如下所示:

>P31946 | Homo sapiens (Human). | NCBI_TaxID=9606; | 246 |    Name=YWHAB;
MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRSSWRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFYLKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRL
GLALNFSVFYYEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGDAGEGEN
>P62258 | Homo sapiens (Human). | NCBI_TaxID=9606; | 255 |    Name=YWHAE;
MDDREDLVYQAKLAEQAERYDEMVESMKKVAGMDVELTVEERNLLSVAYKNVIGARRASWRIISSIEQKEENKGGEDKLKMIREYRQMVETELKLICCDILDVLDKHLIPAANTGESKVFYYKMKGDYHRYLAEFATGNDRKEAAENSLVAYKAASDIAMTELPPTHPIR
LGLALNFSVFYYEILNSPDRACRLAKAAFDDAIAELDTLSEESYKDSTLIMQLLRDNLTLWTSDMQGDGEEQNKEALQDVEDENQ
>Q04917 | Homo sapiens (Human). | NCBI_TaxID=9606; | 246 |    Name=YWHAH; Synonyms=YWHA1;
MGDREQLLQRARLAEQAERYDDMASAMKAVTELNEPLSNEDRNLLSVAYKNVVGARRSSWRVISSIEQKTMADGNEKKLEKVKAYREKIEKELETVCNDVLSLLDKFLIKNCNDFQYESKVFYLKMKGDYYRYLAEVASGEKKNSVVEASEAAYKEAFEISKEQMQPTHP
IRLGLALNFSVFYYEIQNAPEQACLLAKQAFDDAIAELDTLNEDSYKDSTLIMQLLRDNLTLWTSDQQDEEAGEGN
>P30450 | Homo sapiens (Human). | NCBI_TaxID=9606; | 365 |    Name=HLA-A; Synonyms=HLAA;
MAVMAPRTLVLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRFDSDAASQRMEPRAPWIEQEGPEYWDRNTRNVKAHSQTDRANLGTLRGYYNQSEDGSHTIQRMYGCDVGPDGRFLRGYQQDAYDGKDYIALNEDLRSWTAADMAAQITQRK
WETAHEAEQWRAYLEGRCVEWLRRYLENGKETLQRTDAPKTHMTHHAVSDHEATLRCWALSFYPAEITLTWQRDGEDQTQDTELVETRPAGDGTFQKWASVVVPSGQEQRYTCHVQHEGLPKPLTLRWEPSSQPTIPIVGIIAGLVLFGAVIAGAVVAAVMWRRKSSDRK
GGSYSQAASSDSAQGSDMSLTACKV
>Q156A1 | Homo sapiens (Human). | NCBI_TaxID=9606; | 80 |    Name=ATXN8;
MQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ
>Q9UQB9 | Homo sapiens (Human). | NCBI_TaxID=9606; | 309 |    Name=AURKC; Synonyms=AIE2, AIK3, ARK3, STK13;
MSSPRAVVQLGKAQPAGEELATANQTAQQPSSPAMRRLTVDDFEIGRPLGKGKFGNVYLARLKESHFIVALKVLFKSQIEKEGLEHQLRREIEIQAHLQHPNILRLYNYFHDARRVYLILEYAPRGELYKELQKSEKLDEQRTATIIEELADALTYCHDKKVIHRDIKPE
NLLLGFRGEVKIADFGWSVHTPSLRRKTMCGTLDYLPPEMIEGRTYDEKVDLWCIGVLCYELLVGYPPFESASHSETYRRILKVDVRFPLSMPLGARDLISRLLRYQPLERLPLAQILKHPWVQAHSRRVLPPCAQMAS
>O75366 | Homo sapiens (Human). | NCBI_TaxID=9606; | 819 |    Name=AVIL;
MPLTSAFRAVDNDPGIIVWRIEKMELALVPVSAHGNFYEGDCYVILSTRRVASLLSQDIHFWIGKDSSQDEQSCAAIYTTQLDDYLGGSPVQHREVQYHESDTFRGYFKQGIIYKQGGVASGMKHVETNTYDVKRLLHVKGKRNIRATEVEMSWDSFNRGDVFLLDLGKV
IIQWNGPESNSGERLKAMLLAKDIRDRERGGRAEIGVIEGDKEAASPELMKVLQDTLGRRSIIKPTVPDEIIDQKQKSTIMLYHISDSAGQLAVTEVATRPLVQDLLNHDDCYILDQSGTKIYVWKGKGATKAEKQAAMSKALGFIKMKSYPSSTNVETVNDGAESAMFK
QLFQKWSVKDQTMGLGKTFSIGKIAKVFQDKFDVTLLHTKPEVAAQERMVDDGNGKVEVWRIENLELVPVEYQWYGFFYGGDCYLVLYTYEVNGKPHHILYIWQGRHASQDELAASAYQAVEVDRQFDGAAVQVRVRMGTEPRHFMAIFKGKLVIFEGGTSRKGNAEPDP
PVRLFQIHGNDKSNTKAVEVPAFASSLNSNDVFLLRTQAEHYLWYGKGSSGDERAMAKELASLLCDGSENTVAEGQEPAEFWDLLGGKTPYANDKRLQQEILDVQSRLFECSNKTGQFVVTEITDFTQDDLNPTDVMLLDTWDQVFLWIGAEANATEKESALATAQQYLH
THPSGRDPDTPILIIKQGFEPPIFTGWFLAWDPNIWSAGKTYEQLKEELGDAAAIMRITADMKNATLSLNSNDSEPKYYPIAVLLKNQNQELPEDVNPAKKENYLSEQDFVSVFGITRGQFAALPGWKQLQMKKEKGLF
>Q9UPA5 | Homo sapiens (Human). | NCBI_TaxID=9606; | 3926 |    Name=BSN; Synonyms=KIAA0434, ZNF231;
MGNEVSLEGGAGDGPLPPGGAGPGPGPGPGPGAGKPPSAPAGGGQLPAAGAARSTAVPPVPGPGPGPGPGPGPGSTSRRLDPKEPLGNQRAASPTPKQASATTPGHESPRETRAQGPAGQEADGPRRTLQVDSRTQRSGRSPSVSPDRGSTPTSPYSVPQIAPLPSSTLC
PICKTSDLTSTPSQPNFNTCTQCHNKVCNQCGFNPNPHLTQVKEWLCLNCQMQRALGMDMTTAPRSKSQQQLHSPALSPAHSPAKQPLGKPDQERSRGPGGPQPGSRQAETARATSVPGPAQAAAPPEVGRVSPQPPQPTKPSTAEPRPPAGEAPAKSATAVPAGLGATE
QTQEGLTGKLFGLGASLLTQASTLMSVQPEADTQGQPAPSKGTPKIVFNDASKEAGPKPLGSGPGPGPAPGAKTEPGARMGPGSGPGALPKTGGTTSPKHGRAEHQAASKAAAKPKTMPKERAICPLCQAELNVGSKSPANYNTCTTCRLQVCNLCGFNPTPHLVEKTEW
LCLNCQTKRLLEGSLGEPTPLPPPTSQQPPVGAPHRASGTSPLKQKGPQGLGQPSGPLPAKASPLSTKASPLPSKASPQAKPLRASEPSKTPSSVQEKKTRVPTKAEPMPKPPPETTPTPATPKVKSGVRRAEPATPVVKAVPEAPKGGEAEDLVGKPYSQDASRSPQSL
SDTGYSSDGISSSQSEITGVVQQEVEQLDSAGVTGPHPPSPSEIHKVGSSMRPLLQAQGLAPSERSKPLSSGTGEEQKQRPHSLSITPEAFDSDEELEDILEEDEDSAEWRRRREQQDTAESSDDFGSQLRHDYVEDSSEGGLSPLPPQPPARAAELTDEDFMRRQILEM
SAEEDNLEEDDTATSGRGLAKHGTQKGGPRPRPEPSQEPAALPKRRLPHNATTGYEELLPEGGSAEATDGSGTLQGGLRRFKTIELNSTGSYGHELDLGQGPDPSLDREPELEMESLTGSPEDRSRGEHSSTLPASTPSYTSGTSPTSLSSLEEDSDSSPSRRQRLEEAK
QQRKARHRSHGPLLPTIEDSSEEEELREEEELLREQEKMREVEQQRIRSTARKTRRDKEELRAQRRRERSKTPPSNLSPIEDASPTEELRQAAEMEELHRSSCSEYSPSPSLDSEAEALDGGPSRLYKSGSEYNLPTFMSLYSPTETPSGSSTTPSSGRPLKSAEEAYEE
MMRKAELLQRQQGQAAGARGPHGGPSQPTGPRGLGSFEYQDTTDREYGQAAQPAAEGTPASLGAAVYEEILQTSQSIVRMRQASSRDLAFAEDKKKEKQFLNAESAYMDPMKQNGGPLTPGTSPTQLAAPVSFSTPTSSDSSGGRVIPDVRVTQHFAKETQDPLKLHSSP
ASPSSASKEIGMPFSQGPGTPATTAVAPCPAGLPRGYMTPASPAGSERSPSPSSTAHSYGHSPTTANYGSQTEDLPQAPSGLAAAGRAAREKPLSASDGEGGTPQPSRAYSYFASSSPPLSPSSPSESPTFSPGKMGPRATAEFSTQTPSPAPASDMPRSPGAPTPSPMV
AQGTQTPHRPSTPRLVWQESSQEAPFMVITLASDASSQTRMVHASASTSPLCSPTETQPTTHGYSQTTPPSVSQLPPEPPGPPGFPRVPSAGADGPLALYGWGALPAENISLCRISSVPGTSRVEPGPRTPGTAVVDLRTAVKPTPIILTDQGMDLTSLAVEARKYGLAL
DPIPGRQSTAVQPLVINLNAQEHTFLATATTVSITMASSVFMAQQKQPVVYGDPYQSRLDFGQGGGSPVCLAQVKQVEQAVQTAPYRSGPRGRPREAKFARYNLPNQVAPLARRDVLITQMGTAQSIGLKPGPVPEPGAEPHRATPAELRSHALPGARKPHTVVVQMGEG
TAGTVTTLLPEEPAGALDLTGMRPESQLACCDMVYKLPFGSSCTGTFHPAPSVPEKSMADAAPPGQSSSPFYGPRDPEPPEPPTYRAQGVVGPGPHEEQRPYPQGLPGRLYSSMSDTNLAEAGLNYHAQRIGQLFQGPGRDSAMDLSSLKHSYSLGFADGRYLGQGLQYG
SVTDLRHPTDLLAHPLPMRRYSSVSNIYSDHRYGPRGDAVGFQEASLAQYSATTAREISRMCAALNSMDQYGGRHGSGGGGPDLVQYQPQHGPGLSAPQSLVPLRPGLLGNPTFPEGHPSPGNLAQYGPAAGQGTAVRQLLPSTATVRAADGMIYSTINTPIAATLPITT
QPASVLRPMVRGGMYRPYASGGITAVPLTSLTRVPMIAPRVPLGPTGLYRYPAPSRFPIASSVPPAEGPVYLGKPAAAKAPGAGGPSRPEMPVGAAREEPLPTTTPAAIKEAAGAPAPAPLAGQKPPADAAPGGGSGALSRPGFEKEEASQEERQRKQQEQLLQLERERV
ELEKLRQLRLQEELERERVELQRHREEEQLLVQRELQELQTIKHHVLQQQQEERQAQFALQREQLAQQRLQLEQIQQLQQQLQQQLEEQKQRQKAPFPAACEAPGRGPPLAAAELAQNGQYWPPLTHAAFIAMAGPEGLGQPREPVLHRGLPSSASDMSLQTEEQWEASR
SGIKKRHSMPRLRDACELESGTEPCVVRRIADSSVQTDDEDGESRYLLSRRRRARRSADCSVQTDDEDSAEWEQPVRRRRSRLPRHSDSGSDSKHDATASSSSAAATVRAMSSVGIQTISDCSVQTEPDQLPRVSPAIHITAATDPKVEIVRYISAPEKTGRGESLACQT
EPDGQAQGVAGPQLVGPTAISPYLPGIQIVTPGPLGRFEKKKPDPLEIGYQAHLPPESLSQLVSRQPPKSPQVLYSPVSPLSPHRLLDTSFASSERLNKAHVSPQKHFTADSALRQQTLPRPMKTLQRSLSDPKPLSPTAEESAKERFSLYQHQGGLGSQVSALPPNSLV
RKVKRTLPSPPPEEAHLPLAGQASPQLYAASLLQRGLTGPTTVPATKASLLRELDRDLRLVEHESTKLRKKQAELDEEEKEIDAKLKYLELGITQRKESLAKDRGGRDYPPLRGLGEHRDYLSDSELNQLRLQGCTTPAGQFVDFPATAAAPATPSGPTAFQQPRFQPPA
PQYSAGSGGPTQNGFPAHQAPTYPGPSTYPAPAFPPGASYPAEPGLPNQQAFRPTGHYAGQTPMPTTQSTLFPVPADSRAPLQKPRQTSLADLEQKVPTNYEVIASPVVPMSSAPSETSYSGPAVSSGYEQGKVPEVPRAGDRGSVSQSPAPTYPSDSHYTSLEQNVPRN
YVMIDDISELTKDSTSTAPDSQRLEPLGPGSSGRPGKEPGEPGVLDGPTLPCCYARGEEESEEDSYDPRGKGGHLRSMESNGRPASTHYYGDSDYRHGARVEKYGPGPMGPKHPSKSLAPAAISSKRSKHRKQGMEQKISKFSPIEEAKDVESDLASYPPPAVSSSLVSR
GRKFQDEITYGLKKNVYEQQKYYGMSSRDAVEDDRIYGGSSRSRAPSAYSGEKLSSHDFSGWGKGYEREREAVERLQKAGPKPSSLSMAHSRVRPPMRSQASEEESPVSPLGRPRPAGGPLPPGGDTCPQFCSSHSMPDVQEHVKDGPRAHAYKREEGYILDDSHCVVSD
SEAYHLGQEETDWFDKPRDARSDRFRHHGGHAVSSSSQKRGPARHSYHDYDEPPEEGLWPHDEGGPGRHASAKEHRHGDHGRHSGRHTGEEPGRRAAKPHARDLGRHEARPHSQPSSAPAMPKKGQPGYPSSAEYSQPSRASSAYHHASDSKKGSRQAHSGPAALQSKAE
PQAQPQLQGRQAAPGPQQSQSPSSRQIPSGAASRQPQTQQQQQGLGLQPPQQALTQARLQQQSQPTTRGSAPAASQPAGKPQPGPSTATGPQPAGPPRAEQTNGSKGTAKAPQQGRAPQAQPAPGPGPAGVKAGARPGGTPGAPAGQPGADGESVFSKILPGGAAEQAGK
LTEAVSAFGKKFSSFW
>Q9NSI6 | Homo sapiens (Human). | NCBI_TaxID=9606; | 2320 |    Name=BRWD1; Synonyms=C21orf107, WDR9;
MAEPSSARRPVPLIESELYFLIARYLSAGPCRRAAQVLVQELEQYQLLPKRLDWEGNEHNRSYEELVLSNKHVAPDHLLQICQRIGPMLDKEIPPSISRVTSLLGAGRQSLLRTAKDCRHTVWKGSAFAALHRGRPPEMPVNYGSPPNLVEIHRGKQLTGCSTFSTAFPG
TMYQHIKMHRRILGHLSAVYCVAFDRTGHRIFTGSDDCLVKIWSTHNGRLLSTLRGHSAEISDMAVNYENTMIAAGSCDKIIRVWCLRTCAPVAVLQGHTGSITSLQFSPMAKGSQRYMVSTGADGTVCFWQWDLESLKFSPRPLKFTEKPRPGVQMLCSSFSVGGMFLA
TGSTDHVIRMYFLGFEAPEKIAELESHTDKVDSIQFCNNGDRFLSGSRDGTARIWRFEQLEWRSILLDMATRISGDLSSEEERFMKPKVTMIAWNQNDSIVVTAVNDHVLKVWNSYTGQLLHNLMGHADEVFVLETHPFDSRIMLSAGHDGSIFIWDITKGTKMKHYFNM
IEGQGHGAVFDCKFSQDGQHFACTDSHGHLLIFGFGCSKPYEKIPDQMFFHTDYRPLIRDSNNYVLDEQTQQAPHLMPPPFLVDVDGNPHPTKYQRLVPGRENSADEHLIPQLGYVATSDGEVIEQIISLQTNDNDERSPESSILDGMIRQLQQQQDQRMGADQDTIPRG
LSNGEETPRRGFRRLSLDIQSPPNIGLRRSGQVEGVRQMHQNAPRSQIATERDLQAWKRRVVVPEVPLGIFRKLEDFRLEKGEEERNLYIIGRKRKTLQLSHKSDSVVLVSQSRQRTCRRKYPNYGRRNRSWRELSSGNESSSSVRHETSCDQSEGSGSSEEDEWRSDRK
SESYSESSSDSSSRYSDWTADAGINLQPPLRTSCRRRITRFCSSSEDEISTENLSPPKRRRKRKKENKPKKENLRRMTPAELANMEHLYEFHPPVWITDTTLRKSPFVPQMGDEVIYFRQGHEAYIEAVRRNNIYELNPNKEPWRKMDLRDQELVKIVGIRYEVGPPTLC
CLKLAFIDPATGKLMDKSFSIRYHDMPDVIDFLVLRQFYDEARQRNWQSCDRFRSIIDDAWWFGTVLSQEPYQPQYPDSHFQCYIVRWDNTEIEKLSPWDMEPIPDNVDPPEELGASISVTTDELEKLLYKPQAGEWGQKSRDEECDRIISGIDQLLNLDIAAAFAGPVD
LCTYPKYCTVVAYPTDLYTIRMRLVNRFYRRLSALVWEVRYIEHNARTFNEPESVIARSAKKITDQLLKFIKNQHCTNISELSNTSENDEQNAEDLDDSDLPKTSSGRRRVHDGKKSIRATNYVESNWKKQCKELVNLIFQCEDSEPFRQPVDLVEYPDYRDIIDTPMDF
GTVRETLDAGNYDSPLEFCKDIRLIFSNAKAYTPNKRSKIYSMTLRLSALFEEKMKKISSDFKIGQKFNEKLRRSQRFKQRQNCKGDSQPNKSIRNLKPKRLKSQTKIIPELVGSPTQSTSSRTAYLGTHKTSAGISSGVTSGDSSDSAESSERRKRNRPITNGSTLSES
EVEDSLATSLSSSASSSSEESKESSRARESSSRSGLSRSSNLRVTRTRAAQRKTGPVSLANGCGRKATRKRVYLSDSDNNSLETGEILKARAGNNRKVLRKCAAVAANKIKLMSDVEENSSSESVCSGRKLPHRNASAVARKKLLHNSEDEQSLKSEIEEEELKDENQPL
PVSSSHTAQSNVDESENRDSESESDLRVARKNWHANGYKSHTPAPSKTKFLKIESSEEDSKSHDSDHACNRTAGPSTSVQKLKAESISEEADSEPGRSGGRKYNTFHKNASFFKKTKILSDSEDSESEEQDREDGKCHKMEMNPISGNLNCDPIAMSQCSSDHGCETDLD
SDDDKIEKPNNFMKDSASQDNGLSRKISRKRVCSSDSDSSLQVVKKSSKARTGLLRITRRCAATAANKIKLMSDVEDVSLENVHTRSKNGRKKPLHLACTTAKKKLSDCEGSVHCEVPSEQYACEGKPPDPDSEGSTKVLSQALNGDSDSEDMLNSEHKHRHTNIHKIDA
PSKRKSSSVTSSGEDSKSHIPGSETDRTFSSESTLAQKATAENNFEVELNYGLRRWNGRRLRTYGKAPFSKTKVIHDSQETAEKEVKRKRSHPELENVKISETTGNSKFRPDTSSKSSDLGSVTESDIDCTDNTKTKRRKTKGKAKVVRKEFVPRDREPNTKVRTCMHNQ
KDAVQMPSETLKAKMVPEKVPRRCATVAANKIKIMSNLKETISGPENVWIRKSSRKLPHRNASAAAKKKLLNVYKEDDTTINSESEKELEDINRKMLFLRGFRSWKENAQ
>Q96KE9 | Homo sapiens (Human). | NCBI_TaxID=9606; | 485 |    Name=BTBD6; Synonyms=BDPL;
MAAELYAPASAAAADLANSNAGAAVGRKAGPRSPPSAPAPAPPPPAPAPPTLGNNHQESPGWRCCRPTLRERNALMFNNELMADVHFVVGPPGATRTVPAHKYVLAVGSSVFYAMFYGDLAEVKSEIHIPDVEPAAFLILLKYMYSDEIDLEADTVLATLYAAKKYIVPALAKACVNFLETSLEAKNACVLLSQSRLFEEPELTQRCWEVIDAQAEMALRSEGFCEIDRQTLEIIVTREALNTKEAVVFEAVLNWAEAECKRQGLPITPRNKRHVLGRALYLVRIPTMTLEEFANGAAQSDILTLEETHSIFLWYTATNKPRLDFPLTKRKGLAPQRCHRFQSSAYRSNQWRYRGRCDSIQFAVDRRVFIAGLGLYGSSSGKAEYSVKIELKRLGVVLAQNLTKFMSDGSSNTFPVWFEHPVQVEQDTFYTASAVLDGSELSYFGQEGMTEVQCGKVAFQFQCSSDSTNGTGVQGGQIPELIFYA
>P0C7T9 | Homo sapiens (Human). | NCBI_TaxID=9606; | 278 |    Name=BZW1L1;
MENSERNKLAMLTGVLLANGTLNASILNSLYNENLVKEGVSAAFAVKLFKSWINEKDINAVAASLRKVSMDNRLMELFPANKQSVEHFTKYFTEAGLKELSEYVRNQQTIGARKELQKELQEQMSRGDPFKDIILYVKEEMKKNNIPEPVVIGIVWSSVMSTVEWNKKEELVAEQAIKHLKQYSPLLAAFTTQGQSELTLLLKIQEYCYDNIHFMKAFQKIVVLFYKAEVLSEEPILKWYKDAHVAKGKSVFLEQMKKFVEWLKNAEEESESEAEEGD
>Q8IYA2 | Homo sapiens (Human). | NCBI_TaxID=9606; | 1237 |    Name=CCDC144C;
MVSWGGEKRGGAEGSPKPAVYATRKTGSVRSQEDQWYLGYPGDQWSSGFSYSWWKNSVGSESKHGEGALDQPQHDVRLEDLGELHRAARSGDVPGVEHVLVPGDTGVDKRDRKKSIQQLVPEYKEKQTPESLPQNNNPDWHPTNLTLSDETCQRSKNLKVDDKCPSVSPSMPENQSATKELGQMNLTEREKMDTGVKTSQEPEMAKDCDREDIPIYPVLPHVQKSEEMRIEQGKLEWKNQLKLVINELKQRFGEIYEKYKIPACPEEEPLLDNSTRGTDVKDIPFNLTNNIPGCEEEDASEISVSVVFETFPEQKEPSLKNIIHSYYHPYSGSQEHVCQSSSKLHLHENKLDCDNDNKPGIGHIFSTDKNFHNDASTKKARNPEVVTVEMKEDQEFDLQMTKNMNQNSDSGSTNNYKSLKPKLENLSSLPPDSDRTSEVYLHEELQQDMQKFKNEVNTLEEEFLALKKENVQLHKEVEEEMEKHRSNSTELSGTLTDGTTVGNDDDGLNQQIPRKENGEHDRLALKQENEEKRNADMLYNKDSEQLRIKEEECGKVVETKQQLKWNLRRLVKELRTVVQERNDAQKQLSEEQDARILQDQILTSKQKELEMAQKKRNPEISHRHQKEKDLFHENCMLQEEIALLRLEIDTIKNQNKQKEKKYFEDIEVVKEKNDNLQKIIKRNEETLTETILQYSGQLNNLTAENKMLNSELENGKENQERLEIEMESYRCRLAAAVHDCDQSQTARDLKLDFQRTRQEWVRLHDKMKVDMSGLQAKNEILSEKLSNAESKINSLQIQLHNTRDALGRESLILERVQRDLSQTQCQKKETEQMYQSKLKKYIAKQESVEERLSQLQSENMLLRQQLDDVHKKANSQEKTISTIQDQFHSAAKNLQAESEKQILSLQEKNKELMDEYNHLKERMDQCEKEKAGRKIDLTEAQETVPSRCLHLDAENEVLQLQQTLFSMKAIQKQCETLQKNKKQLKQEVVNLKSYMERNMLERGEAEWHKLLIEERARKEIEEKLNEAILTLQKQAAVSHEQLAQLREDNTTSIKTQMELTVIDLESEISRIKTSQADFNKTKLERYKELYLEEVKVRESLSNELSRTNEMIAEVSTQLTVEKEQTRSRSLFTAYATRPVLESPCVGNLNDSEGLNRKHIPRKKRSALKDMESYLLKMQQKLQNDLTAEVAGSSQTGLHRIPQCSSFSSSSLHLLLCSICQPFFLILQLLLNMNLDPI
>A7MBM2 | Homo sapiens (Human). | NCBI_TaxID=9606; | 1401 |    Name=DISP2; Synonyms=DISPB, KIAA1742;
MDGDSSSSSGGSGPAPGPGPEGEQRPEGEPLAPDGGSPDSTQTKAVPPEASPERSCSLHSCPLEDPSSSSGPPPTTSTLQPVGPSSPLAPAHFTYPRALQEYQGGSSLPGLGDRAALCSHGSSLSPSPAPSQRDGTWKPPAVQHHVVSVRQERAFQMPKSYSQLIAEWPVAVLMLCLAVIFLCTLAGLLGARLPDFSKPLLGFEPRDTDIGSKLVVWRALQALTGPRKLLFLSPDLELNSSSSHNTLRPAPRGSAQESAVRPRRMVEPLEDRRQENFFCGPPEKSYAKLVFMSTSSGSLWNLHAIHSMCRMEQDQIRSHTSFGALCQRTAANQCCPSWSLGNYLAVLSNRSSCLDTTQADAARTLALLRTCALYYHSGALVPSCLGPGQNKSPRCAQVPTKCSQSSAIYQLLHFLLDRDFLSPQTTDYQVPSLKYSLLFLPTPKGASLMDIYLDRLATPWGLADNYTSVTGMDLGLKQELLRHFLVQDTVYPLLALVAIFFGMALYLRSLFLTLMVLLGVLGSLLVAFFLYQVAFRMAYFPFVNLAALLLLSSVCANHTLIFFDLWRLSKSQLPSGGLAQRVGRTMHHFGYLLLVSGLTTSAAFYASYLSRLPAVRCLALFMGTAVLVHLALTLVWLPASAVLHERYLARGCARRARGRWEGSAPRRLLLALHRRLRGLRRAAAGTSRLLFQRLLPCGVIKFRYIWICWFAALAAGGAYIAGVSPRLRLPTLPPPGGQVFRPSHPFERFDAEYRQLFLFEQLPQGEGGHMPVVLVWGVLPVDTGDPLDPRSNSSLVRDPAFSASGPEAQRWLLALCHRARNQSFFDTLQEGWPTLCFVETLQRWMESPSCARLGPDLCCGHSDFPWAPQFFLHCLKMMALEQGPDGTQDLGLRFDAHGSLAALVLQFQTNFRNSPDYNQTQLFYNEVSHWLAAELGMAPPGLRRGWFTSRLELYSLQHSLSTEPAVVLGLALALAFATLLLGTWNVPLSLFSVAAVAGTVLLTVGLLVLLEWQLNTAEALFLSASVGLSVDFTVNYCISYHLCPHPDRLSRVAFSLRQTSCATAVGAAALFAAGVLMLPATVLLYRKLGIILMMVKCVSCGFASFFFQSLCCFFGPEKNCGQILWPCAHLPWDAGTGDPGGEKAGRPRPGSVGGMPGSCSEQYELQPLARRRSPSFDTSTATSKLSHRPSVLSEDLQLHDGPCCSRPPPAPASPRELLLDHQAVFSQCPALQTSSPYKQAGPSPKTRARQDSQGEEAEPLPASPEAPAHSPKAKAADPPDGFCSSASTLEGLSVSDETCLSTSEPSARVPDSVGVSPDDLDDTGQPVLERGQLNGKRDTLWLALRETVYDPSLPASHHSSLSWKGRGGPGDGSPVVLPNSQPDLPDVWLRRPSTHTSGYSS
>Q96HU8 | Homo sapiens (Human). | NCBI_TaxID=9606; | 199 |    Name=DIRAS2;
MPEQSNDYRVAVFGAGGVGKSSLVLRFVKGTFRESYIPTVEDTYRQVISCDKSICTLQITDTTGSHQFPAMQRLSISKGHAFILVYSITSRQSLEELKPIYEQICEIKGDVESIPIMLVGNKCDESPSREVQSSEAEALARTWKCAFMETSAKLNHNVKELFQELLNLEKRRTVSLQIDGKKSKQQKRKEKLKGKCVIM
>Q8N4W6 | Homo sapiens (Human). | NCBI_TaxID=9606; | 341 |    Name=DNAJC22;
MAKGLLVTYALWAVGGPAGLHHLYLGRDSHALLWMLTLGGGGLGWLWEFWKLPSFVAQANRAQGQRQSPRGVTPPLSPIRFAAQVIVGIYFGLVALISLSSMVNFYIVALPLAVGLGVLLVAAVGNQTSDFKNTLGSAFLTSPIFYGRPIAILPISVAASITAQRHRRYKALVASEPLSVRLYRLGLAYLAFTGPLAYSALCNTAATLSYVAETFGSFLNWFSFFPLLGRLMEFVLLLPYRIWRLLMGETGFNSSCFQEWAKLYEFVHSFQDEKRQLAYQVLGLSEGATNEEIHRSYQELVKVWHPDHNLDQTEEAQRHFLEIQAAYEVLSQPRKPWGSRR

Desired Output

主要问题

我只是觉得我的原始代码虽然颠倒了,但更可靠,因为我不必告诉它打印标题的次数,它只是自行查找并仅打印唯一的标题。有没有更好的方法只打印标题的新实例,但打印随后所需序列的所有匹配项?我找不到指定仅打印唯一匹配项的方法,并且我不确定是否尝试将所有标头和匹配区域发送到哈希中(我不知道该怎么做)。

【问题讨论】:

  • 这可能会让您感到惊讶,但您的问题并不清楚。通常将大量代码转储到 SO 以便有人使它变得更好是行不通的。我建议你大大简化你所拥有的,并提出一个更有针对性的问题。
  • codereview.stackexchange.com 上发帖可能更合适,因为您的代码中有很多可以改进的地方,但不在 SO 问题的范围内。
  • @xxfelixxx 对不起,如果我不清楚 :-(。我不是要求对我的代码进行全面审查,我现在才编码一个月,我意识到它并不完美。我只是如果需要澄清我的问题,提供代码.我试图显示我想要的输出顺序以及为什么我这样做.我只是在寻找有关搜索包含许多标题的文件的建议,打印每个标题如果其以下数据包含与我的正则表达式匹配,则一次,然后在该标题下方打印所有匹配项,正如我的帖子末尾所说。
  • 输入文件的 sn-p 可能有用。
  • 我在 codereview 上发帖。我知道这是否是我提出问题的更合适的场所。我认为使用 while 语句与正则表达式匹配可能有一个简单的答案,只打印新匹配项(而不是重复匹配项),而不是使用 if 语句(因为这会阻止我的下一部分打印所有匹配项。跨度>

标签: regex perl header unique fasta


【解决方案1】:

这就是我要写的。

我做了几项重大更改

  • 我使用chomp从下一个序列的头部删除&gt;,然后检查是否还有一些非空格字符。读取的第一条记录将只是 &gt;,所以这会丢弃空记录

  • 我已经从您的正则表达式模式中删除了分号,因为它们没有太大区别,并添加了一些可选的空格来修剪标题上的任何前导和尾随空格

  • 我已经从[VILMFWCA]{8,} 中删除了非贪婪修饰符?,因为我没有看到它的原因。也许我错了。我还对其进行了更改,以便找到所有匹配的序列,即使它们重叠。再说一次,也许这是一个糟糕的决定:我不是生物信息学家!

  • 将每个区域的位置计算为pos() - 7 是错误的,因为它取决于匹配的长度。我改用了内置数组@-$-[1] 包含捕获开始的位置$1$-[2] 用于$2等。而$-[0]是整场比赛的位置

  • 我将匹配区域及其起始位置保留在数组@regions 中。搜索完成后,我可以通过检查大小来测试是否找到了

use strict;
use warnings 'all';

my $FASTA_FILE = 'test.fasta';


open my $fh, '<', $FASTA_FILE or die qq{Unable to open "$FASTA_FILE" for input: $!};

local $/ = '>';

while ( <$fh> ) {
    chomp;
    next unless /\S/;

    next unless my ( $head, $seq ) = /\s*(.*\S)\s*\n(\w+)/;

    my @regions;

    while ( $seq =~ / (?= ( [VILMFWCA]{8,} ) ) /gxi ) {
        push @regions, [ $1, $-[1] ];
    }

    next unless @regions; # Skip this sequence if no regions found

    printf "%d Hydrophobic region%s found in %s\n",
            scalar @regions,
            @regions == 1 ? "" : "s",
            $head;

    printf "    %s found at %d\n", @$_ for @regions;
}

输出

14 Hydrophobic regions found in A7MBM2 | Homo sapiens (Human). | NCBI_TaxID=9606; | 1401 |    Name=DISP2; Synonyms=DISPB, KIAA1742; DGDSSSSSGGSGPAPGPGPEGEQRPEGEPLAPDGGSPDSTQTKAVPPEASPERSCSLHSCPLEDPSSSSGPPPTTSTLQP
    VAVLMLCLAVIFLC found at 88
    AVLMLCLAVIFLC found at 89
    VLMLCLAVIFLC found at 90
    LMLCLAVIFLC found at 91
    MLCLAVIFLC found at 92
    LCLAVIFLC found at 93
    CLAVIFLC found at 94
    LLALVAIFF found at 411
    LALVAIFF found at 412
    IWICWFAALAA found at 623
    WICWFAALAA found at 624
    ICWFAALAA found at 625
    CWFAALAA found at 626
    LALALAFA found at 888

【讨论】:

    【解决方案2】:

    我所做的是从序列匹配 if 语句正则表达式中删除全局修饰符,但在 while 语句正则表达式之后保留修饰符。这样我就不会像以前那样丢失第一场比赛,但我仍然能够在序列匹配之前打印标题。

        unless (open FILE, "<", '/scratch/SampleDataFiles/test.fasta') {
        die "Cannot Open File", $!;
        }
    
        $/ = ">";       
    
        my @file = <FILE>;
        my $file = "@file";
        chomp $file;
    
        my $count = 0;    
        my $sequence_count = 0;   
    
        foreach $file (@file) {
    
                   if ($file =~ /(.*;.*;?\n)(\w+)/) {
    
                       my $head = $1;
                       my $sequence = $2;
                       $sequence_count = $sequence_count +1;
    
                       if ($sequence =~ /([VILMFWCA]{8,}?)/i) {
    
                                print "\n", "Hydrophobic region(s) found in ", $head, "\n";
                                $count = $count +1;
    
                       }
    
    
                        while ($sequence =~ /([VILMFWCA]{8,}?)/gi) {
    
                                my $pos = pos($sequence)-7;
                                print $1, " found at ", $pos, "\n", "\n";
    
                        }
                   }
            }
    
    
    print "\n", "\n", "-------------------------", "\n", "Hydrophobic region(s) 
    found in ", $count, " out of ", $sequence_count , " sequences.", "\n", "\n";
    
    
    
    close FILE;
    

    谢谢你们的帮助,伙计们!

    【讨论】:

      猜你喜欢
      • 1970-01-01
      • 1970-01-01
      • 1970-01-01
      • 1970-01-01
      • 1970-01-01
      • 2016-04-04
      • 2015-12-27
      • 1970-01-01
      • 1970-01-01
      相关资源
      最近更新 更多