【发布时间】:2017-10-22 04:13:07
【问题描述】:
我需要在 FASTA 文件中搜索 DNA 序列的某些区域。对于每个匹配,我需要打印序列标题,然后是该序列中的所有匹配。我想打印标题一次,然后是匹配的部分。
以下代码的输出很接近,但匹配的 DNA 区域打印在其标题上方而不是下方。我无法翻转这两个代码块,因为这会切断第一个结果。
# First, I open my file and print a warning if it fails.
unless ( open FILE, "<", '/scratch/SampleDataFiles/test.fasta' ) {
die "Sorry", $!;
}
$/ = ">"; # This changes the record separator from \n to >, so I can chomp it later.
my @file = <FILE>;
my $file = "@file";
chomp $file;
# To view the file I can--
# print $file;
my $count = 0; # here I will count the matched regions
my $sequence_count = 0; # here I will count the sequences
# that contain a matched region
foreach $file ( @file ) {
# I look for each header and its following sequence
# And count the total sequences in the file
if ( $file =~ /(.*;.*;?\n)(\w+)/ ) {
my $head = $1;
my $sequence = $2;
$sequence_count = $sequence_count + 1;
# Now, I use the sequences I matched and search for a
# hydrophobic region
while ( $sequence =~ /([VILMFWCA]{8,}?)/gi ) {
# I want to know what the position of the match is
my $pos = pos( $sequence ) - 7;
print "\n", $1, " found at ", $pos;
}
# I use the count variable I made earlier to count up each
# time I match a sequence that has one or more hydrophobic region
if ( $sequence =~ /([VILMFWCA]{8,}?)/gi ) {
print "\n",
"Hydrophobic region(s) found in ",
$head,
"\n",
"-------------------------------------",
"\n";
$count = $count + 1;
}
}
}
print "Hydrophobic region(s) found in ",
$count,
" out of ",
$sequence_count,
" sequences.",
"\n",
"\n";
这是输出:
AVVAAVMW found at 325
Hydrophobic region(s) found in P30450 | Homo sapiens (Human). |
NCBI_TaxID=9606; | 365 | Name=HLA-A; Synonyms=HLAA;
-------------------------------------
VAVLMLCL found at 170
LLALVAIF found at 493
IWICWFAA found at 705
LALALAFA found at 970
Hydrophobic region(s) found in A7MBM2 | Homo sapiens (Human). |
NCBI_TaxID=9606; | 1401 | Name=DISP2; Synonyms=DISPB, KIAA1742;
-------------------------------------
Hydrophobic region(s) found in 2 out of 15 sequences.
这是我切换它们时得到的输出:
Hydrophobic region(s) found in P30450 | Homo sapiens (Human). |
NCBI_TaxID=9606; | 365 | Name=HLA-A; Synonyms=HLAA;
-------------------------------------
Hydrophobic region(s) found in A7MBM2 | Homo sapiens (Human). |
NCBI_TaxID=9606; | 1401 | Name=DISP2; Synonyms=DISPB, KIAA1742;
LLALVAIF found at 493
IWICWFAA found at 705
LALALAFA found at 970
Hydrophobic region(s) found in 2 out of 15 sequences.`
根据老师的建议,我对代码进行了如下调整,以便将所有内容都包含在较大的 while 循环中,并使用计数器限制打印次数。这个新代码会打印每个新的标题一次,然后在它下面打印找到的 DNA 区域的每个实例(基本上是翻转我之前的)。
新代码:
my $count = 0; # here I will count the matched regions
my $temp_count = 0; # this I will use temporarily to count
my $sequence_count = 0; # here I will count the sequences
# that contain a matched region
if ( $file =~ /(.*;.*;?\n)(\w+)/ ) {
my $head = $1;
my $sequence = $2;
$sequence_count = $sequence_count + 1;
# Now I use the sequences that I found, and
# search them for a hydrophobic region
while ( $sequence =~ /([VILMFWCA]{8,}?)/gi ) {
# I use the count variables I made earlier
# I count all times I match a sequence that has one or more hydrophobic region
$temp_count = $temp_count + 1;
# But I don't want the header repeated for the same sequence, so I limit the
# times that it can print
if ( $temp_count <= 2 ) {
print "\n", "Hydrophobic region(s) found in ", $head, "\n";
$count = $count + 1;
}
# I want to know what the position of the match is
# within the sequence
my $pos = pos( $sequence ) - 7;
print $1, " found at ", $pos, "\n", "\n";
}
}
}
print "\n",
"\n",
"-------------------------",
"\n",
"Hydrophobic region(s) found in ",
$count,
" out of ",
$sequence_count,
" sequences.",
"\n",
"\n";
如果有用,文件如下所示:
>P31946 | Homo sapiens (Human). | NCBI_TaxID=9606; | 246 | Name=YWHAB;
MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRSSWRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFYLKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRL
GLALNFSVFYYEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGDAGEGEN
>P62258 | Homo sapiens (Human). | NCBI_TaxID=9606; | 255 | Name=YWHAE;
MDDREDLVYQAKLAEQAERYDEMVESMKKVAGMDVELTVEERNLLSVAYKNVIGARRASWRIISSIEQKEENKGGEDKLKMIREYRQMVETELKLICCDILDVLDKHLIPAANTGESKVFYYKMKGDYHRYLAEFATGNDRKEAAENSLVAYKAASDIAMTELPPTHPIR
LGLALNFSVFYYEILNSPDRACRLAKAAFDDAIAELDTLSEESYKDSTLIMQLLRDNLTLWTSDMQGDGEEQNKEALQDVEDENQ
>Q04917 | Homo sapiens (Human). | NCBI_TaxID=9606; | 246 | Name=YWHAH; Synonyms=YWHA1;
MGDREQLLQRARLAEQAERYDDMASAMKAVTELNEPLSNEDRNLLSVAYKNVVGARRSSWRVISSIEQKTMADGNEKKLEKVKAYREKIEKELETVCNDVLSLLDKFLIKNCNDFQYESKVFYLKMKGDYYRYLAEVASGEKKNSVVEASEAAYKEAFEISKEQMQPTHP
IRLGLALNFSVFYYEIQNAPEQACLLAKQAFDDAIAELDTLNEDSYKDSTLIMQLLRDNLTLWTSDQQDEEAGEGN
>P30450 | Homo sapiens (Human). | NCBI_TaxID=9606; | 365 | Name=HLA-A; Synonyms=HLAA;
MAVMAPRTLVLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRFDSDAASQRMEPRAPWIEQEGPEYWDRNTRNVKAHSQTDRANLGTLRGYYNQSEDGSHTIQRMYGCDVGPDGRFLRGYQQDAYDGKDYIALNEDLRSWTAADMAAQITQRK
WETAHEAEQWRAYLEGRCVEWLRRYLENGKETLQRTDAPKTHMTHHAVSDHEATLRCWALSFYPAEITLTWQRDGEDQTQDTELVETRPAGDGTFQKWASVVVPSGQEQRYTCHVQHEGLPKPLTLRWEPSSQPTIPIVGIIAGLVLFGAVIAGAVVAAVMWRRKSSDRK
GGSYSQAASSDSAQGSDMSLTACKV
>Q156A1 | Homo sapiens (Human). | NCBI_TaxID=9606; | 80 | Name=ATXN8;
MQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ
>Q9UQB9 | Homo sapiens (Human). | NCBI_TaxID=9606; | 309 | Name=AURKC; Synonyms=AIE2, AIK3, ARK3, STK13;
MSSPRAVVQLGKAQPAGEELATANQTAQQPSSPAMRRLTVDDFEIGRPLGKGKFGNVYLARLKESHFIVALKVLFKSQIEKEGLEHQLRREIEIQAHLQHPNILRLYNYFHDARRVYLILEYAPRGELYKELQKSEKLDEQRTATIIEELADALTYCHDKKVIHRDIKPE
NLLLGFRGEVKIADFGWSVHTPSLRRKTMCGTLDYLPPEMIEGRTYDEKVDLWCIGVLCYELLVGYPPFESASHSETYRRILKVDVRFPLSMPLGARDLISRLLRYQPLERLPLAQILKHPWVQAHSRRVLPPCAQMAS
>O75366 | Homo sapiens (Human). | NCBI_TaxID=9606; | 819 | Name=AVIL;
MPLTSAFRAVDNDPGIIVWRIEKMELALVPVSAHGNFYEGDCYVILSTRRVASLLSQDIHFWIGKDSSQDEQSCAAIYTTQLDDYLGGSPVQHREVQYHESDTFRGYFKQGIIYKQGGVASGMKHVETNTYDVKRLLHVKGKRNIRATEVEMSWDSFNRGDVFLLDLGKV
IIQWNGPESNSGERLKAMLLAKDIRDRERGGRAEIGVIEGDKEAASPELMKVLQDTLGRRSIIKPTVPDEIIDQKQKSTIMLYHISDSAGQLAVTEVATRPLVQDLLNHDDCYILDQSGTKIYVWKGKGATKAEKQAAMSKALGFIKMKSYPSSTNVETVNDGAESAMFK
QLFQKWSVKDQTMGLGKTFSIGKIAKVFQDKFDVTLLHTKPEVAAQERMVDDGNGKVEVWRIENLELVPVEYQWYGFFYGGDCYLVLYTYEVNGKPHHILYIWQGRHASQDELAASAYQAVEVDRQFDGAAVQVRVRMGTEPRHFMAIFKGKLVIFEGGTSRKGNAEPDP
PVRLFQIHGNDKSNTKAVEVPAFASSLNSNDVFLLRTQAEHYLWYGKGSSGDERAMAKELASLLCDGSENTVAEGQEPAEFWDLLGGKTPYANDKRLQQEILDVQSRLFECSNKTGQFVVTEITDFTQDDLNPTDVMLLDTWDQVFLWIGAEANATEKESALATAQQYLH
THPSGRDPDTPILIIKQGFEPPIFTGWFLAWDPNIWSAGKTYEQLKEELGDAAAIMRITADMKNATLSLNSNDSEPKYYPIAVLLKNQNQELPEDVNPAKKENYLSEQDFVSVFGITRGQFAALPGWKQLQMKKEKGLF
>Q9UPA5 | Homo sapiens (Human). | NCBI_TaxID=9606; | 3926 | Name=BSN; Synonyms=KIAA0434, ZNF231;
MGNEVSLEGGAGDGPLPPGGAGPGPGPGPGPGAGKPPSAPAGGGQLPAAGAARSTAVPPVPGPGPGPGPGPGPGSTSRRLDPKEPLGNQRAASPTPKQASATTPGHESPRETRAQGPAGQEADGPRRTLQVDSRTQRSGRSPSVSPDRGSTPTSPYSVPQIAPLPSSTLC
PICKTSDLTSTPSQPNFNTCTQCHNKVCNQCGFNPNPHLTQVKEWLCLNCQMQRALGMDMTTAPRSKSQQQLHSPALSPAHSPAKQPLGKPDQERSRGPGGPQPGSRQAETARATSVPGPAQAAAPPEVGRVSPQPPQPTKPSTAEPRPPAGEAPAKSATAVPAGLGATE
QTQEGLTGKLFGLGASLLTQASTLMSVQPEADTQGQPAPSKGTPKIVFNDASKEAGPKPLGSGPGPGPAPGAKTEPGARMGPGSGPGALPKTGGTTSPKHGRAEHQAASKAAAKPKTMPKERAICPLCQAELNVGSKSPANYNTCTTCRLQVCNLCGFNPTPHLVEKTEW
LCLNCQTKRLLEGSLGEPTPLPPPTSQQPPVGAPHRASGTSPLKQKGPQGLGQPSGPLPAKASPLSTKASPLPSKASPQAKPLRASEPSKTPSSVQEKKTRVPTKAEPMPKPPPETTPTPATPKVKSGVRRAEPATPVVKAVPEAPKGGEAEDLVGKPYSQDASRSPQSL
SDTGYSSDGISSSQSEITGVVQQEVEQLDSAGVTGPHPPSPSEIHKVGSSMRPLLQAQGLAPSERSKPLSSGTGEEQKQRPHSLSITPEAFDSDEELEDILEEDEDSAEWRRRREQQDTAESSDDFGSQLRHDYVEDSSEGGLSPLPPQPPARAAELTDEDFMRRQILEM
SAEEDNLEEDDTATSGRGLAKHGTQKGGPRPRPEPSQEPAALPKRRLPHNATTGYEELLPEGGSAEATDGSGTLQGGLRRFKTIELNSTGSYGHELDLGQGPDPSLDREPELEMESLTGSPEDRSRGEHSSTLPASTPSYTSGTSPTSLSSLEEDSDSSPSRRQRLEEAK
QQRKARHRSHGPLLPTIEDSSEEEELREEEELLREQEKMREVEQQRIRSTARKTRRDKEELRAQRRRERSKTPPSNLSPIEDASPTEELRQAAEMEELHRSSCSEYSPSPSLDSEAEALDGGPSRLYKSGSEYNLPTFMSLYSPTETPSGSSTTPSSGRPLKSAEEAYEE
MMRKAELLQRQQGQAAGARGPHGGPSQPTGPRGLGSFEYQDTTDREYGQAAQPAAEGTPASLGAAVYEEILQTSQSIVRMRQASSRDLAFAEDKKKEKQFLNAESAYMDPMKQNGGPLTPGTSPTQLAAPVSFSTPTSSDSSGGRVIPDVRVTQHFAKETQDPLKLHSSP
ASPSSASKEIGMPFSQGPGTPATTAVAPCPAGLPRGYMTPASPAGSERSPSPSSTAHSYGHSPTTANYGSQTEDLPQAPSGLAAAGRAAREKPLSASDGEGGTPQPSRAYSYFASSSPPLSPSSPSESPTFSPGKMGPRATAEFSTQTPSPAPASDMPRSPGAPTPSPMV
AQGTQTPHRPSTPRLVWQESSQEAPFMVITLASDASSQTRMVHASASTSPLCSPTETQPTTHGYSQTTPPSVSQLPPEPPGPPGFPRVPSAGADGPLALYGWGALPAENISLCRISSVPGTSRVEPGPRTPGTAVVDLRTAVKPTPIILTDQGMDLTSLAVEARKYGLAL
DPIPGRQSTAVQPLVINLNAQEHTFLATATTVSITMASSVFMAQQKQPVVYGDPYQSRLDFGQGGGSPVCLAQVKQVEQAVQTAPYRSGPRGRPREAKFARYNLPNQVAPLARRDVLITQMGTAQSIGLKPGPVPEPGAEPHRATPAELRSHALPGARKPHTVVVQMGEG
TAGTVTTLLPEEPAGALDLTGMRPESQLACCDMVYKLPFGSSCTGTFHPAPSVPEKSMADAAPPGQSSSPFYGPRDPEPPEPPTYRAQGVVGPGPHEEQRPYPQGLPGRLYSSMSDTNLAEAGLNYHAQRIGQLFQGPGRDSAMDLSSLKHSYSLGFADGRYLGQGLQYG
SVTDLRHPTDLLAHPLPMRRYSSVSNIYSDHRYGPRGDAVGFQEASLAQYSATTAREISRMCAALNSMDQYGGRHGSGGGGPDLVQYQPQHGPGLSAPQSLVPLRPGLLGNPTFPEGHPSPGNLAQYGPAAGQGTAVRQLLPSTATVRAADGMIYSTINTPIAATLPITT
QPASVLRPMVRGGMYRPYASGGITAVPLTSLTRVPMIAPRVPLGPTGLYRYPAPSRFPIASSVPPAEGPVYLGKPAAAKAPGAGGPSRPEMPVGAAREEPLPTTTPAAIKEAAGAPAPAPLAGQKPPADAAPGGGSGALSRPGFEKEEASQEERQRKQQEQLLQLERERV
ELEKLRQLRLQEELERERVELQRHREEEQLLVQRELQELQTIKHHVLQQQQEERQAQFALQREQLAQQRLQLEQIQQLQQQLQQQLEEQKQRQKAPFPAACEAPGRGPPLAAAELAQNGQYWPPLTHAAFIAMAGPEGLGQPREPVLHRGLPSSASDMSLQTEEQWEASR
SGIKKRHSMPRLRDACELESGTEPCVVRRIADSSVQTDDEDGESRYLLSRRRRARRSADCSVQTDDEDSAEWEQPVRRRRSRLPRHSDSGSDSKHDATASSSSAAATVRAMSSVGIQTISDCSVQTEPDQLPRVSPAIHITAATDPKVEIVRYISAPEKTGRGESLACQT
EPDGQAQGVAGPQLVGPTAISPYLPGIQIVTPGPLGRFEKKKPDPLEIGYQAHLPPESLSQLVSRQPPKSPQVLYSPVSPLSPHRLLDTSFASSERLNKAHVSPQKHFTADSALRQQTLPRPMKTLQRSLSDPKPLSPTAEESAKERFSLYQHQGGLGSQVSALPPNSLV
RKVKRTLPSPPPEEAHLPLAGQASPQLYAASLLQRGLTGPTTVPATKASLLRELDRDLRLVEHESTKLRKKQAELDEEEKEIDAKLKYLELGITQRKESLAKDRGGRDYPPLRGLGEHRDYLSDSELNQLRLQGCTTPAGQFVDFPATAAAPATPSGPTAFQQPRFQPPA
PQYSAGSGGPTQNGFPAHQAPTYPGPSTYPAPAFPPGASYPAEPGLPNQQAFRPTGHYAGQTPMPTTQSTLFPVPADSRAPLQKPRQTSLADLEQKVPTNYEVIASPVVPMSSAPSETSYSGPAVSSGYEQGKVPEVPRAGDRGSVSQSPAPTYPSDSHYTSLEQNVPRN
YVMIDDISELTKDSTSTAPDSQRLEPLGPGSSGRPGKEPGEPGVLDGPTLPCCYARGEEESEEDSYDPRGKGGHLRSMESNGRPASTHYYGDSDYRHGARVEKYGPGPMGPKHPSKSLAPAAISSKRSKHRKQGMEQKISKFSPIEEAKDVESDLASYPPPAVSSSLVSR
GRKFQDEITYGLKKNVYEQQKYYGMSSRDAVEDDRIYGGSSRSRAPSAYSGEKLSSHDFSGWGKGYEREREAVERLQKAGPKPSSLSMAHSRVRPPMRSQASEEESPVSPLGRPRPAGGPLPPGGDTCPQFCSSHSMPDVQEHVKDGPRAHAYKREEGYILDDSHCVVSD
SEAYHLGQEETDWFDKPRDARSDRFRHHGGHAVSSSSQKRGPARHSYHDYDEPPEEGLWPHDEGGPGRHASAKEHRHGDHGRHSGRHTGEEPGRRAAKPHARDLGRHEARPHSQPSSAPAMPKKGQPGYPSSAEYSQPSRASSAYHHASDSKKGSRQAHSGPAALQSKAE
PQAQPQLQGRQAAPGPQQSQSPSSRQIPSGAASRQPQTQQQQQGLGLQPPQQALTQARLQQQSQPTTRGSAPAASQPAGKPQPGPSTATGPQPAGPPRAEQTNGSKGTAKAPQQGRAPQAQPAPGPGPAGVKAGARPGGTPGAPAGQPGADGESVFSKILPGGAAEQAGK
LTEAVSAFGKKFSSFW
>Q9NSI6 | Homo sapiens (Human). | NCBI_TaxID=9606; | 2320 | Name=BRWD1; Synonyms=C21orf107, WDR9;
MAEPSSARRPVPLIESELYFLIARYLSAGPCRRAAQVLVQELEQYQLLPKRLDWEGNEHNRSYEELVLSNKHVAPDHLLQICQRIGPMLDKEIPPSISRVTSLLGAGRQSLLRTAKDCRHTVWKGSAFAALHRGRPPEMPVNYGSPPNLVEIHRGKQLTGCSTFSTAFPG
TMYQHIKMHRRILGHLSAVYCVAFDRTGHRIFTGSDDCLVKIWSTHNGRLLSTLRGHSAEISDMAVNYENTMIAAGSCDKIIRVWCLRTCAPVAVLQGHTGSITSLQFSPMAKGSQRYMVSTGADGTVCFWQWDLESLKFSPRPLKFTEKPRPGVQMLCSSFSVGGMFLA
TGSTDHVIRMYFLGFEAPEKIAELESHTDKVDSIQFCNNGDRFLSGSRDGTARIWRFEQLEWRSILLDMATRISGDLSSEEERFMKPKVTMIAWNQNDSIVVTAVNDHVLKVWNSYTGQLLHNLMGHADEVFVLETHPFDSRIMLSAGHDGSIFIWDITKGTKMKHYFNM
IEGQGHGAVFDCKFSQDGQHFACTDSHGHLLIFGFGCSKPYEKIPDQMFFHTDYRPLIRDSNNYVLDEQTQQAPHLMPPPFLVDVDGNPHPTKYQRLVPGRENSADEHLIPQLGYVATSDGEVIEQIISLQTNDNDERSPESSILDGMIRQLQQQQDQRMGADQDTIPRG
LSNGEETPRRGFRRLSLDIQSPPNIGLRRSGQVEGVRQMHQNAPRSQIATERDLQAWKRRVVVPEVPLGIFRKLEDFRLEKGEEERNLYIIGRKRKTLQLSHKSDSVVLVSQSRQRTCRRKYPNYGRRNRSWRELSSGNESSSSVRHETSCDQSEGSGSSEEDEWRSDRK
SESYSESSSDSSSRYSDWTADAGINLQPPLRTSCRRRITRFCSSSEDEISTENLSPPKRRRKRKKENKPKKENLRRMTPAELANMEHLYEFHPPVWITDTTLRKSPFVPQMGDEVIYFRQGHEAYIEAVRRNNIYELNPNKEPWRKMDLRDQELVKIVGIRYEVGPPTLC
CLKLAFIDPATGKLMDKSFSIRYHDMPDVIDFLVLRQFYDEARQRNWQSCDRFRSIIDDAWWFGTVLSQEPYQPQYPDSHFQCYIVRWDNTEIEKLSPWDMEPIPDNVDPPEELGASISVTTDELEKLLYKPQAGEWGQKSRDEECDRIISGIDQLLNLDIAAAFAGPVD
LCTYPKYCTVVAYPTDLYTIRMRLVNRFYRRLSALVWEVRYIEHNARTFNEPESVIARSAKKITDQLLKFIKNQHCTNISELSNTSENDEQNAEDLDDSDLPKTSSGRRRVHDGKKSIRATNYVESNWKKQCKELVNLIFQCEDSEPFRQPVDLVEYPDYRDIIDTPMDF
GTVRETLDAGNYDSPLEFCKDIRLIFSNAKAYTPNKRSKIYSMTLRLSALFEEKMKKISSDFKIGQKFNEKLRRSQRFKQRQNCKGDSQPNKSIRNLKPKRLKSQTKIIPELVGSPTQSTSSRTAYLGTHKTSAGISSGVTSGDSSDSAESSERRKRNRPITNGSTLSES
EVEDSLATSLSSSASSSSEESKESSRARESSSRSGLSRSSNLRVTRTRAAQRKTGPVSLANGCGRKATRKRVYLSDSDNNSLETGEILKARAGNNRKVLRKCAAVAANKIKLMSDVEENSSSESVCSGRKLPHRNASAVARKKLLHNSEDEQSLKSEIEEEELKDENQPL
PVSSSHTAQSNVDESENRDSESESDLRVARKNWHANGYKSHTPAPSKTKFLKIESSEEDSKSHDSDHACNRTAGPSTSVQKLKAESISEEADSEPGRSGGRKYNTFHKNASFFKKTKILSDSEDSESEEQDREDGKCHKMEMNPISGNLNCDPIAMSQCSSDHGCETDLD
SDDDKIEKPNNFMKDSASQDNGLSRKISRKRVCSSDSDSSLQVVKKSSKARTGLLRITRRCAATAANKIKLMSDVEDVSLENVHTRSKNGRKKPLHLACTTAKKKLSDCEGSVHCEVPSEQYACEGKPPDPDSEGSTKVLSQALNGDSDSEDMLNSEHKHRHTNIHKIDA
PSKRKSSSVTSSGEDSKSHIPGSETDRTFSSESTLAQKATAENNFEVELNYGLRRWNGRRLRTYGKAPFSKTKVIHDSQETAEKEVKRKRSHPELENVKISETTGNSKFRPDTSSKSSDLGSVTESDIDCTDNTKTKRRKTKGKAKVVRKEFVPRDREPNTKVRTCMHNQ
KDAVQMPSETLKAKMVPEKVPRRCATVAANKIKIMSNLKETISGPENVWIRKSSRKLPHRNASAAAKKKLLNVYKEDDTTINSESEKELEDINRKMLFLRGFRSWKENAQ
>Q96KE9 | Homo sapiens (Human). | NCBI_TaxID=9606; | 485 | Name=BTBD6; Synonyms=BDPL;
MAAELYAPASAAAADLANSNAGAAVGRKAGPRSPPSAPAPAPPPPAPAPPTLGNNHQESPGWRCCRPTLRERNALMFNNELMADVHFVVGPPGATRTVPAHKYVLAVGSSVFYAMFYGDLAEVKSEIHIPDVEPAAFLILLKYMYSDEIDLEADTVLATLYAAKKYIVPALAKACVNFLETSLEAKNACVLLSQSRLFEEPELTQRCWEVIDAQAEMALRSEGFCEIDRQTLEIIVTREALNTKEAVVFEAVLNWAEAECKRQGLPITPRNKRHVLGRALYLVRIPTMTLEEFANGAAQSDILTLEETHSIFLWYTATNKPRLDFPLTKRKGLAPQRCHRFQSSAYRSNQWRYRGRCDSIQFAVDRRVFIAGLGLYGSSSGKAEYSVKIELKRLGVVLAQNLTKFMSDGSSNTFPVWFEHPVQVEQDTFYTASAVLDGSELSYFGQEGMTEVQCGKVAFQFQCSSDSTNGTGVQGGQIPELIFYA
>P0C7T9 | Homo sapiens (Human). | NCBI_TaxID=9606; | 278 | Name=BZW1L1;
MENSERNKLAMLTGVLLANGTLNASILNSLYNENLVKEGVSAAFAVKLFKSWINEKDINAVAASLRKVSMDNRLMELFPANKQSVEHFTKYFTEAGLKELSEYVRNQQTIGARKELQKELQEQMSRGDPFKDIILYVKEEMKKNNIPEPVVIGIVWSSVMSTVEWNKKEELVAEQAIKHLKQYSPLLAAFTTQGQSELTLLLKIQEYCYDNIHFMKAFQKIVVLFYKAEVLSEEPILKWYKDAHVAKGKSVFLEQMKKFVEWLKNAEEESESEAEEGD
>Q8IYA2 | Homo sapiens (Human). | NCBI_TaxID=9606; | 1237 | Name=CCDC144C;
MVSWGGEKRGGAEGSPKPAVYATRKTGSVRSQEDQWYLGYPGDQWSSGFSYSWWKNSVGSESKHGEGALDQPQHDVRLEDLGELHRAARSGDVPGVEHVLVPGDTGVDKRDRKKSIQQLVPEYKEKQTPESLPQNNNPDWHPTNLTLSDETCQRSKNLKVDDKCPSVSPSMPENQSATKELGQMNLTEREKMDTGVKTSQEPEMAKDCDREDIPIYPVLPHVQKSEEMRIEQGKLEWKNQLKLVINELKQRFGEIYEKYKIPACPEEEPLLDNSTRGTDVKDIPFNLTNNIPGCEEEDASEISVSVVFETFPEQKEPSLKNIIHSYYHPYSGSQEHVCQSSSKLHLHENKLDCDNDNKPGIGHIFSTDKNFHNDASTKKARNPEVVTVEMKEDQEFDLQMTKNMNQNSDSGSTNNYKSLKPKLENLSSLPPDSDRTSEVYLHEELQQDMQKFKNEVNTLEEEFLALKKENVQLHKEVEEEMEKHRSNSTELSGTLTDGTTVGNDDDGLNQQIPRKENGEHDRLALKQENEEKRNADMLYNKDSEQLRIKEEECGKVVETKQQLKWNLRRLVKELRTVVQERNDAQKQLSEEQDARILQDQILTSKQKELEMAQKKRNPEISHRHQKEKDLFHENCMLQEEIALLRLEIDTIKNQNKQKEKKYFEDIEVVKEKNDNLQKIIKRNEETLTETILQYSGQLNNLTAENKMLNSELENGKENQERLEIEMESYRCRLAAAVHDCDQSQTARDLKLDFQRTRQEWVRLHDKMKVDMSGLQAKNEILSEKLSNAESKINSLQIQLHNTRDALGRESLILERVQRDLSQTQCQKKETEQMYQSKLKKYIAKQESVEERLSQLQSENMLLRQQLDDVHKKANSQEKTISTIQDQFHSAAKNLQAESEKQILSLQEKNKELMDEYNHLKERMDQCEKEKAGRKIDLTEAQETVPSRCLHLDAENEVLQLQQTLFSMKAIQKQCETLQKNKKQLKQEVVNLKSYMERNMLERGEAEWHKLLIEERARKEIEEKLNEAILTLQKQAAVSHEQLAQLREDNTTSIKTQMELTVIDLESEISRIKTSQADFNKTKLERYKELYLEEVKVRESLSNELSRTNEMIAEVSTQLTVEKEQTRSRSLFTAYATRPVLESPCVGNLNDSEGLNRKHIPRKKRSALKDMESYLLKMQQKLQNDLTAEVAGSSQTGLHRIPQCSSFSSSSLHLLLCSICQPFFLILQLLLNMNLDPI
>A7MBM2 | Homo sapiens (Human). | NCBI_TaxID=9606; | 1401 | Name=DISP2; Synonyms=DISPB, KIAA1742;
MDGDSSSSSGGSGPAPGPGPEGEQRPEGEPLAPDGGSPDSTQTKAVPPEASPERSCSLHSCPLEDPSSSSGPPPTTSTLQPVGPSSPLAPAHFTYPRALQEYQGGSSLPGLGDRAALCSHGSSLSPSPAPSQRDGTWKPPAVQHHVVSVRQERAFQMPKSYSQLIAEWPVAVLMLCLAVIFLCTLAGLLGARLPDFSKPLLGFEPRDTDIGSKLVVWRALQALTGPRKLLFLSPDLELNSSSSHNTLRPAPRGSAQESAVRPRRMVEPLEDRRQENFFCGPPEKSYAKLVFMSTSSGSLWNLHAIHSMCRMEQDQIRSHTSFGALCQRTAANQCCPSWSLGNYLAVLSNRSSCLDTTQADAARTLALLRTCALYYHSGALVPSCLGPGQNKSPRCAQVPTKCSQSSAIYQLLHFLLDRDFLSPQTTDYQVPSLKYSLLFLPTPKGASLMDIYLDRLATPWGLADNYTSVTGMDLGLKQELLRHFLVQDTVYPLLALVAIFFGMALYLRSLFLTLMVLLGVLGSLLVAFFLYQVAFRMAYFPFVNLAALLLLSSVCANHTLIFFDLWRLSKSQLPSGGLAQRVGRTMHHFGYLLLVSGLTTSAAFYASYLSRLPAVRCLALFMGTAVLVHLALTLVWLPASAVLHERYLARGCARRARGRWEGSAPRRLLLALHRRLRGLRRAAAGTSRLLFQRLLPCGVIKFRYIWICWFAALAAGGAYIAGVSPRLRLPTLPPPGGQVFRPSHPFERFDAEYRQLFLFEQLPQGEGGHMPVVLVWGVLPVDTGDPLDPRSNSSLVRDPAFSASGPEAQRWLLALCHRARNQSFFDTLQEGWPTLCFVETLQRWMESPSCARLGPDLCCGHSDFPWAPQFFLHCLKMMALEQGPDGTQDLGLRFDAHGSLAALVLQFQTNFRNSPDYNQTQLFYNEVSHWLAAELGMAPPGLRRGWFTSRLELYSLQHSLSTEPAVVLGLALALAFATLLLGTWNVPLSLFSVAAVAGTVLLTVGLLVLLEWQLNTAEALFLSASVGLSVDFTVNYCISYHLCPHPDRLSRVAFSLRQTSCATAVGAAALFAAGVLMLPATVLLYRKLGIILMMVKCVSCGFASFFFQSLCCFFGPEKNCGQILWPCAHLPWDAGTGDPGGEKAGRPRPGSVGGMPGSCSEQYELQPLARRRSPSFDTSTATSKLSHRPSVLSEDLQLHDGPCCSRPPPAPASPRELLLDHQAVFSQCPALQTSSPYKQAGPSPKTRARQDSQGEEAEPLPASPEAPAHSPKAKAADPPDGFCSSASTLEGLSVSDETCLSTSEPSARVPDSVGVSPDDLDDTGQPVLERGQLNGKRDTLWLALRETVYDPSLPASHHSSLSWKGRGGPGDGSPVVLPNSQPDLPDVWLRRPSTHTSGYSS
>Q96HU8 | Homo sapiens (Human). | NCBI_TaxID=9606; | 199 | Name=DIRAS2;
MPEQSNDYRVAVFGAGGVGKSSLVLRFVKGTFRESYIPTVEDTYRQVISCDKSICTLQITDTTGSHQFPAMQRLSISKGHAFILVYSITSRQSLEELKPIYEQICEIKGDVESIPIMLVGNKCDESPSREVQSSEAEALARTWKCAFMETSAKLNHNVKELFQELLNLEKRRTVSLQIDGKKSKQQKRKEKLKGKCVIM
>Q8N4W6 | Homo sapiens (Human). | NCBI_TaxID=9606; | 341 | Name=DNAJC22;
MAKGLLVTYALWAVGGPAGLHHLYLGRDSHALLWMLTLGGGGLGWLWEFWKLPSFVAQANRAQGQRQSPRGVTPPLSPIRFAAQVIVGIYFGLVALISLSSMVNFYIVALPLAVGLGVLLVAAVGNQTSDFKNTLGSAFLTSPIFYGRPIAILPISVAASITAQRHRRYKALVASEPLSVRLYRLGLAYLAFTGPLAYSALCNTAATLSYVAETFGSFLNWFSFFPLLGRLMEFVLLLPYRIWRLLMGETGFNSSCFQEWAKLYEFVHSFQDEKRQLAYQVLGLSEGATNEEIHRSYQELVKVWHPDHNLDQTEEAQRHFLEIQAAYEVLSQPRKPWGSRR
主要问题
我只是觉得我的原始代码虽然颠倒了,但更可靠,因为我不必告诉它打印标题的次数,它只是自行查找并仅打印唯一的标题。有没有更好的方法只打印标题的新实例,但打印随后所需序列的所有匹配项?我找不到指定仅打印唯一匹配项的方法,并且我不确定是否尝试将所有标头和匹配区域发送到哈希中(我不知道该怎么做)。
【问题讨论】:
-
这可能会让您感到惊讶,但您的问题并不清楚。通常将大量代码转储到 SO 以便有人使它变得更好是行不通的。我建议你大大简化你所拥有的,并提出一个更有针对性的问题。
-
在codereview.stackexchange.com 上发帖可能更合适,因为您的代码中有很多可以改进的地方,但不在 SO 问题的范围内。
-
@xxfelixxx 对不起,如果我不清楚 :-(。我不是要求对我的代码进行全面审查,我现在才编码一个月,我意识到它并不完美。我只是如果需要澄清我的问题,提供代码.我试图显示我想要的输出顺序以及为什么我这样做.我只是在寻找有关搜索包含许多标题的文件的建议,打印每个标题如果其以下数据包含与我的正则表达式匹配,则一次,然后在该标题下方打印所有匹配项,正如我的帖子末尾所说。
-
输入文件的 sn-p 可能有用。
-
我在 codereview 上发帖。我知道这是否是我提出问题的更合适的场所。我认为使用 while 语句与正则表达式匹配可能有一个简单的答案,只打印新匹配项(而不是重复匹配项),而不是使用 if 语句(因为这会阻止我的下一部分打印所有匹配项。跨度>
标签: regex perl header unique fasta