【Posted】: 2021-12-06 13:06:22
【Problem description】:
I got the following result from Phobius:
ID sp|Q92673|1-2157
FT SIGNAL 1 28
FT DOMAIN 1 11 N-REGION.
FT DOMAIN 12 22 H-REGION.
FT DOMAIN 23 28 C-REGION.
FT DOMAIN 29 2135 NON CYTOPLASMIC.
FT TRANSMEM 2136 2156
FT DOMAIN 2157 2157 CYTOPLASMIC.
//
---------------------------------------------------------------------
ID sp|Q5SSG8|25-479
FT DOMAIN 1 455 NON CYTOPLASMIC.
//
---------------------------------------------------------------------
ID sp|Q92854|22-734
FT DOMAIN 1 713 NON CYTOPLASMIC.
//
---------------------------------------------------------------------
ID sp|Q9Y5E9|27-686
FT DOMAIN 1 660 NON CYTOPLASMIC.
//
---------------------------------------------------------------------
ID sp|Q9Y6N8|55-613
FT DOMAIN 1 559 NON CYTOPLASMIC.
//
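I want to print the corresponding UniProt ID at the start of every line of each result (results are separated by //).
Here is the Perl snippet I wrote: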
open (MYFILE, "result_phobius.txt" )||warn "Couldn't open file because $!"; #give input file name
open (FILE, ">output.txt"); #output file name
while (<MYFILE>)
{
if ($_=~/^ID (\S+?)\s/) #search accession number started by > and terminate at white space
{
$id=$1;
chomp ($id);
print FILE "$id\t"; #will print accession number in a column
}
if ($_=~/^FT /)
{
print FILE "$_";
}
}
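This prints the ID only on the first line of each result. It works fine for results with a single domain, but fails when a result has multiple domains.
For example: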
FT SIGNAL 1 28
FT DOMAIN 1 11 N-REGION.
FT DOMAIN 12 22 H-REGION.
FT DOMAIN 23 28 C-REGION.
FT DOMAIN 29 2135 NON CYTOPLASMIC.
FT TRANSMEM 2136 2156
FT DOMAIN 2157 2157 CYTOPLASMIC.
sp|Q5SSG8|25-479 FT DOMAIN 1 455 NON CYTOPLASMIC.
sp|Q92854|22-734 FT DOMAIN 1 713 NON CYTOPLASMIC.
sp|Q9Y5E9|27-686 FT DOMAIN 1 660 NON CYTOPLASMIC.
sp|Q9Y6N8|55-613 FT DOMAIN 1 559 NON CYTOPLASMIC.
sp|Q02763|23-748 FT DOMAIN 1 726 NON CYTOPLASMIC.
sp|Q14517|22-4181 FT DOMAIN 1 4160 NON CYTOPLASMIC.
sp|O75051|35-1237 FT DOMAIN 1 1203 NON CYTOPLASMIC.
tr|D3DPA4|1-145 FT DOMAIN 1 119 CYTOPLASMIC.
FT TRANSMEM 120 144
FT DOMAIN 145 145 NON CYTOPLASMIC.
How can I make it work for entries with multiple lines?
Expected output:
sp|Q92673|1-2157 FT SIGNAL 1 28
sp|Q92673|1-2157 FT DOMAIN 1 11 N-REGION.
sp|Q92673|1-2157 FT DOMAIN 12 22 H-REGION.
sp|Q92673|1-2157 FT DOMAIN 23 28 C-REGION.
sp|Q92673|1-2157 FT DOMAIN 29 2135 NON CYTOPLASMIC.
sp|Q92673|1-2157 FT TRANSMEM 2136 2156
sp|Q92673|1-2157 FT DOMAIN 2157 2157 CYTOPLASMIC.
sp|Q5SSG8|25-479 FT DOMAIN 1 455 NON CYTOPLASMIC.
sp|Q92854|22-734 FT DOMAIN 1 713 NON CYTOPLASMIC.
sp|Q9Y5E9|27-686 FT DOMAIN 1 660 NON CYTOPLASMIC.
sp|Q9Y6N8|55-613 FT DOMAIN 1 559 NON CYTOPLASMIC.
sp|Q02763|23-748 FT DOMAIN 1 726 NON CYTOPLASMIC.
sp|Q14517|22-4181 FT DOMAIN 1 4160 NON CYTOPLASMIC.
sp|O75051|35-1237 FT DOMAIN 1 1203 NON CYTOPLASMIC.
tr|D3DPA4|1-145 FT DOMAIN 1 119 CYTOPLASMIC.
tr|D3DPA4|1-145 FT TRANSMEM 120 144
tr|D3DPA4|1-145 FT DOMAIN 145 145 NON CYTOPLASMIC.
Thanks in advance for your help.
【Comments】:
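- This looks like a program that would be easier to maintain and more flexible if you wrote it as a Unix filter. Remove all the file opens. Read from STDIN, write to STDOUT, and invoke it like my_program.pl < result_phobius.txt > output.txt.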
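To make the filter suggestion concrete, here is a minimal sketch of what it could look like. It is not code from the original post: the helper name `prefix_features` is made up for illustration, and it assumes every record starts with an `ID` line and ends with `//`, as in the sample above. The change from the posted snippet is that the ID is remembered and printed in front of every `FT` line, rather than once when the `ID` line is seen.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper (not part of the original post): copies every FT line
# from $in to $out, prefixed with the ID of the record it belongs to.
sub prefix_features {
    my ($in, $out) = @_;
    my $id;    # ID of the record currently being read
    while (my $line = <$in>) {
        if ($line =~ /^ID\s+(\S+)/) {
            $id = $1;                       # remember the ID for the FT lines that follow
        }
        elsif ($line =~ m{^//}) {
            undef $id;                      # "//" closes the current record
        }
        elsif ($line =~ /^FT\s/ && defined $id) {
            print {$out} "$id\t$line";      # prefix every feature line, not just the first
        }
    }
    return;
}

# Act as a Unix filter when run directly: STDIN in, STDOUT out.
unless (caller) {
    prefix_features(\*STDIN, \*STDOUT);
}
```

Saved as my_program.pl, it would be called as in the comment above: perl my_program.pl < result_phobius.txt > output.txt. Passing the filehandles in as arguments is only a convenience so the same routine can be run against in-memory data when checking it.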
Tags: regex perl sequence text-parsing