【问题标题】:Pick up the longest peptide using perl使用 perl 提取最长的肽段
【发布时间】:2019-01-29 22:36:41
【问题描述】:

我想找出从 cds 在 6 个正向和反向帧中翻译的最长可能的蛋白质序列。

这是示例输入格式:

>111
KKKKKKKMGFSOXLKPXLLLLLLLLLLLLLLLLLMJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJX
>222
WWWMPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPMPPPPPXKKKKKK

我想找出从“M”开始到“X”停止的所有字符串,计算每个字符串的长度并选择最长的。

例如上面的例子:

脚本会找到,

>111 has two matches:
MGFSOX
MJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJX
>222 has one match:
MPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPMPPPPPX

然后计算每个匹配的长度,并打印字符串和最长匹配的数量,这是我想要的结果:

>111
MJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJX    32
>222
MPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPMPPPPPX    38

但它没有打印出任何答案。有谁知道如何修理它?任何建议都会有所帮助。

#!/usr/bin/perl -w

use strict;
use warnings;

my @pep=();
my $i=();
my @Xnum=();
my $n=();
my %hash=();
my @k=();
my $seq=();
$n=0;
open(IN, "<$ARGV[0]");
while(<IN>){
        chomp;
        if($_=~/^[^\>]/){
                @pep=split(//, $_);
                if($_ =~ /(X)/){
                        push(@Xnum, $1);
                        if($n >= 0 && $n <= $#Xnum){
                                if(@pep eq "M"){
                                        for($i=1; $i<=$#pep; $i++){
                                                $seq=join("",@pep);
                                                $hash{$i}=$seq;
                                                push(@k, $i);
                                        }
                                }
                                elsif(@pep eq "X"){
                                        $n=$n+1;
                                        }
                                foreach (sort {$a cmp $b} @k){
                                        print "$hash{$k[0]}\t$k[0]";
                                }
                        }
                }
        }
        elsif($_=~/^\>/){
                print "$_\n";
        }

}
close IN;

【问题讨论】:

  • 啊,我很乐意提供帮助,但您必须笼统地解释一下;不是每个人都了解他们的简历。请定义一个清晰的问题陈述。
  • 感谢您的提醒。我已经修改了我的问题。
  • 您介意发布一些示例输入吗?
  • 是的,请勾选第一个框

标签: perl


【解决方案1】:

看看这个 Perl 单行代码

$ cat iris.txt
>111
KKKKKKKMGFSOXLKPXLLLLLLLLLLLLLLLLLMJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJX
>222
WWWMPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPMPPPPPXKKKKKK

$ perl -ne ' if(!/^>/) { print "$p"; while(/(M[^M]+?X)/g ) { if(length($1)>length($x)) {$x=$1 }  } print "$x ". length($x)."\n";$x="" } else { $p=$_ }  ' iris.txt
>111
MJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJX 32
>222
MPPPPPX 7

$

【讨论】:

    【解决方案2】:

    有不止一种方法可以做到!

    也试试这个:

    print and next if /^>/;
    chomp and my @z = $_ =~ /(M[^X]*X)/g;
    
    my $m = "";
    for my $s (@z) {
        $m = $s if length $s > length $m
    }
    say "$m\t" . length $m
    

    输出:

    >111
    MJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJX    32
    >222
    MPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPMPPPPPX  38
    

    使用 >=5.14 并确保使用 perl -n 运行脚本


    作为单行:

    perl -E 'print and next if /^>/; chomp and my @z = $_ =~ /(M[^X]*X)/g; my $m = ""; for my $s (@z) { $m = $s if length $s > length $m } say "$m\t" . length $m' -n data.txt
    

    【讨论】:

      【解决方案3】:

      这是使用来自List::Utilreduce 的解决方案。

      编辑:错误地使用了maxstr,它给出了结果但不是所需要的。已重新编辑此帖子以改用 reduce(正确)。

      #!/usr/bin/perl
      use strict;
      use warnings;
      use List::Util qw/reduce/;
      
      open my $fh, '<', \<<EOF;
      >111
      KKKKKKKMGFSOXLKPXLLLLLLLLLLLLLLLLLMJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJX
      >222
      WWWMPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPMPPPPPXKKKKKK
      EOF
      
      my $id;
      while (<$fh>) {
          chomp;
          if (/^>/) {
              $id = $_;   
          }
          else {
              my $data = reduce {length($a) > length($b) ? $a : $b} /M[^X]*X/g;
              print "$id\n$data\t" . length($data) . "\n" if $data;
          }
      }
      

      【讨论】:

        【解决方案4】:

        这是我的看法。

        我喜欢将 fasta 文件放入哈希中,以 fasta 名称作为键。这样您就可以为其添加描述,例如基础成分等...

        #!/usr/local/ActivePerl-5.20/bin/env perl 
        use strict;
        use warnings;
        my %prot;
        
        open (my $fh, '<', '/Users/me/Desktop/fun_prot.fa') or die $!;
        my $string = do { local $/; <$fh> };
        close $fh;
        chomp $string;
        my @fasta = grep {/./} split (">", $string);
        for my $aa (@fasta){
            my ($key, $value) = split ("\n", $aa);
            $value =~ s/[A-Z]*(M.*M)[A-Z]/$1/;
            $prot{$key}->{'len'} = length($value);
            $prot{$key}->{'prot'} = $value;
            }
        for my $sequence (sort { $prot{$b}->{'len'} <=> $prot{$a}->{'len'} } keys %prot){
            print ">" . $sequence, "\n", $prot{$sequence}->{'prot'}, "\t", $prot{$sequence}->{'len'}, "\n";
            last;
        }     
        __DATA__
        >1232
        ASDFASMJJJJJMFASDFSDAFSDDFSA
        >2343
        AASFDFASMJJJJJJJJJJJJJJMRGQEGDAGDA
        

        输出

        >2343
        MJJJJJJJJJJJJJJM  16
        

        【讨论】:

        • 感谢您的概念!!抱歉这么晚才回复。实际上,我的主要目的是从 RNA-seq 数据中找出最长的异构体来进行域识别。不过好像length()可以在这里解决。
        • 相应修改。
        猜你喜欢
        • 1970-01-01
        • 2018-01-31
        • 1970-01-01
        • 2019-08-29
        • 1970-01-01
        • 2011-02-19
        • 1970-01-01
        • 1970-01-01
        • 1970-01-01
        相关资源
        最近更新 更多