【发布时间】:2019-08-21 12:19:47
【问题描述】:
我正在计算序列的对数赔率分数并返回给出最高分数的主题(序列的一小部分)。我有计算文件中每个序列的最高分数的代码,但我无法存储给出该分数的主题。有关文件格式、对数赔率分数的一般计算等,请参阅我的其他帖子Perl: Creating and manipulating hash of arrays for log-odds scores of DNA sequences。我的代码如下:
use strict;
use warnings;
use List::Util 'max';
use Data::Dumper;
#USER SPECIFICATIONS
#User specifies motif width
my $width = 3;
#User enters the filename that contains the sequence data
print "Please enter the filename of the fasta sequence data: ";
my $filename1 = <STDIN>;
#Remove newline from file
chomp $filename1;
#Open the file and store each dna seq in hash
my %id2seq = ();
my %HoA = ();
my %loscore = ();
my %maxscore = ();
my %maxmot = ();
my $id = '';
open (FILE, '<', $filename1) or die "Cannot open $filename1.",$!;
my $dna;
while (<FILE>)
{
if($_ =~ /^>(.+)/)
{
$id = $1; #Stores 'Sequence 1' as the first $id, for example
}
else
{
$HoA{$id} = [ split(//) ]; #Splits the contents to allow for position reference later
$id2seq{$id} .= $_; #Creates a hash with each seq associated to an id number
$maxmot{$id} = (); #Creates empty hash to push motifs to
foreach $id (keys %HoA)
{
for my $len (0..(length($HoA{$id})-$width-1))
{
push @{ $loscore{$id} }, 0;
}
}
push @{ $maxscore{$id} }, -30; #Creates a HoA with each id number to have a maxscore (initial score -30)
}
}
close FILE;
foreach $id (keys %id2seq)
{
print "$id2seq{$id}\n\n";
}
print "\n\n";
#EXPECTATION
#Create log-odds table of motifs
my %logodds;
$logodds{'A'}[0] = 0.1;
$logodds{'A'}[1] = 0.2;
$logodds{'A'}[2] = 0.3;
$logodds{'C'}[0] = 0.2;
$logodds{'C'}[1] = 0.5;
$logodds{'C'}[2] = 0.2;
$logodds{'G'}[0] = 0.3;
$logodds{'G'}[1] = 0.2;
$logodds{'G'}[2] = 0.4;
$logodds{'T'}[0] = 0.4;
$logodds{'T'}[1] = 0.1;
$logodds{'T'}[2] = 0.1;
#MAXIMIZATION
#Determine location for each sequence that maximally
#aligns to the motif pattern
foreach $id (keys %HoA)
{
for my $pos1 (0..length($HoA{$id})-$width-1) #Look through all positions the motif can start at
{
for my $pos2 ($pos1..$pos1+($width-1)) #Define the positions within the motif (0 to width-1)
{
for my $base (qw( A C G T))
{
if ($HoA{$id}[$pos2] eq $base) #If the character matches a base:
{
for my $pos3 (0..$width-1) #Used for position reference in %logodds
{
#Calculate the log-odds score at each location
$loscore{$id}[$pos2] += $logodds{$base}[$pos3];
#Calculate the maximum log-odds score for each sequence
#Find the motif that gives the maximum score for each sequence
$maxscore{$id} = max( @{ $loscore{$id} });
if ($loscore{$id}[$pos2] == $maxscore{$id})
{
push @{ $maxmot{$id} }, $HoA{$id}[$pos3]; #NOT SURE THAT THIS IS THE CORRECT WAY TO PUSH WHAT I WANT
}
}
}
}
}
}
}
print "\n\n";
print Dumper(\%maxmot);
%maxmot 的预期输出应该是这样的:
'Sequence 11' => [ 'C', 'A', 'T'],
'Sequence 14' => ['C', 'T', 'G'], etc.
预期输出中应该有三个碱基,因为$width = 3。我得到的输出给了我每个基数的倍数,没有任何明显的顺序(或者至少,我没有注意到一个模式):
'Sequence 11' => [ 'C', 'C', 'C', 'A', 'A', 'A', 'A', 'T', 'A', 'A', 'T', 'T', 'T'],
'Sequence 14' => ['C', 'C', 'T', 'T', 'C', 'C', 'T', 'T', 'T', 'T', 'T'], etc.
我相信问题的根源在于push @{ $maxmot{$id} }, $HoA{$id}[$pos3]; 步骤,但我不太确定如何解决它。我试过推$HoA{$id}[$pos2],但这似乎也不能解决我的问题。与往常一样,感谢任何和所有帮助!如果需要,我可以澄清一下,我知道我的问题有点复杂。提前谢谢你。
【问题讨论】:
-
您希望每个阵列中的碱基顺序如何?
-
顺序取决于对数赔率得分。所以返回最佳对数赔率分数的顺序就是我想要的顺序。
-
你为什么有
for my $base (qw( A C G T)) { if ($HoA{$id}[$pos2] eq $base)?my $base = $HoA{$id}[$pos2]不会简单很多吗?
标签: perl hash push hash-of-hashes