Perl：将字符推送到数组的散列答案

【问题标题】：Perl: push character to hash of arrayPerl：将字符推送到数组的散列
【发布时间】：2019-08-21 12:19:47
【问题描述】：

我正在计算序列的对数赔率分数并返回给出最高分数的主题（序列的一小部分）。我有计算文件中每个序列的最高分数的代码，但我无法存储给出该分数的主题。有关文件格式、对数赔率分数的一般计算等，请参阅我的其他帖子Perl: Creating and manipulating hash of arrays for log-odds scores of DNA sequences。我的代码如下：

use strict;
use warnings;
use List::Util 'max';
use Data::Dumper; 

#USER SPECIFICATIONS
#User specifies motif width
my $width = 3;

#User enters the filename that contains the sequence data
print "Please enter the filename of the fasta sequence data: ";
my $filename1 = <STDIN>;

#Remove newline from file
chomp $filename1;

#Open the file and store each dna seq in hash
my %id2seq = ();
my %HoA = ();
my %loscore = ();
my %maxscore = ();
my %maxmot = ();
my $id = '';
open (FILE, '<', $filename1) or die "Cannot open $filename1.",$!;
my $dna;
while (<FILE>)
{
    if($_ =~ /^>(.+)/)
    {
        $id = $1; #Stores 'Sequence 1' as the first $id, for example
    }
    else
    {
        $HoA{$id} = [ split(//) ]; #Splits the contents to allow for position reference later
        $id2seq{$id} .= $_; #Creates a hash with each seq associated to an id number
        $maxmot{$id} = (); #Creates empty hash to push motifs to
        foreach $id (keys %HoA)
        {
            for my $len (0..(length($HoA{$id})-$width-1))
            {
                push @{ $loscore{$id} }, 0;
            }
        }
        push @{ $maxscore{$id} }, -30; #Creates a HoA with each id number to have a maxscore (initial score -30)
    }
}
close FILE;

foreach $id (keys %id2seq)
{
    print "$id2seq{$id}\n\n";
}
print "\n\n";



#EXPECTATION
#Create log-odds table of motifs
my %logodds;
$logodds{'A'}[0] = 0.1;
$logodds{'A'}[1] = 0.2;
$logodds{'A'}[2] = 0.3;
$logodds{'C'}[0] = 0.2;
$logodds{'C'}[1] = 0.5;
$logodds{'C'}[2] = 0.2;
$logodds{'G'}[0] = 0.3;
$logodds{'G'}[1] = 0.2;
$logodds{'G'}[2] = 0.4;
$logodds{'T'}[0] = 0.4;
$logodds{'T'}[1] = 0.1;
$logodds{'T'}[2] = 0.1;

#MAXIMIZATION
#Determine location for each sequence that maximally
#aligns to the motif pattern

foreach $id (keys %HoA)
{   
    for my $pos1 (0..length($HoA{$id})-$width-1)    #Look through all positions the motif can start at
    {
        for my $pos2 ($pos1..$pos1+($width-1))  #Define the positions within the motif (0 to width-1)
        {           
            for my $base (qw( A C G T))
            {
                if ($HoA{$id}[$pos2] eq $base)  #If the character matches a base:
                {
                    for my $pos3 (0..$width-1)  #Used for position reference in %logodds
                    {
                        #Calculate the log-odds score at each location
                        $loscore{$id}[$pos2] += $logodds{$base}[$pos3];

                        #Calculate the maximum log-odds score for each sequence

                        #Find the motif that gives the maximum score for each sequence
                        $maxscore{$id} = max( @{ $loscore{$id} });
                        if ($loscore{$id}[$pos2] == $maxscore{$id})
                        {
                            push @{ $maxmot{$id} }, $HoA{$id}[$pos3]; #NOT SURE THAT THIS IS THE CORRECT WAY TO PUSH WHAT I WANT
                        }
                    }
                }
            }
        }
    }
}

print "\n\n";
print Dumper(\%maxmot);

%maxmot 的预期输出应该是这样的：

'Sequence 11' => [ 'C', 'A', 'T'], 'Sequence 14' => ['C', 'T', 'G'], etc.

预期输出中应该有三个碱基，因为$width = 3。我得到的输出给了我每个基数的倍数，没有任何明显的顺序（或者至少，我没有注意到一个模式）：

'Sequence 11' => [ 'C', 'C', 'C', 'A', 'A', 'A', 'A', 'T', 'A', 'A', 'T', 'T', 'T'], 'Sequence 14' => ['C', 'C', 'T', 'T', 'C', 'C', 'T', 'T', 'T', 'T', 'T'], etc. 我相信问题的根源在于push @{ $maxmot{$id} }, $HoA{$id}[$pos3]; 步骤，但我不太确定如何解决它。我试过推$HoA{$id}[$pos2]，但这似乎也不能解决我的问题。与往常一样，感谢任何和所有帮助！如果需要，我可以澄清一下，我知道我的问题有点复杂。提前谢谢你。

【问题讨论】：

您希望每个阵列中的碱基顺序如何？
顺序取决于对数赔率得分。所以返回最佳对数赔率分数的顺序就是我想要的顺序。
你为什么有for my $base (qw( A C G T)) { if ($HoA{$id}[$pos2] eq $base)？ my $base = $HoA{$id}[$pos2] 不会简单很多吗？

标签： perl hash push hash-of-hashes

【解决方案1】：

push() 不是问题所在。通过运行您的代码，很明显条件 $loscore{$id}[$pos2] == $maxscore{$id} 是 true 的频率比您预期的要高。

以下是我会在代码审查中提出的一些问题：

为什么在阵列上使用length()？它不返回数组的长度。
for my $base (qw( A C G T)) { if ($HoA{$id}[$pos2] eq $base) {... 不只是等效my $base = $HoA{$id}[$pos2]; 的低效方式吗？
每个$pos2 的计算被执行$pos2 + 1 次，即0 一次，1 两次，...即后面的位置获得更高的分数。
$loscore{$id}[$pos2] 的一个计算是 @{ $logodds{$base} } 的总和，即计算时忽略位置 $pos2 + $pos3 处的基数
您在遍历序列时重新计算$maxscore{$id}并然后在条件中使用变化的值
（我的猜测）一个主题应该是$width 基数，但您的代码仅将单个基数存储到%maxmot

我做出有根据的猜测，并提出以下是正确的算法。我正在使用您在上一个问题中给出的 3 个序列。我还转储了其他 2 个哈希，以便计算结果可见。

我冒昧地重写了您的代码，使其更加简洁明了。但是您应该能够将新代码中的行追溯到您的原始代码。

#!/usr/bin/perl
use strict;
use warnings;

use List::Util 'max';
use Data::Dumper;

my $width = 3;

my %HoA;
my %maxpos;
my %loscore;
my $id = '';
while (<DATA>) {
    if (/^>(.+)/) {
        $id = $1;
    } else {
        $HoA{$id}     = [ split(//) ];
        $maxpos{$id}  = @{ $HoA{$id} } - $width - 1;
        $loscore{$id} = [ (0) x ($maxpos{$id} + 1) ];
    }
}

my %logodds = (
    A => [0.1, 0.2, 0.3],
    C => [0.2, 0.5, 0.2],
    G => [0.3, 0.2, 0.4],
    T => [0.4, 0.1, 0.1],
);

#MAXIMIZATION
my %maxscore;
my %maxmot;

# Calculate the log-odds score at each location
foreach $id (keys %HoA) {
    for my $index (0..$maxpos{$id}) {
        for my $offset (0..$width-1) {
            # look at base in sequence $id at $offset after $index
            my $base = $HoA{$id}[$index + $offset];
            $loscore{$id}[$index] += $logodds{$base}[$offset];
        }
    }
}

# Calculate the maximum log-odds score for each sequence
foreach $id (keys %HoA) {
    $maxscore{$id} = max( @{ $loscore{$id} });
}

# Find the motif that gives the maximum score for each sequence
foreach $id (keys %HoA) {
    for my $index (0..$maxpos{$id}) {
        if ($loscore{$id}[$index] == $maxscore{$id}) {
            # motif of length $width
            my $motif = join('', @{ $HoA{$id} }[$index..$index + $width - 1]);
            $maxmot{$id}->{$motif}++;
        }
    }
}

print Data::Dumper->Dump([\%loscore, \%maxscore, \%maxmot],
                         [qw(*loscore *maxscore *maxmot)]);

exit 0;

__DATA__
>Sequence_1
TCAGAACCAGTTATAAATTTATCATTTCCTTCTCCACTCCT
>Sequence_2
CCCACGCAGCCGCCCTCCTCCCCGGTCACTGACTGGTCCTG
>Sequence_3
TCGACCCTCTGGAACCTATCAGGGACCACAGTCAGCCAGGCAAG

试运行：

$ perl dummy.pl
%loscore = (
             'Sequence_1' => [
                               '1.2',
                               '0.8',
                               '0.6',
                               '0.8',
                               '0.5',
                               '0.8',
                               '1',
                               '0.8',
                               '0.4',
                               '0.5',
                               '0.8',
                               '0.7',
                               '0.5',
                               '0.9',
                               '0.6',
                               '0.4',
                               '0.3',
                               '0.6',
                               '0.8',
                               '0.7',
                               '0.4',
                               '1.2',
                               '0.5',
                               '0.3',
                               '0.6',
                               '0.7',
                               '1.1',
                               '0.8',
                               '0.4',
                               '0.7',
                               '1',
                               '0.5',
                               '1.1',
                               '1',
                               '0.6',
                               '0.7',
                               '0.5',
                               '1.1',
                               '0.8'
                             ],
             'Sequence_2' => [
                               '0.9',
                               '1',
                               '0.6',
                               '1',
                               '0.6',
                               '1.1',
                               '0.8',
                               '0.5',
                               '1',
                               '1.1',
                               '0.6',
                               '1',
                               '0.9',
                               '0.8',
                               '0.5',
                               '1.1',
                               '0.8',
                               '0.5',
                               '1.1',
                               '0.9',
                               '0.9',
                               '1.1',
                               '0.8',
                               '0.6',
                               '0.6',
                               '1.2',
                               '0.6',
                               '0.7',
                               '0.7',
                               '0.9',
                               '0.7',
                               '0.7',
                               '0.7',
                               '1',
                               '0.6',
                               '0.6',
                               '1.1',
                               '0.8',
                               '0.7'
                             ],
             'Sequence_3' => [
                               '1.3',
                               '0.7',
                               '0.7',
                               '0.8',
                               '0.9',
                               '0.8',
                               '0.5',
                               '1',
                               '0.7',
                               '1',
                               '0.8',
                               '0.8',
                               '0.5',
                               '0.8',
                               '0.8',
                               '0.6',
                               '0.7',
                               '0.4',
                               '1.2',
                               '0.8',
                               '0.7',
                               '0.9',
                               '0.8',
                               '0.7',
                               '0.8',
                               '1',
                               '0.6',
                               '0.9',
                               '0.8',
                               '0.4',
                               '0.6',
                               '1.2',
                               '0.8',
                               '0.5',
                               '1',
                               '1',
                               '0.8',
                               '0.7',
                               '0.7',
                               '1.1',
                               '0.7',
                               '0.7'
                             ]
           );
%maxscore = (
              'Sequence_1' => '1.2',
              'Sequence_3' => '1.3',
              'Sequence_2' => '1.2'
            );
%maxmot = (
            'Sequence_3' => {
                              'TCG' => 1
                            },
            'Sequence_2' => {
                              'TCA' => 1
                            },
            'Sequence_1' => {
                              'TCA' => 2
                            }
          );

这看起来更接近您的预期输出。但当然，我的猜测可能完全不正确......

如果我正确理解了对数分数的计算，那么每个主题的分数是一个常数，因此可以预先计算。这将导致以下更直接的方法：

#!/usr/bin/perl
use strict;
use warnings;

use Data::Dumper;

my $width = 3;

my %logodds = (
    A => [0.1, 0.2, 0.3],
    C => [0.2, 0.5, 0.2],
    G => [0.3, 0.2, 0.4],
    T => [0.4, 0.1, 0.1],
);

# calculate log score for each motif combination
my $motif_score = {'' => 0}; # start with a 0-length motif
foreach my $offset (0..$width - 1) {
    my %scores;

    # for all motifs...
    foreach my $prefix (keys %{ $motif_score }) {
        my $base_score = $motif_score->{$prefix};

        # ... add another base to the motif
        for my $base (qw(A G C T)) {
            $scores{"${prefix}${base}"} = $base_score + $logodds{$base}[$offset];
        }
    }

    # store the scores for the new sequences
    $motif_score = \%scores;
}

#print Data::Dumper->Dump([$motif_score], [qw(motif_score)]);

my $id;
my %maxmot;
while (<DATA>) {
    if (/^>(.+)/) {
        $id = $1;
    } else {
        chomp(my $sequence = $_);
        my $max = -1;

        # run a window of length $width over the sequence
        foreach my $index (0..length($sequence) - $width - 1) {

            # extract the motif at $index from sequence
            my $motif = substr($sequence, $index, $width);
            my $score = $motif_score->{$motif};

            # update max score if the motif has a higher score
            if ($score > $max) {
                $max         = $score;
                $maxmot{$id} = $motif;
            }
        }
    }
}

print Data::Dumper->Dump([\%maxmot], [qw(*maxmot)]);

exit 0;

__DATA__
>Sequence_1
TCAGAACCAGTTATAAATTTATCATTTCCTTCTCCACTCCT
>Sequence_2
CCCACGCAGCCGCCCTCCTCCCCGGTCACTGACTGGTCCTG
>Sequence_3
TCGACCCTCTGGAACCTATCAGGGACCACAGTCAGCCAGGCAAG

试运行：

$ perl dummy.pl
%maxmot = (
            'Sequence_2' => 'TCA',
            'Sequence_3' => 'TCG',
            'Sequence_1' => 'TCA'
          );

由于每个motif的logscore是一个常数，所以按logscore顺序排序的motif列表也将是一个常数。鉴于该列表，我们只需找到与给定序列匹配的第一个基序。因此我们可以在这个问题上应用高度优化的正则表达式引擎。根据您的实际问题规模，这可能是更有效的解决方案：

#!/usr/bin/perl
use warnings;
use strict;

use List::Util qw(first pairs);
use Data::Dumper;

my $width = 3;

my %logodds = (
    A => [0.1, 0.2, 0.3],
    C => [0.2, 0.5, 0.2],
    G => [0.3, 0.2, 0.4],
    T => [0.4, 0.1, 0.1],
);
my @bases = keys %logodds;

# calculate log score for each motif combination
my $motif_logscore = {'' => 0}; # start with a 0-length motif
foreach my $offset (0..$width - 1) {
    my %score;

    # for all motifs...
    foreach my $prefix (keys %{ $motif_logscore }) {
        my $base_score = $motif_logscore->{$prefix};

        # ... add another base to the motif
        for my $base (@bases) {
            $score{"${prefix}${base}"} = $base_score + $logodds{$base}[$offset];
        }
    }

    # update hash ref to new motif scores
    $motif_logscore = \%score;
}

#print Data::Dumper->Dump([$motif_logscore], [qw(motif_logscore)]);

my @motifs_sorted =
    # list of [<motif>, <regular expression>] array refs
    map    { [$_->[0], qr/$_->[0]/] }
    # sort in descending numerical score order
    sort   { $b->[1] cmp $a->[1] }
    # list of [<motif>, <score>] array refs
    pairs %{ $motif_logscore };

#print Data::Dumper->Dump([\@motifs_sorted], [qw(*motifs_sorted)]);

my $id;
my %maxmot;
while (<DATA>) {
    if (/^>(.+)/) {
        $id = $1;
    } else {
        my $sequence = $_;
        # find the first pair where the regex matches -> store motif
        $maxmot{$id} = (
            first { ($sequence =~ $_->[1])[0] } @motifs_sorted
        )->[0];
    }
}

print Data::Dumper->Dump([\%maxmot], [qw(*maxmot)]);

exit 0;

__DATA__
>Sequence_1
TCAGAACCAGTTATAAATTTATCATTTCCTTCTCCACTCCT
>Sequence_2
CCCACGCAGCCGCCCTCCTCCCCGGTCACTGACTGGTCCTG
>Sequence_3
TCGACCCTCTGGAACCTATCAGGGACCACAGTCAGCCAGGCAAG

【讨论】：

我唯一不能真正解释的是为什么$width 是可配置的。存储在%logodds 中的数组引用也需要长度为$width。但目前它们的长度固定为 3。也许您的算法只适用于宽度为 1、2 和 3 的情况？
从您之前question 中对logscore 计算的解释来看，我认为我的猜测是正确的，上面的答案显示了正确的算法。对于您的示例CAG，日志分数将为0.2 + 0.2 + 0.4。序列 1 有 CAG f.ex。在索引1，因此$loscore{'Sequence_1'}[1] 是0.8。
添加了一个替代实现，基于每个主题的分数是一个常数并且可以按次计算。
$width 在这段代码中不可配置的原因是我想尽量保持帖子的简短。我上次发帖时被告知要这样做。您对对数赔率分数的计算是正确的
我有一种预感，%logodds 看起来很奇怪。那么其中的值是否取决于$width，或者您是否需要为每个序列计算%logodds？我提出的任何解决方案都可以解决您的问题吗？

【解决方案2】：

也许您不需要数组而是哈希？

将推送改为

undef $maxmot{$id}{ $HoA{$id}[$pos3] };

它创建一个散列（具有未定义的值，因此只有键是重要的）。在输出中，我在链接问题的输入中看到每个序列的键不超过 3 个。

【讨论】：

我有一个文件，其中包含我正在使用的 29 个序列。我没有将所有这些都包括在内以试图使问题简短。
我想要一个数组的散列，因为我想稍后在代码中访问每个位置的碱基 - 我将在位置创建每个碱基的计数，操纵计数等。我想数组将允许这样做。不过，我愿意接受更好的方法来做到这一点。
对于计数，您可以使用++$maxmot{$id}{ $HoA{$id}[$pos3] }。我不明白您所说的“将返回最佳对数赔率分数的顺序”是什么意思。
下面是一个例子：假设Sequence 1 是ACCTGA，我们使用的主题宽度为3。自然地，我们知道一个主题的选项是ACC, CCT, CTG, TGA。因此，根据每个对应的分数，这些主题之一将返回最高分数。我想我在这里给出了计算对数赔率分数的更好解释：stackoverflow.com/questions/55326503/…