Count occurrences of every character per line position in Perl: answers

【Question title】: Count occurrences of every character per every line position in Perl
【Posted】: 2015-12-07 17:47:55
【Question】:

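Similar to the question "unix - count occurrences of character per line/field", but for every character in every position on the line.

Given a file of about 1e7 lines of roughly 500 characters each, I want a two-dimensional summary structure such as $summary{'a','b','c','0','1','2'}[pos 0..499] = count_integer that shows how many times each character was used at each position of the line. Either order of the dimensions is fine.

My first method did ++$summary{char}[pos] while reading, but since many lines are identical, it was much faster to count identical lines first and then do $summary{char}[pos] += n once per distinct line.

Is there a more idiomatic or faster way than the following C-like 2D loop?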

#!perl 
my ( %summary, %counthash ); # perl 5.8.9

sub method1 {
    print "method1\n";
    while (<DATA>) {
        my @c = split( // , $_ );
        ++$summary{ $c[$_] }[$_] foreach ( 0 .. $#c );
    }    # wend
} ## end sub method1

sub method2 {
    print "method2\n";
    ++$counthash{$_} while (<DATA>);    # slurpsum the whole file

    foreach my $str ( keys %counthash ) {  
        my $n = $counthash{$str};
        my @c = split(//, $str);
        $summary{ $c[$_] }[$_] += $n foreach ( 0 .. $#c );
    }    #rof  my $str
} ## end sub method2

# MAINLINE
if (rand() > 0.5) { &method1 } else { &method2 }
print "char $_ : @{$summary{$_}} \n" foreach ( 'a', 'b' );
# both methods have this output summary
# char a : 3 3 2 2 3 
# char b : 2 2 3 3 2 
__DATA__
aaaaa
bbbbb
aabba
bbbbb
aaaaa

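【Comments】:

With that sample data it is hard to picture what you are looking for. I assume your real scenario is not as simple as lines full of one repeated character? Also: use strict; use warnings; is a good idea.
The only inefficiency/non-idiomaticity(?) I see is that you are also counting all the line-termination characters (newline and/or CR). (Perl includes them in $_ unless you do something about it.) Insert a chomp; after each <DATA> read.
@JeffY: "unidiomaticity", I believe.
Are these DNA sequences?
The real data is TDL, a form of VHDL vectors using the characters HLCM01Z, and I am looking for which pins/columns are used versus static. I do have use warnings; use strict; in the real program, but I neglected to include them in the sample program for posting. @Sobrique @JeffY @Borodin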
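
A minimal sketch of the two suggestions made in the comments above (chomp after each read, plus use strict; use warnings;). The inline @lines array is an assumption that stands in for the real input file; everything else follows method1 from the question.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumption: sample lines standing in for the real file (same data as the question).
my @lines = ( "aaaaa\n", "bbbbb\n", "aabba\n", "bbbbb\n", "aaaaa\n" );

my %summary;    # $summary{char}[pos] = count

for my $line (@lines) {
    # chomp removes the trailing newline ($/) so it is not counted as a character;
    # a CR from CRLF files would need e.g. s/\r?\n\z// instead.
    chomp $line;
    my @c = split //, $line;
    ++$summary{ $c[$_] }[$_] for 0 .. $#c;
}

for my $char ( sort keys %summary ) {
    # Positions where a character never occurs are undef; show them as 0.
    my @counts = map { defined $_ ? $_ : 0 } @{ $summary{$char} };
    print "char $char : @counts\n";
}
# prints:
# char a : 3 3 2 2 3
# char b : 2 2 3 3 2
```

Without the chomp, %summary would also gain a "\n" key with a count at position 5, which is harmless for the printed 'a' and 'b' rows but adds one wasted update per line.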

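Tags: perl hashmap idioms

【Solution 1】:

Depending on how your data is formed, method 2 may be faster or slower than method 1.

But one big difference comes from using unpack instead of split.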

use strict;
use warnings;
my ( %summary, %counthash ); # perl 5.8.9

sub method1 {
    print "method1\n";
    my @l= <DATA>;
    for  my $t(1..1000000) {
        foreach (@l) {
            my @c = split( // , $_ );
            ++$summary{ $c[$_] }[$_] foreach ( 0 .. $#c );
        }    
    }    # wend
} ## end sub method1

sub method2 {
    print "method2\n";
    ++$counthash{$_} while (<DATA>);    # slurpsum the whole file
    for  my $t(1..1000000) {
        foreach my $str ( keys %counthash ) {  
            my $n = $counthash{$str};
            my $i = 0;
            $summary{ $_ }[$i++] += $n foreach ( unpack("c*",$str) );
        }    
    }
} ## end sub method2

# MAINLINE
#method1();
method2();
print "char $_ : ". join (" ", @{$summary{ord($_)}}). " \n"
    foreach ( 'a', 'b' );
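# with the 1..1000000 benchmark loop removed, method2 prints this summary
# (with the loop in place every count is multiplied by 1000000):
# char a : 3 3 2 2 3 
# char b : 2 2 3 3 2 
# note: method1 keys %summary by character, so for method1 the print
# above must use $summary{$_} instead of $summary{ord($_)}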
__DATA__
aaaaa
bbbbb
aabba
bbbbb
aaaaa

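This runs faster (about 6 seconds instead of 7.x seconds on my PC).

【Comments】:

Did you test it? unpack("c*",$str) produces the wrong summary keys 98 and 97 instead of 'a' and 'b'; 'a*' does not work; this works: $summary{ $_ }[$i++] += $n foreach ( unpack('a' x length($str),$str) ); this also works: $summary{ chr($_)}[$i++] += $n foreach ( unpack('c*',$str) );
$summary{ substr($str,$_,1) }[$_] += $n foreach ( 0..(length($str)-1)); # just as fast
@jgrabber Yes, I did, and it worked. unpack simply returns the inverse of chr, so in my code I print $summary{ord($_)}, as you may have noticed... But the solution with length and substr is even faster. The original code (executed a million times) took 7.177 seconds on my PC, the unpack solution took 5.879 seconds, and the length-and-substr solution only 4.286 seconds.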
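
The three variants discussed above (split, unpack with chr, and substr) can be compared side by side with the core Benchmark module. This is only a sketch under assumptions: the %counthash contents are the question's sample data already chomped and pre-counted, and the sub names are made up for illustration. Timings will differ by machine, Perl version, and line length, so no figures are shown here.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Assumption: identical lines already counted (line => number of occurrences).
my %counthash = ( 'aaaaa' => 2, 'bbbbb' => 2, 'aabba' => 1 );

sub by_split {
    my %s;
    for my $str ( keys %counthash ) {
        my $n = $counthash{$str};
        my @c = split //, $str;
        $s{ $c[$_] }[$_] += $n for 0 .. $#c;
    }
    return \%s;
}

sub by_unpack {
    my %s;
    for my $str ( keys %counthash ) {
        my $n = $counthash{$str};
        my $i = 0;
        # unpack 'C*' yields character codes; chr() turns them back into characters
        $s{ chr($_) }[ $i++ ] += $n for unpack( 'C*', $str );
    }
    return \%s;
}

sub by_substr {
    my %s;
    for my $str ( keys %counthash ) {
        my $n = $counthash{$str};
        $s{ substr( $str, $_, 1 ) }[$_] += $n for 0 .. length($str) - 1;
    }
    return \%s;
}

# Run each variant for at least 1 CPU second and print a comparison table.
cmpthese( -1, {
    'split'  => \&by_split,
    'unpack' => \&by_unpack,
    'substr' => \&by_substr,
} );
```

All three subs build the same character-keyed structure, so the printing code from the question works unchanged with any of them.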
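> so in my code I print $summary{ord($_)}, as you may have noticed.
Comment edit timed out: @Georg, just one b in jgraber: I did not notice the ord in the print until you pointed it out. In a similar application that fetches data only for user-selected columns, the fastest method I have seen was to build the code for a bunch of substr() calls into a string, eval it to get a compiled subroutine, and then call that. Regarding perlidioms, link demonstrates using map for similar loops. Perhaps there will be no way to add to a 2D slice until perl6; @summary{ slice} [0..m] ^+= $n x (slice * m);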
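
The eval approach described in the last comment could look roughly like the sketch below. It is an illustration only, not the commenter's actual code: the column list @cols, the variable names, and the sample %counthash are all assumptions.

```perl
use strict;
use warnings;

# Hypothetical list of 0-based column positions selected by the user.
my @cols = ( 0, 2, 4 );

# Build the source of an anonymous sub with one substr() statement per column.
# The generated sub takes (\%summary, $str, $n).
my $code = "sub {\n    my (\$s, \$str, \$n) = \@_;\n";
$code .= "    \$s->{ substr(\$str, $_, 1) }[$_] += \$n;\n" for @cols;
$code .= "}\n";

# Compile the generated source once; the resulting code ref is reused for every line.
my $add_line = eval $code or die "generated code failed to compile: $@";

# Assumption: identical lines already counted (line => number of occurrences).
my %counthash = ( 'aaaaa' => 2, 'bbbbb' => 2, 'aabba' => 1 );

my %summary;
$add_line->( \%summary, $_, $counthash{$_} ) for keys %counthash;

for my $char ( sort keys %summary ) {
    my @cells = map { $summary{$char}[$_] || 0 } @cols;
    print "char $char at columns @cols : @cells\n";
}
# prints:
# char a at columns 0 2 4 : 3 2 3
# char b at columns 0 2 4 : 2 3 2
```

The generated sub has no inner loop and uses constant offsets, which is presumably where the speedup comes from; whether it beats the plain substr loop on real data would need measuring.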