Improving the performance of a Perl file-search script: answers

[Question title]: Improve performance of Perl search file script
[Posted]: 2013-09-01 19:42:26
[Question description]:

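I recently noticed that a quick script I wrote in Perl for files under 10 MB has been modified, re-tasked, and is now used on text files over 40 MB, where it has serious performance problems in a batch environment.

When a job hits one of the large text files, each run takes about 12 hours. How can I improve the performance of the code? Should I slurp the file into memory, and if I do, will that break the job's reliance on line numbers in the file? Any constructive thoughts would be greatly appreciated. I know the job loops through the file far too many times, but how can I reduce that?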

#!/usr/bin/perl
use strict;
use warnings;

my $filename = "$ARGV[0]"; # This is needed for regular batch use 
my $cancfile = "$ARGV[1]"; # This is needed for regular batch use 
my @num =();
open(FILE, "<", "$filename") || error("Cannot open file ($!)");
while (<FILE>)
{
    push (@num, $.) if (/^P\|/)
}
close FILE;

my $start;
my $end;

my $loop = scalar(@num);
my $counter =1;
my $test;

open (OUTCANC, ">>$cancfile") || error ("Could not open file: ($!)");

#Lets print out the letters minus the CANCEL letters
for ( 1 .. $loop )
{
    $start = shift(@num) if ( ! $start );
    $end = shift(@num);
    my $next = $end;
    $end--;
    my $exclude = "FALSE";

    open(FILE, "<", "$filename") || error("Cannot open file ($!)");
    while (<FILE>)
    {
        my $line = $_;
        $test = $. if ( eof );
        if ( $. == $start && $line =~ /^P\|[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]\|1I\|IR\|/)
        {
            print OUTCANC "$line";
            $exclude = "TRUECANC";
            next;
        }
        if ( $. >= $start && $. <= $end && $exclude =~ "TRUECANC")
        {
            print OUTCANC "$line";
        } elsif ( $. >= $start && $. <= $end && $exclude =~ "FALSE"){
            print $_;
        }
    }
    close FILE;
    $end = ++$test if ( $end < $start );
    $start = $next if ($next);
}


#Lets print the last letter in the file

my $exclude = "FALSE";

open(FILE, "<", "$filename") || error("Cannot open file ($!)");
while (<FILE>)
{
    my $line = $_;
    if ( $. == $start && $line =~ /^P\|[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]\|1I\|IR\|/)
    {
        $exclude = "TRUECANC";
        next;
    }
    if ( $. >= $start && $. <= $end && $exclude =~ "TRUECANC")
    {
        print OUTCANC "$line";
    } elsif ( $. >= $start && $. <= $end && $exclude =~ "FALSE"){
        print $_;
    }
}
close FILE;
close OUTCANC;


#----------------------------------------------------------------

sub message
{
    my $m = shift or return;
    print("$m\n");
}

sub error
{
    my $e = shift || 'unknown error';
    print("$0: $e\n");
    exit 0;
}

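[Question comments]:

What does the script do? What is a typical input, and what is the corresponding expected output?
Maybe you could trim the code down and find out where it actually spends its time. That way you will get better answers, and you may even figure it out yourself :)
A 40 MB file fits easily in memory.
You might want to try this on Code Review rather than here; it is more on topic there.
$exclude seems to hold only two distinct values. Use a 0/1 flag so that you can test it as a boolean instead of through a text match on every line (for maintainability you can of course use manifest constants instead of 0/1 literals). Another option might be to read the file in slurp mode ({ local $/ = undef; $fcontent = <FILE>; }, note the wrapping in a new block) and match against $fcontent, using the \G anchor in the regex to move from one match to the next.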
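
The slurp-mode suggestion in the last comment could look roughly like the sketch below. The helper names slurp_file and header_lines are hypothetical and do not appear in the original script; the sketch only shows how local $/ and a \G-anchored match with the /gc flags walk through the content one line at a time.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: read a whole file into one string.
sub slurp_file {
    my ($path) = @_;
    open my $fh, '<', $path or die "$0: cannot open $path: $!";
    # Localising $/ inside the sub means normal line-by-line reading
    # is restored as soon as the sub returns.
    local $/ = undef;
    my $content = <$fh>;
    close $fh;
    return defined $content ? $content : '';
}

# Hypothetical helper: return every line that starts with "P|".
# \G anchors each match at the position where the previous one ended.
sub header_lines {
    my ($content) = @_;
    my @headers;
    pos($content) = 0;
    while ($content =~ /\G([^\n]*\n?)/gc) {
        my $line = $1;
        last if $line eq '';    # empty match: end of content reached
        push @headers, $line if "P|" eq substr $line, 0, 2;
    }
    return @headers;
}
```

A call such as header_lines(slurp_file($filename)) would then return the header lines of all letters. Whether this is any faster than reading into an array of lines is not something the comment claims, so it would need measuring.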

Tags: regex perl perl-data-structures

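[Solution 1]:

A few things would speed up the script, such as removing unnecessary uses of regular expressions.

/^P\|/ is equivalent to "P|" eq substr $_, 0, 2.
$foo =~ "BAR" could be written as -1 != index $foo, "BAR".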
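
As a minimal sketch, the two replacements can be wrapped in small subs to check that they behave like the regex versions. The sub names starts_with_p_pipe and contains are made up for illustration. Note that the index form is only equivalent when the pattern is a plain literal without regex metacharacters, as "TRUECANC" and "FALSE" are.

```perl
use strict;
use warnings;

# Prefix test without the regex engine; same result as $line =~ /^P\|/.
sub starts_with_p_pipe {
    my ($line) = @_;
    return "P|" eq substr $line, 0, 2;
}

# Literal substring test; same result as $haystack =~ "BAR" when the
# needle contains no regex metacharacters.
sub contains {
    my ($haystack, $needle) = @_;
    return -1 != index $haystack, $needle;
}
```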

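Then there is some duplicated code. Factoring it out into a sub will not improve performance by itself, but it makes the behaviour of the script easier to reason about.

There is a lot of unnecessary stringification, for example "$filename"; plain $filename is enough.

But the worst offender is this: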

for ( 1 .. $loop ) {
  ...
  open FILE, "<", $filename or ...
  while (<FILE>) {
    ...
  }
  ...
}

You only need to read that file once, preferably into an array. You can then loop over the indices:

for ( 1 .. $loop ) {
  ...
  for my $i (0 .. $#file_contents) {
    my $line = $file_contents[$i];
    ... # swap $. for $i, but avoid off-by-one error
  }
  ...
}

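Disk IO is slow, so cache wherever you can!

I also see that you use the $exclude variable as a boolean with the values FALSE and TRUECANC. Why not use 0 and 1, so that it can be used directly in a condition?

You can factor out the common tests in the if/elsif: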

if    (FOO && BAR) { THING_A }
elsif (FOO && BAZ) { THING_B }

should be

if (FOO) {
    if    (BAR) { THING_A }
    elsif (BAZ) { THING_B }
}
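
Applied to the script's range test together with a 0/1 flag, the factoring might look like the sketch below. The sub name destination and its return values are hypothetical and only illustrate the control flow; the real script would print to a filehandle instead of returning a label.

```perl
use strict;
use warnings;

# Hypothetical helper: decide where a line belongs.
# $lineno, $start and $end are line numbers; $is_cancel is a plain 0/1 flag.
sub destination {
    my ($lineno, $start, $end, $is_cancel) = @_;
    if ($lineno >= $start && $lineno <= $end) {    # common test, evaluated once
        return $is_cancel ? 'cancel' : 'stdout';   # flag used directly as a boolean
    }
    return 'skip';
}
```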

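The $. == $start && $line =~ /^P\|.../ test is probably pointless, because $start only ever holds the number of a line that starts with P|, so the regex alone may be sufficient here.

Edit

If I have understood the script correctly, the following should give a significant performance improvement: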

#!/usr/bin/perl
use strict;
use warnings;

my ($filename, $cancfile) = @ARGV;
open my $fh, "<", $filename or die "$0: Couldn't open $filename: $!";

my (@num, @lines);
while (<$fh>)
{
    push @lines, $_;
    push @num, $#lines if "P|" eq substr $_, 0, 2;
}

open my $outcanc, ">>", $cancfile or die "$0: Couldn't open $cancfile: $!";

for my $i ( 0 .. $#num )
{
    my $start = $num[$i];
    my $end   = ($num[$i+1] // @lines) - 1;
    # pre v5.10:
    # my $end = (defined $num[$i+1] ? $num[$i+1] : @lines) - 1;

    if ($lines[$start] =~ /^P[|][0-9]{9}[|]1I[|]IR[|]/) {
        print {$outcanc} @lines[$start .. $end];
    } else {
        print STDOUT     @lines[$start .. $end];
    }
}

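The script has been cleaned up. The file is cached in an array. Only the parts of the array that are actually needed are iterated over, so we are down from the previous O(n·m) to O(n), where n is the number of lines and m the number of P| blocks.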
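
If holding the whole file in memory were ever a concern, the same routing could probably be done in one streaming pass without the array. This is a different technique from the array-based code above, not part of the original answer, and the sub name split_letters is hypothetical. It assumes, as the array version does, that lines before the first P| line are not printed anywhere.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: copy each block that begins with a "P|" line from
# $in to either $cancel_out or $keep_out, reading the input only once.
sub split_letters {
    my ($in, $keep_out, $cancel_out) = @_;
    my $out;    # stays undef until the first "P|" line is seen
    while (my $line = <$in>) {
        if ("P|" eq substr $line, 0, 2) {
            # A new letter starts here; choose its destination once.
            $out = $line =~ /^P[|][0-9]{9}[|]1I[|]IR[|]/ ? $cancel_out : $keep_out;
        }
        print {$out} $line if $out;    # lines before the first "P|" are dropped
    }
    return;
}
```

With the filehandles from the script above, the call would be split_letters($fh, \*STDOUT, $outcanc). Only one line is held at a time, and no line numbers are needed at all.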

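For your future scripts: proving properties of loops and mutated variables is not impossible, but it is tedious and annoying. Consider this loop: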

for (1 .. @num) {
  $start = shift @num unless $next;  # aka "do this only in the first iteration"
  $next = shift @num;
  $end = $next - 1;
  while (<FH>) {
    ...
    $test = $. if eof;
    ...
  }
  $end = ++$test if $end < $start;
  $start = $next if $next;
}

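It takes some time to realize that this loop is really just about working around a possible undef from the second shift. There is no need to test for eof inside the inner loop; we can simply pick up the line number after the loop, so we do not need $test. Then we get: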

$start = $num[0];  # no shift here, so the indices below stay aligned with @num
for my $i (1 .. @num) {
  $end = $num[$i] - 1;

  while (<FH>) { ... }

  $end = $. + 1 if $end < $start;  # $end < $start only true if not defined $num[$i]
  $start = $num[$i] if $num[$i];
}

After shifting $i down by one, we have confined the out-of-bounds problem to a single spot:

for my $i (0 .. $#num) {
  $start = $num[$i];
  $end = $num[$i+1] - 1; # HERE: $end = -1 if $i == $#num

  while (<FH>) { ... }
}
$end = $. + 1 if $end < $start;

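After replacing the file reads with an array (careful: array indices and line numbers differ by one), we see that the final file-reading loop can be avoided if we pull that iteration into the for loop, because we know how many lines there are in total. In effect, we write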

$end = ($num[$i+1] // $last_line_number) - 1;
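
To make the boundary handling concrete, here is a small sketch that computes the index range of every block from the marker indices and the total line count. The sub name block_ranges is hypothetical. It uses an explicit index comparison instead of the // operator, which is an assumption made so that it also runs on Perl versions before 5.10.

```perl
use strict;
use warnings;

# Hypothetical helper: given the array indices of the "P|" marker lines and
# the total number of lines, return one [start, end] index pair per block.
sub block_ranges {
    my ($markers, $line_count) = @_;
    my @ranges;
    for my $i (0 .. $#$markers) {
        # The last block has no following marker, so it runs to the final line.
        my $next = $i < $#$markers ? $markers->[$i + 1] : $line_count;
        push @ranges, [ $markers->[$i], $next - 1 ];
    }
    return @ranges;
}
```

For markers at indices 0, 3 and 7 in a 10-line file, this gives the ranges 0..2, 3..6 and 7..9, so the last block is covered without a separate pass over the file.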

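I hope my cleaned-up code really does behave the same as the original.

[Discussion]:

I updated the answer with a cleaned-up code example that should be faster. Because I have no test data, I am not sure it is correct, so I had to fall back on informal, error-prone proofs about program state.
I think you did understand the script correctly. I ran some tests on my dev machine and everything works as expected. I would like to run the code against some larger files and report back, if that is OK? I also need some time to work through your comments; thank you for the thoughtful and detailed input.
The code runs fine on ActivePerl 5.14.2, but I ran into some problems running it on Solaris 5.8.4. I will not get a chance to look at it again until tomorrow.
I added a few comments and two filehandle closes, which apparently did not work. The code works with the pre-5.10 revision. The only issue is that it does not print the last record in the file to STDOUT. I capture STDOUT through redirection, and that output is what I actually use; $cancfile is just a record of what was removed. When tested on the Solaris server, the run time dropped from more than 12 hours to a few seconds. A great example, thank you!
Works as expected with the changes, and the run time is now under 5 seconds. Perfect!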