Out of memory error while using Text::CSV_XS and reading multiple CSV files

【Question title】: Out of memory error while using Text::CSV_XS and reading multiple CSV files
【Posted】: 2014-12-10 08:01:11
【Question】:

Below is my code for printing the list of unique projects whose History is not empty.

use strict;
use warnings;
use Text::CSV_XS qw ( csv );

my $q = 0;
my $r = 0;
my @array1;
my @array2;
my @array3;
my %uniqueproject;
my @files = glob("*.csv");
foreach $s (@files) {
    open( my $fh, "<", "$s" ) or die "cannot open the file $!";
    my @aoh = @{ csv( in => $fh, headers => "auto" ) };
    foreach my $i (@aoh) {
        if ( defined( $aoh[$q]{History} ) ) {
            if ( $aoh[$q]{History} ne "" ) {
                $array1[$r] = $aoh[$q]{PROJECT};
                $array2[$r] = $aoh[$q]{IDENTIFIER};
                $r++;
            }
        }
        $q++;
    }
    close($fh);
}
foreach (@array1) {
    $uniqueproject{$_} = 1;
}
@array3 = keys(%uniqueproject);
foreach (@array3) {
    print $_. "\n";
}

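The code above works fine if there is only one CSV file in the folder. With multiple CSV files I get an out of memory error. I can't work out the reason for this error. Please let me know what is filling up the memory. If a foreach loop is not the right way to iterate over the files, please suggest the correct loop.

My sample CSV files are

test1.csv: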

"SEVERITY","DESCRIPTION","PROJECT","Attachments","priority","IDENTIFIER","STATUS","History","TITLE"
"3","fdlkfjalskfjlskfla
fkdalsfjkasljfkl
dksajdfklsajkl","hadkf","dklsfj/dksfj.dskak/fsajk","4","123","pending","repeat","test csv"
"3","fdlkfjalskfjlskfla
fkdalsfjkasljfkl
dksajdfklsajkl","hadkf","dklsfj/dksfj.dskak/fsajk","4","124","pending","repeat","test csv"
"3","fdlkfjalskfjlskfla
fkdalsfjkasljfkl
dksajdfklsajkl","hadkf","dklsfj/dksfj.dskak/fsajk","4","125","pending","repeat","test csv"
"3","fdlkfjalskfjlskfla
fkdalsfjkasljfkl
dksajdfklsajkl","hadkf","dklsfj/dksfj.dskak/fsajk","4","126","pending","repeat","test csv"

test2.csv:

"SEVERITY","DESCRIPTION","PROJECT","Attachments","priority","IDENTIFIER","STATUS","History","TITLE"
"3","fdlkfjalskfjlskflafkdalsfjkasljfkldksajdfklsajkl","hadkf3","dklsfj/dksfj.dskak/fsajk","4","123","pending","repeat","test csv"
"3","fdlkfjalskfjlskfla
fkdalsfjkasljfkl
dksajdfklsajkl","hadkf4","dklsfj/dksfj.dskak/fsajk","4","124","pending","repeat","test csv"
"3","fdlkfjalskfjlskfla
fkdalsfjkasljfkl
dksajdfklsajkl","hadkf4","dklsfj/dksfj.dskak/fsajk","4","125","pending","repeat","test csv"
"3","fdlkfjalskfjlskfla
fkdalsfjkasljfkl
dksajdfklsajkl","hadkf4","dklsfj/dksfj.dskak/fsajk","4","126","pending","repeat","test csv"

【Comments】:

Tags: perl csv perl-module

【Solution 1】:

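It's not entirely clear to me what you mean by "unique" projects, but I'm assuming that you are trying to extract all the IDs and projects that have a value in History. If it's something else, you'll have to edit your question to clarify. Unfortunately the test data you provided is rubbish, so I can't tell whether IDENTIFIER and PROJECT are both unique: several rows with different IDs have the same PROJECT name. I'm assuming that IDENTIFIER is the unique identifier.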

use warnings;
use strict;
use Data::Dumper;
use feature ':5.10';

use Text::CSV_XS qw ( csv );

# we will store project info in this hash
my %unique;
my @files = glob("*.csv");

for my $s (@files) {
    open (my $fh, "<","$s") or die "cannot open the file $!";
    my @aoh = @{csv (in => $fh, headers => "auto")};

    # go through the results...
    for (@aoh) {
        # if 'History' is defined and has some content (\w tests for alphanumeric chars)
        if ($_->{History} && $_->{History} =~ /\w/) {
            # add it to the hash of unique projects
            # store the ID as the key and the project name as the value
            $unique{ $_->{IDENTIFIER} } = $_->{PROJECT};
        }
    }
    close ($fh);
}

# now you can go through the hash of projects and print out the ID and project name
for (keys %unique) {
    say "id: $_; project: $unique{$_}";
}
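
If the goal is instead the list of distinct PROJECT names (which is what the question's own %uniqueproject hash collects), the same loop can key the hash on PROJECT. This is only a minimal sketch: it assumes rows shaped like the hashes that csv() returns with headers => "auto", and both the @rows data and the unique_projects helper are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: takes a reference to an array of hash references
# (one hash per CSV row) and returns the sorted, distinct PROJECT names
# of the rows whose History field is defined and non-empty.
sub unique_projects {
    my ($rows) = @_;
    my %seen;
    for my $row (@$rows) {
        my $history = $row->{History};
        next unless defined $history && $history ne "";
        $seen{ $row->{PROJECT} } = 1;
    }
    return sort keys %seen;
}

# Illustrative rows, loosely based on the sample files in the question.
my @rows = (
    { PROJECT => "hadkf",  IDENTIFIER => "123", History => "repeat" },
    { PROJECT => "hadkf",  IDENTIFIER => "124", History => "repeat" },
    { PROJECT => "hadkf3", IDENTIFIER => "123", History => "repeat" },
    { PROJECT => "hadkf4", IDENTIFIER => "124", History => "" },
);

print "$_\n" for unique_projects( \@rows );
```

Note that in the question's sample data IDENTIFIER 123 appears in both files with different PROJECT values, so a hash keyed on IDENTIFIER alone keeps only the last project seen for that ID.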

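The reason your code doesn't work has to do with the way you check the items. After each file is parsed, you go through the array of hashes produced by the parse, but you use a mixture of a numeric index and a loop variable to refer to what should be the same entity. For example: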

foreach my $i (@aoh) {
    if ( defined( $aoh[$q]{History} ) ) {
        if ( $aoh[$q]{History} ne "" ) {

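Inside the foreach loop you don't need to refer to $aoh[$q]: the current row is already in $i, so you can write if ( defined $i->{History} ). Using the numeric index becomes a problem because you never reset it to 0 after the first file, so when you start on the results of file 2, $q is not 0; it is already set to the number of rows in the first file. If the first file had six rows, for example, the first time if (defined $aoh[$q]{History}) runs on the file 2 results it looks at $aoh[6]{History} instead of $aoh[0]{History}! Unfortunately, when you look up $aoh[6]{History}, Perl assumes that $aoh[6] exists and creates it if it doesn't (this is called autovivification).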
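
The auto-creation described above can be reproduced in isolation. The following sketch uses core Perl only, with made-up @rows and @safe arrays, and shows that merely testing a nested key past the end of an array extends the array, while a bounds check beforehand avoids the side effect.

```perl
use strict;
use warnings;

# Two made-up rows standing in for a parsed CSV file.
my @rows   = ( { History => "repeat" }, { History => "" } );
my $before = scalar @rows;    # 2

# Reading a nested key at a non-existent index creates $rows[5] as an
# empty hash reference, which extends the array to six elements.
my $found = defined $rows[5]{History};
my $after = scalar @rows;     # 6

# Checking the index against the array bounds first avoids the side effect,
# because the nested lookup is never evaluated.
my @safe = ( { History => "repeat" } );
my $ok   = 7 <= $#safe && defined $safe[7]{History};

print "before: $before, after: $after, safe length: ", scalar(@safe), "\n";
```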

If you change your code to the following, you can get a good idea of what is happening:

foreach my $s (@files) {
    open( my $fh, "<", "$s" ) or die "cannot open the file $!";
    my @aoh = @{ csv( in => $fh, headers => "auto" ) };
    say "Parsed file $s; found " . @aoh . " entries";

    # add an accumulator 
    my $acc = 0;
    foreach my $i (@aoh) {
        say "looking at array entry $acc, aoh length: " . @aoh . "; q: $q; r: $r";
        if ( defined( $aoh[$q]{History} ) ) {
            if ( $aoh[$q]{History} ne "" ) {
                $array1[$r] = $aoh[$q]{PROJECT};
                $array2[$r] = $aoh[$q]{IDENTIFIER};
                $r++;
            }
        }
        $acc++;
        $q++;
        # die after 20 iterations or we'll be here all night!
        die if $acc == 20;
    }
    close($fh);
}

Partial output:

Parsed file file2.csv; found 10 entries
looking at array entry 0, aoh length: 10; q: 12; r: 4
looking at array entry 1, aoh length: 13; q: 13; r: 4
looking at array entry 2, aoh length: 14; q: 14; r: 4
looking at array entry 3, aoh length: 15; q: 15; r: 4
looking at array entry 4, aoh length: 16; q: 16; r: 4
looking at array entry 5, aoh length: 17; q: 17; r: 4
looking at array entry 6, aoh length: 18; q: 18; r: 4
looking at array entry 7, aoh length: 19; q: 19; r: 4
looking at array entry 8, aoh length: 20; q: 20; r: 4
looking at array entry 9, aoh length: 21; q: 21; r: 4
looking at array entry 10, aoh length: 22; q: 22; r: 4

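With every entry you examine, the array @aoh gets longer! Because foreach is walking that same array, the loop never reaches the end, and the script keeps allocating until it runs out of memory.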
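
If an index really is needed, one way to keep it from going stale is to derive the index range from the array afresh for every file. The sketch below assumes in-memory rows standing in for the result of csv(); the @parsed_files data is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for two parsed files: each element is the
# array of row hashes that parsing one file would produce.
my @parsed_files = (
    [
        { PROJECT => "hadkf", History => "repeat" },
        { PROJECT => "hadkf", History => "" },
    ],
    [
        { PROJECT => "hadkf3", History => "repeat" },
    ],
);

my @projects;
my @lengths;
for my $aoh (@parsed_files) {
    # The range 0 .. $#$aoh is computed once per file, so the index
    # always starts at 0 and never runs past the last row.
    for my $idx ( 0 .. $#$aoh ) {
        my $row = $aoh->[$idx];
        next unless defined $row->{History} && $row->{History} ne "";
        push @projects, $row->{PROJECT};
    }
    push @lengths, scalar @$aoh;    # row count after the loop
}

print "projects: @projects; row counts: @lengths\n";
```

Because every lookup stays inside the bounds of the current array, no extra elements are created and the row counts recorded after each loop match the input.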

【Comments】:

I can't use if (defined $i{History}); I have to use if (defined $i->{History}), since $i holds a hash reference.