Split up files according to column value with Perl Text::CSV: answers

【Question title】: Split up files according to column value perl text::csv
【Posted】: 2015-03-06 17:43:45
【Question description】:

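I previously asked this question about how to do this with AWK, but AWK does not handle it well. The data has semicolons inside quoted fields, and AWK does not take that into account. So I am trying it in Perl with the Text::CSV module, so that I don't have to deal with that myself. The problem is that I don't know how to write the rows to different files based on a column value.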

A short example from the previous question. Data:

10002394;"""22.98""";48;New York;http://testdata.com/bla/29012827.jpg;5.95;93962094820
10025155;27.99;65;Chicago;http://testdata.com/bla/29011075.jpg;5.95;14201021349
10003062;19.99;26;San Francisco;http://testdata.com/bla/29002816.jpg;5.95;17012725049
10003122;13.0;53;"""Miami""";http://testdata.com/bla/29019899.jpg;5.95;24404000059
10029650;27.99;48;New York;http://testdata.com/bla/29003007.jpg;5.95;3692164452
10007645;20.99;65;Chicago;"""http://testdata.com/bla/28798580.jpg""";5.95;10201848233    
10025825;12.99;65;Chicago;"""http://testdata.com/bla/29017837.jpg""";5.95;93962025367

Desired result:

File --> 26.csv
10003062;19.99;26;San Francisco;http://testdata.com/bla/29002816.jpg;5.95;17012725049

File --> 48.csv
10002394;22.98;48;New York;http://testdata.com/bla/29012827.jpg;5.95;93962094820
10029650;27.99;48;New York;http://testdata.com/bla/29003007.jpg;5.95;3692164452

File --> 53.csv
10003122;13.0;53;Miami;http://testdata.com/bla/29019899.jpg;5.95;24404000059

File --> 65.csv
10025155;27.99;65;Chicago;http://testdata.com/bla/29011075.jpg;5.95;14201021349
10007645;20.99;65;Chicago;http://testdata.com/bla/28798580.jpg;5.95;10201848233    
10025825;12.99;65;Chicago;http://testdata.com/bla/29017837.jpg;5.95;93962025367

This is what I have so far. EDIT: modified code:

#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV_XS;
#use Data::Dumper;
use Time::Piece;

my $inputfile  = shift || die "Give input and output names!\n";

open my $infile, '<', $inputfile or die "Sourcefile in use / not found :$!\n";

#binmode($infile, ":encoding(utf8)");

my $csv = Text::CSV_XS->new({binary => 1,sep_char => ";",quote_space => 0,eol => $/});

my %fh;
my %count;
my $country;
my $date = localtime->strftime('%y%m%d');

open(my $fh_report, '>', "report$date.csv");

$csv->getline($infile);

while ( my $elements = $csv->getline($infile)){

EDITED IN:
__________ 
next unless ($elements->[29] =~ m/testdata/);

for (@$elements){
        next if ($elements =~ /apple|orange|strawberry/);
        }
__________

for (@$elements){
        s/\"+/\"/g;
        }

    my $filename = $elements->[2];
    $shop = $elements->[3] .";". $elements->[2];

    $count{$country}++;

        $fh{$filename} ||= do {
            open(my $fh, '>:encoding(UTF-8)', $filename . ".csv") or die "Could not open file '$filename'";
            $fh;
        };

    $csv->print($fh{$filename}, $elements); 
    }

    #print $fh_report Dumper(\%count);
    foreach my $name (reverse sort { $count{$a} <=> $count{$b} or $a cmp $b } keys %count) {
        print $fh_report "$name;$count{$name}\n";
    }

close $fh_report;

Error:

Can't call method "print" on an undefined value at sort_csv_delimiter.pl line 28, <$infile> line 2

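I have been fiddling with this for a while, but I am at a complete loss. Can someone help me?

【Question comments】:

Tags: perl csv

【Solution 1】:

My guess is that you want a hash that caches the file handles: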

my %fh;
while ( my $elements = $csv->getline( $infile ) ) {

  my $filename = $elements->[2];

  $fh{$filename} ||= do {
    open my $fh, ">", "$filename.csv" or die $!;
    $fh;
  };

  # $csv->combine(@$elements);
  $csv->print($fh{$filename}, $elements);     
}
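
The fragment above can be fleshed out into a self-contained sketch. This is only an illustration under stated assumptions: the three input rows, the in-memory input handle and the temporary output directory are made up for the demo, and `quote_space => 0` is taken from the question's own constructor call so that values such as New York are written without quotes.

```perl
use strict;
use warnings;
use Text::CSV;
use File::Temp qw(tempdir);

# Same separator and quote_space setting as in the question
my $csv = Text::CSV->new({ binary => 1, sep_char => ';', quote_space => 0, eol => "\n" })
    or die "Cannot create CSV object: " . Text::CSV->error_diag;

# Hypothetical unsorted input, read from an in-memory file handle for the demo
my $input = join "\n",
    '10002394;22.98;48;New York',
    '10025155;27.99;65;Chicago',
    '10029650;27.99;48;New York',
    '';
open my $in, '<', \$input or die "Cannot open in-memory input: $!";

my $dir = tempdir(CLEANUP => 1);   # scratch directory, removed on exit
my %fh;                            # column value => open output handle

while (my $row = $csv->getline($in)) {
    my $key = $row->[2];
    # Open each output file only once; later rows reuse the cached handle
    $fh{$key} ||= do {
        open my $out, '>', "$dir/$key.csv" or die "Cannot open $key.csv: $!";
        $out;
    };
    $csv->print($fh{$key}, $row);
}

# Close explicitly so that write errors (for example a full disk) are reported
for my $key (sort keys %fh) {
    close $fh{$key} or die "Cannot close $key.csv: $!";
}
```

Because each file is opened once and its handle stays open for the whole run, the `>` mode does not truncate anything between rows, so the input does not need to be sorted. One handle stays open per distinct column value, so an input with very many distinct values could run into the operating system's open-file limit.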

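【Comments】:

It should be $filename . ".csv". I also think the open mode should be >, otherwise running the script several times will append extra data. Also, since $elements never changes, there is no need to reassemble the array. combine is meant to be used together with string(), so it is redundant here.
@TLP I have modified my code and incorporated your feedback (thanks to both of you, of course), but now it gives me 2 errors. Everything looks fine to me, so I don't understand why it gives me errors (see first post). Can you see what I'm missing?
@Сухой27 Thanks, modified slightly (see first post). Oh, the file isn't sorted, so shouldn't the open mode be append? I would rather not sort it, because the actual file is 1.5GB and sorting would probably take more time?
@Сухой27 Modified, but now there is a new error (see original post). I am a bit confused at the moment. How is it not defined?
@Сухой27 Then it says: Useless use of private variable in void context at sort_csv_delimiter.pl line 25. That confuses me again.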
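
On the "undefined value" error: in the posted code, print is called on $csv, so the message would appear if $csv itself were undefined. One possible cause, which is an assumption and not confirmed anywhere in this thread, is that the constructor returned undef, for instance because the installed module version did not recognise one of the attributes. A minimal sketch of how to detect that, using a deliberately invalid attribute named `no_such_option` (made up for the demo):

```perl
use strict;
use warnings;
use Text::CSV;

# 'no_such_option' is a made-up, deliberately invalid attribute
my $csv = Text::CSV->new({ sep_char => ';', no_such_option => 1 });

if (!defined $csv) {
    # Called as a class method, error_diag describes why construction failed
    my $why = "" . Text::CSV->error_diag;
    print "Constructor failed: $why\n";
}

# Defensive idiom: stop immediately if the object could not be created
my $ok = Text::CSV->new({ binary => 1, sep_char => ';' })
    or die "Cannot use CSV: " . Text::CSV->error_diag;
```

Checking the return value of new this way turns a confusing failure later in the loop into an immediate, descriptive error at the point where the parser is created.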

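【Solution 2】:

I don't see any instance of the problem you describe, where the semicolon separator ; appears inside a quoted field, but you are correct that Text::CSV will handle it properly.

This short program reads your sample data from the DATA file handle and prints the result to STDOUT. I presume you know how to read from or write to different files if you wish.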

use strict;
use warnings;

use Text::CSV;

my $csv = Text::CSV->new({ sep_char => ';', eol => $/ });

my @data;

while ( my $row = $csv->getline(\*DATA) ) {
  push @data, $row;
}

my $file;

for my $row ( sort { $a->[2] <=> $b->[2] or $a->[0] <=> $b->[0] } @data ) {
  unless (defined $file and $file == $row->[2]) {
    $file = $row->[2];
    printf "\nFile --> %d.csv\n", $file;
  }
  $csv->print(\*STDOUT, $row);
}

__DATA__
10002394;22.98;48;http://testdata.com/bla/29012827.jpg;5.95;93962094820
10025155;27.99;65;http://testdata.com/bla/29011075.jpg;5.95;14201021349
10003062;19.99;26;http://testdata.com/bla/29002816.jpg;5.95;17012725049
10003122;13.0;53;http://testdata.com/bla/29019899.jpg;5.95;24404000059
10029650;27.99;48;http://testdata.com/bla/29003007.jpg;5.95;3692164452
10007645;20.99;65;http://testdata.com/bla/28798580.jpg;5.95;10201848233    
10025825;12.99;65;http://testdata.com/bla/29017837.jpg;5.95;93962025367

Output

File --> 26.csv
10003062;19.99;26;http://testdata.com/bla/29002816.jpg;5.95;17012725049

File --> 48.csv
10002394;22.98;48;http://testdata.com/bla/29012827.jpg;5.95;93962094820
10029650;27.99;48;http://testdata.com/bla/29003007.jpg;5.95;3692164452

File --> 53.csv
10003122;13.0;53;http://testdata.com/bla/29019899.jpg;5.95;24404000059

File --> 65.csv
10007645;20.99;65;http://testdata.com/bla/28798580.jpg;5.95;"10201848233    "
10025155;27.99;65;http://testdata.com/bla/29011075.jpg;5.95;14201021349
10025825;12.99;65;http://testdata.com/bla/29017837.jpg;5.95;93962025367
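
Since the sample data contains no quoted semicolon, here is a small sketch of the case the question is worried about. The record is hypothetical (a city value with an embedded semicolon, invented for the demo); it compares a naive split on ; with Text::CSV's parse.

```perl
use strict;
use warnings;
use Text::CSV;

my $csv = Text::CSV->new({ sep_char => ';', binary => 1 })
    or die "Cannot create parser: " . Text::CSV->error_diag;

# Hypothetical record: the fourth field contains a semicolon inside quotes
my $line = '10002394;22.98;48;"New York; Manhattan";5.95';

# A naive split on ';' cuts the quoted field in two
my @naive = split /;/, $line;

# Text::CSV treats the quoted text as a single field
$csv->parse($line) or die "Parse failed: " . $csv->error_diag;
my @fields = $csv->fields;

printf "naive split: %d fields\n", scalar @naive;
printf "Text::CSV:   %d fields, fourth field = %s\n", scalar @fields, $fields[3];
```

The naive split yields six pieces, while Text::CSV returns five fields with the fourth one intact, which is why a split-based or default AWK approach misplaces columns on such rows.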

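Update

I have just realised that your "desired result" is not the output you expect to see on screen, but a description of how the individual records should be written to separate files. This program does that.

From your question it looks as though you also want the data sorted by the first field, so I have read the whole file into memory and printed a sorted version to the relevant files. I have also used autodie to avoid having to check the status of every IO operation.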

use strict;
use warnings;
use autodie;

use Text::CSV;

my $csv = Text::CSV->new({ sep_char => ';', eol => $/ });

my @data;

while ( my $row = $csv->getline(\*DATA) ) {
  push @data, $row;
}

my ($file, $fh);

for my $row ( sort { $a->[2] <=> $b->[2] or $a->[0] <=> $b->[0] } @data ) {
  unless (defined $file and $file == $row->[2]) {
    $file = $row->[2];
    open $fh, '>', "$file.csv";
  }
  $csv->print($fh, $row);
}

close $fh;

__DATA__
10002394;22.98;48;http://testdata.com/bla/29012827.jpg;5.95;93962094820
10025155;27.99;65;http://testdata.com/bla/29011075.jpg;5.95;14201021349
10003062;19.99;26;http://testdata.com/bla/29002816.jpg;5.95;17012725049
10003122;13.0;53;http://testdata.com/bla/29019899.jpg;5.95;24404000059
10029650;27.99;48;http://testdata.com/bla/29003007.jpg;5.95;3692164452
10007645;20.99;65;http://testdata.com/bla/28798580.jpg;5.95;10201848233    
10025825;12.99;65;http://testdata.com/bla/29017837.jpg;5.95;93962025367

【Comments】:

【Solution 3】:

FWIW, I got this done with Awk (gawk):

awk --assign col=2 'BEGIN { if(!(col ~/^[1-9]/)) exit 2; outname = "part-%s.txt"; } !/^#/ { out = sprintf(outname, $col); print > out; }' bigfile.txt

other_process data | awk --assign col=2 'BEGIN { if(!(col ~/^[1-9]/)) exit 2; outname = "part-%s.txt"; } !/^#/ { out = sprintf(outname, $col); print > out; }'

Let me explain the awk script:

BEGIN {                          # execution block before reading any file (once)
  if(!(col ~/^[1-9]/)) exit 2;   # assert the `col` variable is a positive number
  outname = "part-%s.txt";       # formatting string of the output file names
}
!/^#/ {                          # only process lines not starting with '#' (header/comments in various data files)
  out = sprintf(outname, $col);  # format the output file name, given the value in column `col`
  print > out;                   # put the line to that file
}

If you wish, you can add a variable to specify a custom file name, or use the current file name (or STDIN) as a prefix:

NR == 1 {                                                         # at the first file (not BEGIN, as we might need FILENAME)
  if(!(col ~/^[1-9]/)) exit 2;                                    # assert the `col` variable is a positive number
  if(!outname) outname = (FILENAME == "-" ? "STDIN" : FILENAME);  # if `outname` variable was not provided (with `-v/--assign`), use current filename or STDIN
  if(!(outname ~ /%s/)) outname = outname ".%s";                  # if `outname` is not a formatting string - containing %s - append it
}
!/^#/ {                                                           # only process lines not starting with '#' (header/comments in various data files)
  out = sprintf(outname, $col);                                   # format the output file name, given the value in column `col`
  print > out;                                                    # put the line to that file
}

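Note: if you supply multiple input files, only the name of the first file will be used as the output prefix. To support multiple input files with multiple prefixes, you could use FNR == 1 instead and add another variable to distinguish a user-supplied outname from an automatically generated one.

【Comments】: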