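Partially transposing a CSV

【Question title】: Partially transposing a CSV
【Posted】: 2014-09-17 01:47:28
【Question description】:

I have a table export in CSV format, but the structure of the original table doesn't suit my purposes: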

id,step,field_name,field_value
3,0,field_3,9.43
1,6,field_1,447.74
1,0,field_1,239.09
1,3,field_3,135.84
1,5,field_2,277.33
1,1,field_2,758.71
1,6,field_2,52.14
1,6,field_4,12.24
3,2,field_4,539.89
2,0,field_5,"Smith, John"
1,2,field_4,670.92
2,1,field_3,142.95
3,2,field_2,451.72
1,1,field_3,281.1
1,4,field_2,103.95
1,6,field_3,549.54
1,6,field_5,"Doe, John"
1,2,field_1,5.34
4,0,field_2,1.32
1,7,field_1,94.85
3,1,field_1,90.43
3,2,field_3,578.68
3,2,field_5,"Roe, Jane"
1,1,field_1,5.4
2,0,field_4,507.95

Assuming field_name only takes the values field_1 through field_5, I need my data to look like this (the final order doesn't matter):

id,step,field_1,field_2,field_3,field_4,field_5
1,0,239.09,,,,
1,1,5.4,758.71,281.1,,
1,2,5.34,,,670.92,
1,3,,,135.84,,
1,4,,103.95,,,
1,5,,277.33,,,
1,6,447.74,52.14,549.54,12.24,"Doe, John"
1,7,94.85,,,,
2,0,,,,507.95,"Smith, John"
2,1,,,142.95,,
3,0,,,9.43,,
3,1,90.43,,,,
3,2,,451.72,578.68,539.89,"Roe, Jane"
4,0,,1.32,,,

My first step was to sort the file so that I can transpose blocks of rows:

sort -t, -k1,1n -k2,2n -o sample.csv sample.csv

Now I'm trying to build a Perl script to do the job, but I'm new to Perl. Here is my (terrible) attempt:

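use strict;
use warnings;
use 5.010;
use File::Copy;
use Text::CSV;

my $csv = Text::CSV->new({
    binary => 1,
    auto_diag => 1,
    eol => $/,
}) or die 'Cannot use CSV: ' . Text::CSV->error_diag();

my $file = 'sample.csv';
my $backup = "$file.bak";
# Parentheses matter here: without them "or die" binds to $backup, not to copy.
copy($file, $backup) or die "Copy failed: $!";

open my $in_fh, '<', $backup or die "$backup: $!";
open my $out_fh, '>', $file or die "$file: $!";

# Keep id and step from the header; replace the name/value pair with field_1..field_5.
my $header = $csv->getline($in_fh);
$csv->print($out_fh, [ @$header[0, 1], map { "field_$_" } 1 .. 5 ]);

my @text;    # output row being built for the current (id, step) block
while (my $row = $csv->getline($in_fh)) {
    # A change of id or step ends the block: write it out and start a new one.
    if (@text and ($text[0] != $row->[0] or $text[1] != $row->[1])) {
        $csv->print($out_fh, \@text);
        @text = ();
    }
    @text = (@$row[0, 1], ('') x 5) unless @text;
    my $pos = substr $row->[2], -1;    # field_3 -> 3
    $text[$pos + 1] = $row->[3];       # field_N goes into column N + 1
}
$csv->print($out_fh, \@text) if @text;    # write the final block

close $in_fh;
close $out_fh or die "$file: $!";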

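【Question discussion】:

Tags: perl csv

【Solution 1】:

The following works even if you have fields beyond field_5, although it assumes you want them sorted naturally. The data doesn't have to be sorted beforehand; however, everything is stored in a hash, so this will use a lot of memory if your CSV is large. I'm just printing to STDOUT, but you could easily modify it to print to a file.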

use strict;
use warnings;

use Sort::Naturally;
use Text::CSV;

my $csv = Text::CSV->new({
    binary => 1,
    auto_diag => 1,
    eol => $/,
}) or die 'Cannot use CSV: ' . Text::CSV->error_diag();

my $fh = \*DATA;

my $header = $csv->getline($fh);

my (%data, %fields);
while ( my $row = $csv->getline($fh) ) {
    $data{ $row->[0] }{ $row->[1] }{ $row->[2] } = $row->[3];

    # Keep track of unique field names
    $fields{ $row->[2] } = 1;
}

# Order the additional columns
my @sorted = nsort keys %fields;

# Print header
$csv->print(\*STDOUT, [ $header->[0], $header->[1], @sorted ]);

foreach my $id ( sort { $a <=> $b } keys %data ) {
    foreach my $step ( sort { $a <=> $b } keys %{ $data{$id} } ) {
        my $results = [ $id, $step, @{ $data{$id}{$step} }{ @sorted } ];
        $csv->print(\*STDOUT, $results);
    }
}

__DATA__
id,step,field_name,field_value
3,0,field_3,9.43
1,6,field_1,447.74
1,0,field_1,239.09
1,3,field_3,135.84
1,5,field_2,277.33
1,1,field_2,758.71
1,6,field_2,52.14
1,6,field_4,12.24
3,2,field_4,539.89
2,0,field_5,"Smith, John"
1,2,field_4,670.92
2,1,field_3,142.95
3,2,field_2,451.72
1,1,field_3,281.1
1,4,field_2,103.95
1,6,field_3,549.54
1,6,field_5,"Doe, John"
1,2,field_1,5.34
4,0,field_2,1.32
1,7,field_1,94.85
3,1,field_1,90.43
3,2,field_3,578.68
3,2,field_5,"Roe, Jane"
1,1,field_1,5.4
2,0,field_4,507.95

Output:

id,step,field_1,field_2,field_3,field_4,field_5
1,0,239.09,,,,
1,1,5.4,758.71,281.1,,
1,2,5.34,,,670.92,
1,3,,,135.84,,
1,4,,103.95,,,
1,5,,277.33,,,
1,6,447.74,52.14,549.54,12.24,"Doe, John"
1,7,94.85,,,,
2,0,,,,507.95,"Smith, John"
2,1,,,142.95,,
3,0,,,9.43,,
3,1,90.43,,,,
3,2,,451.72,578.68,539.89,"Roe, Jane"
4,0,,1.32,,,
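
The row-building line in the script above relies on a hash slice, which may be unfamiliar if you are new to Perl. Here is a minimal sketch of just that step, using core Perl only; %step_data and @columns are made-up names holding hypothetical values for a single (id, step) pair:

```perl
use strict;
use warnings;

# Hypothetical data for one (id, step) pair: only two of the five fields exist.
my %step_data = ( field_2 => '758.71', field_1 => '5.4' );
my @columns   = map { "field_$_" } 1 .. 5;

# A hash slice returns one value per requested key, in the order requested.
# Keys that are missing from the hash come back as undef.
my @cells = @step_data{@columns};

# Joined by hand here only to make the result visible; undef becomes an empty cell.
my $line = join ',', map { defined $_ ? $_ : '' } @cells;
print "$line\n";    # prints 5.4,758.71,,,
```

In the full script the slice is passed straight to $csv->print, and Text::CSV writes each undefined field as an empty cell, which is where the runs of commas in the output come from.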

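【Discussion】:

Thanks, this works great, and it even works out the field names!
@ThisSuitIsBlackNot I'd strongly recommend dropping Sort::Naturally in favour of Sort::Key::Natural. You can observe the former's implementation bug by trying to sort the following list: qw(1:200 2:7).
@Miller Looking back at the documentation for Sort::Naturally, I see that "\W substrings (neither word characters nor digits) are ignored." Thanks for pointing that out; I'll have to update my answer tomorrow.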
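
If the field names always end in a number, as in the sample data, one way to sidestep the choice of sorting module altogether is to sort on that numeric suffix with core Perl. This is only a sketch under that assumption; sort_by_suffix is a hypothetical helper and is not part of either module:

```perl
use strict;
use warnings;

# Hypothetical helper: order names such as field_2 and field_10 by their
# trailing number, falling back to plain string order on ties.
# Names without a trailing number are treated as 0 and therefore sort first.
sub sort_by_suffix {
    my @names = @_;
    return map  { $_->[0] }
           sort { $a->[1] <=> $b->[1] or $a->[0] cmp $b->[0] }
           map  { [ $_, /(\d+)$/ ? $1 : 0 ] } @names;
}

my @sorted = sort_by_suffix(qw(field_10 field_2 field_1 field_5));
print "@sorted\n";    # prints field_1 field_2 field_5 field_10
```

It is much less general than a natural sort, but it has no dependencies and its behaviour is easy to predict.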

【Solution 2】:

I was actually going to say that you might want to skip using Text::CSV for this, and do this instead:

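while ( my $line = <$input_fh> ) {
    chomp $line;
    # The value may itself contain commas, so collect every piece after the third comma.
    my ( $id, $step, $field_name, @field_values ) = split /,/, $line;
    print {$output_fh} "$id,$step,";
    # Pad with one empty column for each field that comes before this one.
    if ( $field_name eq "field_2" ) { print {$output_fh} "," }
    if ( $field_name eq "field_3" ) { print {$output_fh} ",," }

    # etc.

    # Re-join the pieces of the value that the split pulled apart.
    print {$output_fh} join( ",", @field_values ), "\n";
}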

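You could probably use a lookup table to find the column number for each field name, but I'm not sure how much of an improvement that would be.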
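
For what it's worth, the lookup-table idea might look roughly like the sketch below. It assumes exactly five fields named field_1 to field_5, and it uses the core module Text::ParseWords to keep quoted commas together. That module is not a full CSV parser (for example, it does not understand doubled quotes), and widen and %column_of are made-up names for illustration:

```perl
use strict;
use warnings;
use Text::ParseWords qw(parse_line);    # core module; respects double quotes

# Lookup table: field name -> zero-based column in the output row
# (columns 0 and 1 are id and step).
my @fields    = map { "field_$_" } 1 .. 5;
my %column_of = map { ( $fields[$_] => $_ + 2 ) } 0 .. $#fields;

# Hypothetical helper: turn one input line into one padded output line.
sub widen {
    my ($line) = @_;
    # keep = 1 leaves the surrounding quotes in place, so a quoted value
    # can be written back unchanged.
    my ( $id, $step, $name, $value ) = parse_line( ',', 1, $line );
    my @row = ( $id, $step, ('') x @fields );
    $row[ $column_of{$name} ] = $value;
    return join ',', @row;
}

print widen('2,0,field_5,"Smith, John"'), "\n";    # prints 2,0,,,,,"Smith, John"
print widen('1,3,field_3,135.84'), "\n";           # prints 1,3,,,135.84,,
```

Like the loop above, this still emits one output row per input row; merging rows that share an id and step needs an extra pass or a hash.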

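【Discussion】:

You can't split on commas, because the data contains commas.
Ah, fair point, yes. It will probably still need Text::CSV to parse the input stream. I've amended the workaround based on the sample data, though.
Your edit doesn't correctly quote fields that contain commas, so some rows will have extra columns.

【Solution 3】:

Here is an approach using Perl hashes: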

#!/usr/bin/perl

use strict;
use warnings;

my %records;

# Build the hash; it doesn't matter if there are holes in the steps.
while ( my $line = <> ) {
    chomp $line;

    # Naive comma split: a value that itself contains commas arrives in
    # several pieces, which are glued back together below.
    my ( $id, $step, $field_name, @field_value ) = split /,/, $line;
    my ( undef, $field_number ) = split /_/, $field_name;
    next unless $field_number =~ /^\d+$/;    # skip the header row

    $records{"$id,$step"}{$field_number} = join ',', @field_value;
}

foreach my $id_step ( keys %records ) {
    # For every field, use the stored value if there is one, otherwise an empty field.
    my @values = map { $records{$id_step}{$_} // '' } 1 .. 5;
    print join( ',', $id_step, @values ), "\n";
}

When run through sort, it gives the following output:

1,0,239.09,,,,
1,1,5.4,758.71,281.1,,
1,2,5.34,,,670.92,
1,3,,,135.84,,
1,4,,103.95,,,
1,5,,277.33,,,
1,6,447.74,52.14,549.54,12.24,"Doe, John"
1,7,94.85,,,,
2,0,,,,507.95,"Smith, John"
2,1,,,142.95,,
3,0,,,9.43,,
3,1,90.43,,,,
3,2,,451.72,578.68,539.89,"Roe, Jane"
4,0,,1.32,,,

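【Discussion】:

The output doesn't match the OP's expected output. You aren't producing valid CSV; every row should have the same number of columns. There is also nothing to indicate which field a particular value belongs to. Does the 670.92 in the second row correspond to field_2, field_3, field_4, or field_5?
Your output now has too many columns. You also haven't included the header row.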