【Question title】: Partially transposing a CSV
【Posted】: 2014-09-17 01:47:28
【Question】:

I have a table export in CSV format, but the structure of the original table doesn't suit my purposes:

id,step,field_name,field_value
3,0,field_3,9.43
1,6,field_1,447.74
1,0,field_1,239.09
1,3,field_3,135.84
1,5,field_2,277.33
1,1,field_2,758.71
1,6,field_2,52.14
1,6,field_4,12.24
3,2,field_4,539.89
2,0,field_5,"Smith, John"
1,2,field_4,670.92
2,1,field_3,142.95
3,2,field_2,451.72
1,1,field_3,281.1
1,4,field_2,103.95
1,6,field_3,549.54
1,6,field_5,"Doe, John"
1,2,field_1,5.34
4,0,field_2,1.32
1,7,field_1,94.85
3,1,field_1,90.43
3,2,field_3,578.68
3,2,field_5,"Roe, Jane"
1,1,field_1,5.4
2,0,field_4,507.95

Assuming field_name only takes the values field_1 through field_5, I need my data to look like this (the final ordering doesn't matter):

id,step,field_1,field_2,field_3,field_4,field_5
1,0,239.09,,,,
1,1,5.4,758.71,281.1,,
1,2,5.34,,,670.92,
1,3,,,135.84,,
1,4,,103.95,,,
1,5,,277.33,,,
1,6,447.74,52.14,549.54,12.24,"Doe, John"
1,7,94.85,,,,
2,0,,,,507.95,"Smith, John"
2,1,,,142.95,,
3,0,,,9.43,,
3,1,90.43,,,,
3,2,,451.72,578.68,539.89,"Roe, Jane"
4,0,,1.32,,,

My first step was to sort the file so that I could transpose blocks of rows:

sort -t, -k1,1n -k2,2n -o sample.csv sample.csv

Now I'm trying to build a Perl script to do the job, but I'm new to Perl. Here is my (horrible) attempt:

use strict;
use warnings;
use 5.010;
use File::Copy;
use Text::CSV;

my $csv = Text::CSV->new({
    binary => 1,
    auto_diag => 1,
    eol => $/,
    always_quote => 1
}) or die 'Cannot use CSV: ' . Text::CSV->error_diag();

my $file = 'sample.csv';
my $backup = "$file.bak";
copy $file, $backup or die "Copy failed: $!";

open my $in_fh, '<', $backup or die "$backup: $!";
open my $out_fh, '>', $file or die "$file: $!";

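# Replace the name/value header columns with field_1 .. field_5
my $header = $csv->getline($in_fh);
$csv->print($out_fh, [ @$header[0,1], map { "field_$_" } 1 .. 5 ]);

my $row = $csv->getline($in_fh);
while ($row) {
    # Start one output row for the current (id, step) block
    my ($id, $step) = @$row[0,1];
    my @text = ($id, $step, ('') x 5);
    # Consume every input row that belongs to the same block
    while ($row && $row->[0] == $id && $row->[1] == $step) {
        my $pos = substr $row->[2], -1;    # trailing digit of field_N
        $text[$pos + 1] = $row->[3];
        $row = $csv->getline($in_fh);
    }
    $csv->print($out_fh, \@text);
}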

close $in_fh;
close $out_fh;

【Question comments】:

    Tags: perl csv


    【Solution 1】:

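    The following works even if you have fields beyond field_5, although it assumes you want them sorted naturally. The data doesn't have to be sorted ahead of time; however, everything is stored in a hash, so this will use a lot of memory if your CSV is large. I'm just printing to STDOUT, but you can easily modify it to print to a file.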

    use strict;
    use warnings;
    
    use Sort::Naturally;
    use Text::CSV;
    
    my $csv = Text::CSV->new({
        binary => 1,
        auto_diag => 1,
        eol => $/,
    }) or die 'Cannot use CSV: ' . Text::CSV->error_diag();
    
    my $fh = \*DATA;
    
    my $header = $csv->getline($fh);
    
    my (%data, %fields);
    while ( my $row = $csv->getline($fh) ) {
        $data{ $row->[0] }{ $row->[1] }{ $row->[2] } = $row->[3];
    
        # Keep track of unique field names
        $fields{ $row->[2] } = 1;
    }
    
    # Order the additional columns
    my @sorted = nsort keys %fields;
    
    # Print header
    $csv->print(\*STDOUT, [ $header->[0], $header->[1], @sorted ]);
    
    foreach my $id ( sort { $a <=> $b } keys %data ) {
        foreach my $step ( sort { $a <=> $b } keys %{ $data{$id} } ) {
            my $results = [ $id, $step, @{ $data{$id}{$step} }{ @sorted } ];
            $csv->print(\*STDOUT, $results);
        }
    }
    
    __DATA__
    id,step,field_name,field_value
    3,0,field_3,9.43
    1,6,field_1,447.74
    1,0,field_1,239.09
    1,3,field_3,135.84
    1,5,field_2,277.33
    1,1,field_2,758.71
    1,6,field_2,52.14
    1,6,field_4,12.24
    3,2,field_4,539.89
    2,0,field_5,"Smith, John"
    1,2,field_4,670.92
    2,1,field_3,142.95
    3,2,field_2,451.72
    1,1,field_3,281.1
    1,4,field_2,103.95
    1,6,field_3,549.54
    1,6,field_5,"Doe, John"
    1,2,field_1,5.34
    4,0,field_2,1.32
    1,7,field_1,94.85
    3,1,field_1,90.43
    3,2,field_3,578.68
    3,2,field_5,"Roe, Jane"
    1,1,field_1,5.4
    2,0,field_4,507.95
    

    Output:

    id,step,field_1,field_2,field_3,field_4,field_5
    1,0,239.09,,,,
    1,1,5.4,758.71,281.1,,
    1,2,5.34,,,670.92,
    1,3,,,135.84,,
    1,4,,103.95,,,
    1,5,,277.33,,,
    1,6,447.74,52.14,549.54,12.24,"Doe, John"
    1,7,94.85,,,,
    2,0,,,,507.95,"Smith, John"
    2,1,,,142.95,,
    3,0,,,9.43,,
    3,1,90.43,,,,
    3,2,,451.72,578.68,539.89,"Roe, Jane"
    4,0,,1.32,,,
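
On the memory concern raised in this answer: if the input is already sorted by id and then step (as the question's sort step intends), only one (id, step) group needs to be held at a time. The following is a minimal sketch of that idea, not part of the original answer. It assumes the field names are known in advance (field_1 through field_5), and `pivot_sorted`, `$next_row`, and `$emit` are hypothetical names introduced here for illustration.

```perl
use strict;
use warnings;

# Assumption: the field names are known up front, per the question.
my @field_names = map { "field_$_" } 1 .. 5;

# pivot_sorted is a hypothetical helper, not part of Text::CSV.
#   $next_row: code ref returning the next [id, step, name, value] row, or undef at the end
#   $emit:     code ref called once per finished [id, step, field_1 .. field_5] row
sub pivot_sorted {
    my ($next_row, $emit) = @_;
    my ($key, %values);    # only the current (id, step) group is kept in memory

    my $flush = sub {
        return unless defined $key;
        $emit->([ @$key, map { $values{$_} // '' } @field_names ]);
        %values = ();
    };

    while (my $row = $next_row->()) {
        my ($id, $step, $name, $value) = @$row;
        if (!defined $key || $key->[0] != $id || $key->[1] != $step) {
            $flush->();               # the previous group is complete
            $key = [ $id, $step ];
        }
        $values{$name} = $value;
    }
    $flush->();                       # emit the final group
}

# Demo on a few rows that are already sorted by id, then step.
my @input = (
    [ 1, 0, 'field_1', '239.09' ],
    [ 1, 1, 'field_1', '5.4' ],
    [ 1, 1, 'field_2', '758.71' ],
    [ 2, 0, 'field_5', 'Smith, John' ],
);
my @output;
pivot_sorted(sub { shift @input }, sub { push @output, $_[0] });
print join('|', @$_), "\n" for @output;
```

With the Text::CSV object from the answer, the two callbacks could plausibly be `sub { $csv->getline($fh) }` and `sub { $csv->print(\*STDOUT, $_[0]) }`. The trade-off is that new field names cannot be discovered on the fly, since the header has to be written before the data is seen.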
    

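    【Comments】:

    • Thanks, this is great, it even works out the field names!
    • @ThisSuitIsBlackNot I'd strongly recommend dropping Sort::Naturally in favor of Sort::Key::Natural. You can observe the former's implementation bug by trying to sort the following list: qw(1:200 2:7)
    • @Miller Looking back at the Sort::Naturally documentation, I see that "\W substrings (neither word characters nor digits) are ignored." Thanks for pointing that out, I'll have to update my answer tomorrow.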
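
For readers who would rather not depend on either sorting module, here is a minimal sketch of a natural comparison in core Perl. It is an illustration only and does not claim to reproduce the behavior of Sort::Naturally or Sort::Key::Natural; `natural_cmp` is a hypothetical name. It compares digit runs numerically and everything else, punctuation included, as plain text.

```perl
use strict;
use warnings;

# natural_cmp is a hypothetical helper: it splits both strings into runs of
# digits and runs of non-digits, then compares them pair by pair.
sub natural_cmp {
    my ($x, $y) = @_;
    my @x = $x =~ /(\d+|\D+)/g;
    my @y = $y =~ /(\d+|\D+)/g;
    while (@x && @y) {
        my ($p, $q) = (shift @x, shift @y);
        # Two digit runs compare as numbers, anything else as strings.
        my $c = ($p =~ /^\d/ && $q =~ /^\d/) ? ($p <=> $q) : ($p cmp $q);
        return $c if $c;
    }
    return @x <=> @y;    # the string with chunks left over sorts last
}

my @fields = sort { natural_cmp($a, $b) } qw(field_10 field_2 field_1);
print "@fields\n";    # prints "field_1 field_2 field_10"
```

Because non-digit runs are compared rather than skipped, a list such as qw(1:200 2:7) is ordered by its leading numbers here.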
    【Solution 2】:

    I'd actually say you might want to skip using Text::CSV for this and instead do something like this:

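    while ( <$input_fh> ) {
        chomp;    # strip the newline so it is not printed twice
        my ( $id, $step, $field_name, @field_values ) = split ( /,/ );
        print {$output_fh} "$id,$step,";
        # Pad with one empty column for each field that precedes this one
        if ( $field_name eq "field_2" ) { print {$output_fh} "," };
        if ( $field_name eq "field_3" ) { print {$output_fh} ",," };

        #etc.

        # Rejoin the pieces of a value that itself contained commas
        print {$output_fh} join(",", @field_values),"\n";
    }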

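    You could probably use a lookup table to find the column number for a field name, but I'm not sure how much of an improvement that would be.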
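
The lookup-table idea mentioned above could be sketched as follows. This is an illustration under stated assumptions rather than the answerer's code: the column order field_1 through field_5 is taken from the question, and `padded_row` and `%column_for` are hypothetical names.

```perl
use strict;
use warnings;

# Assumed column order, per the question: field_1 .. field_5.
my @names = map { "field_$_" } 1 .. 5;

# Lookup table: field name => zero-based position among the field columns.
my %column_for;
@column_for{@names} = 0 .. $#names;

# padded_row is a hypothetical helper: it returns one output row with the
# value in its own column and empty strings everywhere else.
sub padded_row {
    my ($id, $step, $field_name, $value) = @_;
    die "Unknown field name: $field_name\n" unless exists $column_for{$field_name};
    my @cells = ('') x @names;
    $cells[ $column_for{$field_name} ] = $value;
    return [ $id, $step, @cells ];
}

my $row = padded_row(1, 2, 'field_4', '670.92');
print join(',', @$row), "\n";    # prints "1,2,,,,670.92,"
```

This replaces the chain of `if` statements and always yields seven columns, but it still produces one output row per input row. Values containing commas would also need proper quoting on output, which a plain `join` does not provide.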

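    【Comments】:

    • You can't split on commas, because the data contains commas.
    • Ah, fair point, yes. Text::CSV is probably still needed to parse the input stream. I've amended the approach based on the sample data, though.
    • Your edit doesn't properly quote the fields that contain commas, so some rows will have extra columns.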
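
The point about commas inside the data can be seen on a single line of the sample input. The sketch below uses `parse_line` from the core module Text::ParseWords purely as a dependency-free stand-in for a quote-aware parser. It is not a full CSV parser (for instance, it also treats single quotes and backslashes specially), so Text::CSV, as suggested in the thread, remains the more robust choice.

```perl
use strict;
use warnings;
use Text::ParseWords qw(parse_line);

my $line = '2,0,field_5,"Smith, John"';

# A plain split cuts the quoted name in two.
my @naive = split /,/, $line;

# parse_line treats the quoted section as a single field; the second
# argument (0) tells it to drop the surrounding quotes.
my @parsed = parse_line(',', 0, $line);

printf "split:      %d fields\n", scalar @naive;
printf "parse_line: %d fields, last is '%s'\n", scalar @parsed, $parsed[-1];
```

The naive split reports five fields, while the quote-aware parse reports the expected four, with `Smith, John` intact as the last one.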
    【Solution 3】:

    Here is an approach using Perl hashes:

    #!/usr/bin/perl

    use strict;
    use warnings;

    my %records ;

    # Build the hash; it doesn't matter if there are holes in the steps.
    while ( <> ) {
      chomp;
      next if $. == 1 ;   # skip the header row

      # A quoted value containing commas is split as well, so its pieces are rejoined below.
      my ($id,$step,$field_name,@field_value) = split(",") ;
      my ($garbage, $field_number) = split("_", $field_name) ;

      $records{$id.",".$step}{$field_number} = join(",",@field_value );

    }

    foreach my $id_step (keys %records) {
      my $line = "$id_step," ;

      # For every field, append the stored value if there is one, otherwise leave the field empty.
      for my $field_number (1 .. 5) {
        $line .= $records{$id_step}{$field_number} if exists $records{$id_step}{$field_number} ;
        $line .= "," ;
      }

      chop $line ;   # drop the trailing comma
      print $line . "\n" ;
    }

    

    When piped through sort, this gives the following output:

    1,0,239.09,,,,
    1,1,5.4,758.71,281.1,,
    1,2,5.34,,,670.92,
    1,3,,,135.84,,
    1,4,,103.95,,,
    1,5,,277.33,,,
    1,6,447.74,52.14,549.54,12.24,"Doe, John"
    1,7,94.85,,,,
    2,0,,,,507.95,"Smith, John"
    2,1,,,142.95,,
    3,0,,,9.43,,
    3,1,90.43,,,,
    3,2,,451.72,578.68,539.89,"Roe, Jane"
    4,0,,1.32,,,
    

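    【Comments】:

    • The output doesn't match the OP's expected output. You're not producing valid CSV; every row should have the same number of columns. There's also no indication of which field a particular value belongs to. Does the 670.92 in the second row correspond to field_2, field_3, field_4, or field_5?
    • You now have too many columns in your output. You also haven't included the header row.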
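
The requirement that every row have the same number of columns is easy to check mechanically. Below is a minimal sketch of such a check. `column_counts` is a hypothetical helper, the last sample line is an invented ragged row for illustration, and `parse_line` from the core Text::ParseWords module again stands in for a real CSV parser.

```perl
use strict;
use warnings;
use Text::ParseWords qw(parse_line);

# column_counts is a hypothetical helper: it maps each distinct column count
# to the number of lines that have it. A consistent file yields a single key.
sub column_counts {
    my @lines = @_;
    my %seen;
    for my $line (@lines) {
        # Keep the quotes (second argument 1); only the field count matters here.
        my @fields = parse_line(',', 1, $line);
        $seen{ scalar @fields }++;
    }
    return %seen;
}

my %counts = column_counts(
    'id,step,field_1,field_2,field_3,field_4,field_5',
    '1,6,447.74,52.14,549.54,12.24,"Doe, John"',
    '4,0,,1.32,,,',
    '1,2,5.34,670.92',    # invented ragged row: only 4 columns
);
print "$_ columns: $counts{$_} line(s)\n" for sort { $a <=> $b } keys %counts;
```

More than one distinct count in the result indicates ragged rows, which is the problem the comments describe.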