PowerShell / Perl: Merging multiple CSV files into one?

【Question title】: PowerShell / Perl: Merging multiple CSV files into one?
【Posted】: 2011-05-14 02:43:41
【Question description】:

I have the following CSV files and would like to merge them into a single CSV.

01.csv

apples,48,12,7
pear,17,16,2
orange,22,6,1

02.csv

apples,51,8,6
grape,87,42,12
pear,22,3,7

03.csv

apples,11,12,13
grape,81,5,8
pear,11,5,6

04.csv

apples,14,12,8
orange,5,7,9

Desired output:

apples,48,12,7,51,8,6,11,12,13,14,12,8
grape,,,87,42,12,81,5,8,,,
pear,17,16,2,22,3,7,11,5,6,,,
orange,22,6,1,,,,,,5,7,9

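Can anyone offer guidance on how to achieve this? Preferably with PowerShell, but I am open to alternatives such as Perl if that is easier.

Thanks Pantik, the output of your code is close to what I want: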

apples,48,12,7,51,8,6,11,12,13,14,12,8
grape,87,42,12,81,5,8
orange,22,6,1,5,7,9
pear,17,16,2,22,3,7,11,5,6

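Unfortunately, I need "placeholder" commas when an entry does not exist in a CSV file, e.g. orange,22,6,1,,,,,,5,7,9 rather than orange,22,6,1,5,7,9

Update: I would like the files to be parsed in file-name order, for example: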

$myFiles = @(gci *.csv) | sort Name
foreach ($file in $myFiles){

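Regards, Ted

【Question comments】:

It looks like you want the data ordered by file name. For example, orange has empty records from 2.csv and 3.csv. If that is a requirement, you should add it to the question.

Tags: perl powershell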

【Solution 1】:

Here is my Perl version:

use strict;
use warnings;

my $filenum = 0;

my ( %fruits, %data );
foreach my $file ( sort glob("*.csv") ) {

    $filenum++;
    open my $fh, "<", $file or die $!;

    while ( my $line = <$fh> ) {

        chomp $line;

        my ( $fruit, @values ) = split /,/, $line;

        $fruits{$fruit} = 1;

        $data{$filenum}{$fruit} = \@values;
    }

    close $fh;
}
foreach my $fruit ( sort keys %fruits ) {

    print $fruit, ",", join( ",", map { $data{$_}{$fruit} ? @{ $data{$_}{$fruit} } : ",," } 1 .. $filenum ), "\n";
}

This gives me:

apples,48,12,7,51,8,6,11,12,13,14,12,8
grape,,,,87,42,12,81,5,8,,,
orange,22,6,1,,,,,,,5,7,9
pear,17,16,2,22,3,7,11,5,6,,,

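So, do you have a typo in the grape row, or have I misunderstood something?

【Comments】:

However, your default sort won't work for file names beyond 9.csv, because 11.csv will sort before 2.csv.
Thanks gangablass... yes, I had a typo in my desired output; there shouldn't have been a space after grape. Updated.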
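
A minimal sketch of the numeric ordering that the first comment above asks for, written in Python rather than the thread's Perl or PowerShell purely for illustration. The helper name `numeric_key` and the sample file names are hypothetical, and it assumes file names start with digits, as 01.csv through 04.csv do.

```python
import re


def numeric_key(filename):
    """Sort key: the leading integer of a file name, e.g. '11.csv' -> 11."""
    match = re.match(r"(\d+)", filename)
    # Assumption: names without leading digits sort after all numbered files.
    return (0, int(match.group(1))) if match else (1, filename)


names = ["11.csv", "2.csv", "1.csv", "10.csv"]

# Plain string sort compares character by character, so "11" < "2".
print(sorted(names))                   # ['1.csv', '10.csv', '11.csv', '2.csv']

# Sorting on the parsed number restores the intended order.
print(sorted(names, key=numeric_key))  # ['1.csv', '2.csv', '10.csv', '11.csv']
```

Zero-padded names such as 01.csv and 02.csv sort correctly either way, which is why the problem only shows up once the numbering is not padded to a fixed width.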

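【Solution 2】:

OK, gangabas' solution works and is cooler than mine, but I'll add mine anyway. It is slightly stricter, and it keeps a data structure that can be reused afterwards. So, enjoy. ;)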

use strict;
use warnings;

opendir my $dir, '.' or die $!;
my @csv = grep (/^\d+\.csv$/i, readdir $dir);
closedir $dir;
# sorting numerically based on leading digits in filename
@csv = sort {($a=~/^(\d+)/)[0] <=> ($b=~/^(\d+)/)[0]} @csv;

my %data;

# To print empty records we first need to know all the names
for my $file (@csv) {
    open my $fh, '<', $file or die $!;
    while (<$fh>) {
        if (m/^([^,]+),/) {
            @{ $data{$1} } = ();
        }
    }
    close $fh;
}

# Now we can fill in values
for my $file (@csv) {
    open my $fh, '<', $file or die $!;
    my %tmp;
    while (<$fh>) {
        chomp;
        next if (/^\s*$/);
        my ($tag,@values) = split (/,/);
        $tmp{$tag} = \@values;
    }
    for my $key (keys %data) {
        unless (defined $tmp{$key}) {
            # Fill in empty values
            @{$tmp{$key}} = ("","","");
        }
        push @{ $data{$key} }, @{ $tmp{$key} };
    }
}

&myreport; 

sub myreport {
    for my $key (sort keys %data) {
        print "$key," . (join ',', @{$data{$key}}), "\n";
    }
}

【Comments】:

【Solution 3】:

PowerShell:

$produce = "apples","grape","orange","pear"
$produce_hash = @{}
$produce | foreach-object {$produce_hash[$_] = @(,$_)}

$myFiles = @(gci *.csv) | sort Name
 foreach ($file in $myFiles){ 
    $file_hash = @{}
    $produce | foreach-object {$file_hash[$_] = @($null,$null,$null)}
        get-content $file | foreach-object{
            $line = $_.split(",")
            $file_hash[$line[0]] = $line[1..3]
            }
    $produce | foreach-object {
        $produce_hash[$_] += $file_hash[$_]
        }
  }

$ofs = ","
$out = @()
$produce | foreach-object {
 $out += [string]$produce_hash[$_]
 }

$out | out-file "outputfile.csv" 

gc outputfile.csv

apples,48,12,7,51,8,6,11,12,13,14,12,8
grape,,,,87,42,12,81,5,8,,,
orange,22,6,1,,,,,,,5,7,9
pear,17,16,2,22,3,7,11,5,6,,,

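It should be easy to adapt to other items. Just add them to the $produce array.

【Comments】:

Thanks mjolinor, could this be modified so that the items in the $produce array don't have to be entered by hand... since it may not be possible to know in advance what those items will be...
Possible. I see 2 ways: 1 - Read the data twice, using the first pass to collect the unique values of the first element and build the $produce array. 2 - Keep a counter and increment it as each file is processed, so you know how many arrays of $nulls to add in front of the first set of values you get for an item. Which one works best probably depends on how many data files you have and how large they are.
Posted a second solution that populates $produce automatically.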
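
The first option from the comment above (read the data twice) could be sketched roughly as follows. This is Python, not the thread's PowerShell, and it is only an illustration: the function name `merge_two_pass`, the `width` parameter, and the in-memory `files` list are assumptions, and it assumes every row carries exactly three values, as in the question.

```python
def merge_two_pass(files, width=3):
    """files: file contents in file-name order, each a list of 'name,v1,v2,v3' lines."""
    # Pass 1: collect every item name, in order of first appearance.
    names = []
    for lines in files:
        for line in lines:
            if line.strip():
                name = line.split(",")[0]
                if name not in names:
                    names.append(name)

    # Pass 2: for each file, append the item's values or `width` empty placeholders.
    merged = {name: [] for name in names}
    for lines in files:
        seen = {}
        for line in lines:
            if line.strip():
                name, *values = line.split(",")
                seen[name] = values
        for name in names:
            merged[name].extend(seen.get(name, [""] * width))

    return [",".join([name] + merged[name]) for name in names]


# Sample data copied from the question (01.csv through 04.csv).
files = [
    ["apples,48,12,7", "pear,17,16,2", "orange,22,6,1"],
    ["apples,51,8,6", "grape,87,42,12", "pear,22,3,7"],
    ["apples,11,12,13", "grape,81,5,8", "pear,11,5,6"],
    ["apples,14,12,8", "orange,5,7,9"],
]
for row in merge_two_pass(files):
    print(row)
```

Rows come out in order of first appearance (apples, pear, orange, grape); sort the result if alphabetical order is wanted.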

【Solution 4】:

Second PowerShell solution (as requested):

$produce = @()
$produce_hash = @{}
$file_count = -1
$myFiles = @(gci 0*.csv) | sort Name
foreach ($file in $myFiles){
    $file_count ++
    $file_hash = @{}
    get-content $file | foreach-object{
        $line = $_.split(",")

        if ($produce -contains $line[0]){
            $file_hash[$line[0]] += $line[1..3]
        }
        else {
            $produce += $line[0]
            $file_hash[$line[0]] = @(,$line[0]) + (@($null) * 3 *  $file_count) + $line[1..3]
        }
    }
    $produce | foreach-object {
        if ($file_hash[$_]){$produce_hash[$_] += $file_hash[$_]}
        else {$produce_hash[$_] += @(,$null) * 3}
    }
}

$ofs = ","
$out = @()
$produce_hash.keys | foreach-object {
    $out += [string]$produce_hash[$_]
}

$out | out-file "outputfile.csv"

gc outputfile.csv

apples,48,12,7,51,8,6,11,12,13,14,12,8
grape,,,,87,42,12,81,5,8,,,
orange,22,6,1,,,,,,,5,7,9
pear,17,16,2,22,3,7,11,5,6,,,

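【Comments】:

【Solution 5】:

You have to parse the files; I don't see an easier way to do it.

Solution in PowerShell:

Update: OK, tweaked slightly - hopefully it is understandable.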

$items = @{}
$colCount = 0 # total amount of columns
# loop through all files
foreach ($file in (gci *.csv | sort Name))
{
    $content = Get-Content $file
    $itemsToAdd = 0; # columns added by this file
    foreach ($line in $content)
    {
        if ($line -match "^(?<group>\w+),(?<value>.*)") 
        { 
            $group = $matches["group"]
            if (-not $items.ContainsKey($group)) 
            {   # in case the row doesn't exists add and fill with empty columns
                $items.Add($group, @()) 
                for($i = 0; $i -lt $colCount; $i++) { $items[$group] += "" }
            }

            # add new values to correct row
            $matches["value"].Split(",") | foreach { $items[$group] += $_ }
            $itemsToAdd = ($matches["value"].Split(",") | measure).Count # saves col count
        } 
    }

    # in case that file didn't contain some row, add empty cols for those rows
    $colCount += $itemsToAdd
    $toAddEmpty = @()
    $items.Keys | ? { (($items[$_] | measure).Count -lt $colCount) } | foreach { $toAddEmpty += $_ }
    foreach ($key in $toAddEmpty) 
    {   
        for($i = 0; $i -lt $itemsToAdd; $i++) { $items[$key] += "" }
    }
}

# output
Remove-Item "output.csv" -ea 0
foreach ($key in $items.Keys)
{
    "$key,{0}" -f [string]::Join(",", $items[$key]) | Add-Content "output.csv"
}

Output:

apples,48,12,7,51,8,6,11,12,13,14,12,8
grape,,,,87,42,12,81,5,8,,,
orange,22,6,1,,,,,,,5,7,9
pear,17,16,2,22,3,7,11,5,6,,,

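【Comments】:

Thanks for the effort PantikT, much appreciated - please see the update to my question for feedback, as this doesn't quite produce the output I'm looking for.

【Solution 6】:

Here is a more concise approach. However, it still does not add the placeholder commas when an item is missing from a file.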

Get-ChildItem D:\temp\a\ *.csv | 
    Get-Content |
    ForEach-Object -begin { $result=@{} } -process {
        $name, $otherCols = $_ -split '(?<=\w+),'
        if (!$result[$name]) { $result[$name] = @() }
        $result[$name] += $otherCols
    } -end {
        $result.GetEnumerator() | % {
            "{0},{1}" -f $_.Key, ($_.Value -join ",")
        }
    } | Sort
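
The missing placeholders could be added in a single pass by tracking how many files have been processed so far, which is the counter idea suggested in the comments on Solution 3. The sketch below is in Python rather than PowerShell and is only an illustration: `merge_single_pass`, `width`, and the in-memory `files` list are hypothetical names, and it assumes each row has exactly three values and each item appears at most once per file.

```python
def merge_single_pass(files, width=3):
    """files: file contents in file-name order, each a list of 'name,v1,v2,v3' lines."""
    merged = {}  # item name -> values collected so far
    for index, lines in enumerate(files):
        for line in lines:
            if not line.strip():
                continue
            name, *values = line.split(",")
            if name not in merged:
                # First sighting: back-fill placeholders for all earlier files.
                merged[name] = [""] * (width * index)
            merged[name].extend(values)

        # Pad every row that this file did not mention up to the current width.
        target = width * (index + 1)
        for row in merged.values():
            row.extend([""] * (target - len(row)))

    # Sort by item name, as the pipeline above does with Sort.
    return [",".join([name] + row) for name, row in sorted(merged.items())]


# Sample data copied from the question (01.csv through 04.csv).
files = [
    ["apples,48,12,7", "pear,17,16,2", "orange,22,6,1"],
    ["apples,51,8,6", "grape,87,42,12", "pear,22,3,7"],
    ["apples,11,12,13", "grape,81,5,8", "pear,11,5,6"],
    ["apples,14,12,8", "orange,5,7,9"],
]
for row in merge_single_pass(files):
    print(row)
```

Compared with reading the data twice, this touches each file only once, at the cost of re-checking every known row after each file.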

【Comments】: