How to output data from multiple files in side-by-side columns in one file via awk? Answers

【Question title】: How to output data from multiple files in side-by-side columns in one file via awk?
【Posted】: 2015-08-07 15:33:10
【Question】:

I have 30 files named UE1.dat, UE2.dat, ..., each with 4 columns. Examples of the column structure of UE1.dat and UE2.dat are given below.

UE1.dat

UE2.dat

So I tried the following code:

for((i=1;i<=30;i++)); do awk 'NR$i {printf $1",";next} 1; END {print ""}' UE$i.dat; done > UE_all.dat

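The goal is to take only the first column from each file and write them side by side as columns in a single file; the desired output is shown below.

Unfortunately, the code arranges them as rows instead. Could you give me a hint?

Thank you in advance!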

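【Comments】:

About your output: are you after the first column each time? How should it look for the other files: 30 columns wide, all taken from "column 1" of each file? Does it have to be awk?
No, it doesn't have to be awk, but that is what I used in this case. I have just edited the question, so you can check the desired output there. Thanks!

Tags: awk multiple-columns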

【Solution 1】:

In awk you can do it like this:

1) Put this code into a file named output_data_from_multiple_files.awk:

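BEGIN {
    # All the input files are processed in one run.
    # filenumber counts the number of input files.
    filenumber = 0
    # maxline is the number of lines in the longest input file.
    maxline = 0
}

{
    # FNR is the input record number in the current input file.
    # FNR == 1 means we are processing a new file.
    if (FNR == 1) {
        ++filenumber
    }

    # Append the value of the first column to the output line with the
    # same number; values from later files are separated by one space.
    output[FNR] = (filenumber == 1) ? $1 : (output[FNR] " " $1)

    if (FNR > maxline) {
        maxline = FNR
    }
}

END {
    # Print the output. FNR only reflects the last file read, so loop
    # up to maxline instead.
    for (i = 1; i <= maxline; i++)
        printf("%s\n", output[i])
}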

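2) Run awk -f output_data_from_multiple_files.awk UE*

All the files are processed in a single awk run. FNR is the input record number within the current input file. filenumber counts the number of files processed. The values read from the input files are concatenated in the output array.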

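【Comments】:

It works too, but there are some strange tabs between the different columns. Thanks!
@TrifonGetsov That is strange, the script only inserts a single space when concatenating the values. Well, in fact I did not run it that way; I put the code in a file. I will edit my answer.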

【Solution 2】:

Use awk associative arrays to concatenate all the columns into one file:

# use a wildcard to get all the files (could also use a for-loop)
# add each new row to the array using line number as an index
# at the end of reading all files, go through each index (will be 1-4 in 
# your example) and print index, and then the fully concatenated rows
awk '{a[FNR] = a[FNR]" "$0}END{ for (i in a) print i, a[i] | "sort -k1n"}' allfiles*
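
To see what this one-liner produces, here is a minimal sketch run on two made-up two-line files (the contents are invented for illustration; joined and stripped are hypothetical variable names). Note that it concatenates whole rows ($0) rather than only the first column, and that each output row starts with its line number, which is there so that sort -k1n can restore the order that for (i in a) does not guarantee. The last step, removing that prefix with sed, is an addition here and not part of the original answer.

```shell
#!/bin/sh
# Work in a throwaway directory so no real data is touched.
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1

# Made-up sample data, 4 columns per file.
printf '%s\n' '1 10 20 30' '2 11 21 31' > UE1.dat
printf '%s\n' '2 40 50 60' '4 41 51 61' > UE2.dat

# The one-liner from the answer, applied to the two sample files.
joined=$(awk '{ a[FNR] = a[FNR] " " $0 }
    END { for (i in a) print i, a[i] | "sort -k1n" }' UE1.dat UE2.dat)
printf '%s\n' "$joined"

# Optional follow-up: strip the leading line number and padding spaces.
stripped=$(printf '%s\n' "$joined" | sed 's/^[0-9][0-9]* *//')
printf '%s\n' "$stripped"
```

Replacing $0 with $1 in the awk program would restrict the result to the first column of each file, which is what the question asks for.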

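【Comments】:

【Solution 3】:

I would probably use something like this, with perl rather than awk because I prefer its handling of data structures. In this case we use a two-dimensional array, insert the first column of each file into a new column of the array, and then print the whole thing.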

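#!/usr/bin/env perl
use strict;
use warnings;

my $num_files = 2;    # number of UE<N>.dat files to read (30 in the question)

my @rows;             # $rows[$filenum] holds the first column of file $filenum
my $max = 0;          # line count of the longest file

for my $filenum ( 1..$num_files ) {
    open ( my $input, "<", "UE${filenum}.dat" ) or die $!;
    my $count = 0;    # lines read from this file only
    while ( <$input> ) {
        my @fields = split;
        push ( @{$rows[$filenum]}, $fields[0] );
        $count++;
    }
    close ( $input );
    if ( $count > $max ) { $max = $count };
}

# Print one output row per input line, one column per file.
for ( 1..$max ) {
    foreach my $filenum ( 1..$num_files ) {
       # A missing value prints as an empty string; a literal 0 is kept.
       my $value = shift @{$rows[$filenum]};
       print defined $value ? $value : '', " ";
    }
    print "\n";
}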

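【Comments】:

It gives me: Can't use an undefined value as an ARRAY reference at ./script line 24?? I can't see any inconsistency at line 24.
Ah, yes, sorry, perl arrays start at zero. So you probably need @columns[1..30] (answer edited).
Sorry, edited again. Hopefully this one is a bit better.
Not this time :), it only prints the first three columns, with strange spacing between them. But the previous version worked :)
Here is the first line of the output: $VAR1 = [ undef, [

【Solution 4】:

My solution is this: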

gawk 'BEGINFILE{f++}{print FNR,f,$1}' UE* | sort -k1,1n -k2,2n | cut -d" " -f3 | xargs -L $(ls UE*.dat | wc -l)

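Here is how I got there... I use gawk to number the lines and the files, then use sort to order by line number and then by file number, and finally strip the file and line numbers. So...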

gawk 'BEGINFILE{f++}{print FNR,f,$1}' UE*

1 1 1  # line 1 file 1 is 1
2 1 2  # line 2 file 1 is 2
3 1 3  # line 3 file 1 is 3
4 1 4  # line 4 file 1 is 4
1 2 2  # line 1 file 2 is 2
2 2 4  # line 2 file 2 is 4
3 2 7  # line 3 file 2 is 7
4 2 9  # line 4 file 2 is 9

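Then use sort like this to put line 1 of file 1 first, followed by line 1 of file 2, ..., line 1 of file n, then line 2 of file 1, line 2 of file 2, ..., line 2 of file n, and so on. Then take the third column: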

gawk 'BEGINFILE{f++}{print FNR,f,$1}' UE* | sort -k1,1n -k2,2n | cut -d" " -f3
1
2
2
4
3
7
4
9

Then put them back together with xargs:

gawk 'BEGINFILE{f++}{print FNR,f,$1}' UE* | sort -k1,1n -k2,2n | cut -d" " -f3 | xargs -L2
1 2
2 4
3 7
4 9

The -L2 at the end must match the number of files, i.e. -L30 in your case.
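
The whole pipeline can be tried end to end with the sketch below. It is a variant under stated assumptions, not the answer's exact command: it counts files with FNR == 1 in plain awk instead of gawk's BEGINFILE (which assumes no input file is empty), it passes the files explicitly so that the count given to xargs -L always matches, and it sorts with two separate numeric keys (-k1,1n -k2,2n) so that file numbers above 9 are compared as numbers. The sample first columns match the example above; the other columns, and the names nfiles and table, are made up for illustration.

```shell
#!/bin/sh
# Work in a throwaway directory so no real data is touched.
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1

# Sample inputs with 4 columns each; columns 2-4 are made-up filler.
printf '%s\n' '1 10 20 30' '2 11 21 31' '3 12 22 32' '4 13 23 33' > UE1.dat
printf '%s\n' '2 40 50 60' '4 41 51 61' '7 42 52 62' '9 43 53 63' > UE2.dat

# Input files in the desired column order; $# is then the file count.
set -- UE1.dat UE2.dat
nfiles=$#

# Tag each value with "line file", sort by line and then by file (both
# numerically), keep only the value, and regroup nfiles values per row.
table=$(awk 'FNR == 1 { f++ } { print FNR, f, $1 }' "$@" |
    sort -k1,1n -k2,2n |
    cut -d' ' -f3 |
    xargs -L "$nfiles")

printf '%s\n' "$table"
```

For the two sample files this prints the rows 1 2, 2 4, 3 7 and 4 9. The regrouping step only lines up correctly when every file has the same number of lines, because xargs simply takes the values nfiles at a time.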

【Comments】: