Gather the data with similar columns: answers

【Question title】: Gather the data with similar columns
【Posted】: 2012-03-16 21:26:36
【Question description】:

I want to filter data from a text file on Unix. I have a text file like the following:

How can I use awk to turn the above data into the following?

A 200 100
B 300 600 700
C 400

I am not very good at awk, but I believe awk/perl is best suited for this.

【Question discussion】:

Tags: perl unix sed awk

【Solution 1】:

awk 'END {
  for (R in r) 
    print R, r[R]
  }
{
  r[$1] = $1 in r ? r[$1] OFS $2 : $2
  }' infile

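If the order of the values in the first field matters, more code will be needed. The solution will depend on your awk implementation and version.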
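One way to keep the keys in order of first appearance is to record each new key in a second, integer-indexed array and replay that array in END. This is only a sketch: the wrapper name group_in_order, the names order and n, and the sample input are assumptions made for illustration, not part of the original answer.

```shell
# Hypothetical helper: group values by key, keeping first-seen key order.
group_in_order() {
  awk '
    # Skip blank or incomplete lines.
    NF < 2 { next }
    # First occurrence of a key: remember its position and start its list.
    !($1 in r) { order[++n] = $1; r[$1] = $2; next }
    # Key seen before: append the value to its list.
    { r[$1] = r[$1] OFS $2 }
    END {
      for (i = 1; i <= n; i++)
        print order[i], r[order[i]]
    }
  ' "$@"
}

# Sample input, assumed for illustration.
printf '%s\n' 'B 300' 'A 200' 'C 400' 'A 100' 'B 600' 'B 700' | group_in_order
# prints:
# B 300 600 700
# A 200 100
# C 400
```

Because the output order comes from the integer index rather than from for (R in r), it should not depend on how a particular awk walks its arrays.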

Explanation:

r[$1] = $1 in r ? r[$1] OFS $2 : $2

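This sets the value of element $1 of array r as follows:

If the key $1 already exists ($1 in r), append OFS $2 to the existing value.
Otherwise, set it to the value of $2.

The form expression ? if true : if false is the ternary operator. See ternary operation for more information.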
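For readers less used to the ternary form, the same update can be spelled out with if/else. The following is a sketch for illustration only; the wrapper name group_by_key and the sample input are assumptions.

```shell
# Hypothetical helper: same grouping as above, written with if/else.
group_by_key() {
  awk '
    {
      if ($1 in r)
        r[$1] = r[$1] OFS $2   # key exists: append separator and value
      else
        r[$1] = $2             # new key: start with the first value
    }
    END { for (k in r) print k, r[k] }
  ' "$@"
}

# for (k in r) visits keys in an unspecified order, so sort the result.
printf '%s\n' 'A 200' 'B 300' 'A 100' | group_by_key | sort
# prints:
# A 200 100
# B 300
```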

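【Discussion】:

Could you please also explain the line r[$1] = $1 in r ? r[$1] OFS $2 : $2
Hi @peter, I have added an explanation.
First guess: you are on Solaris. If that is the case, you should use nawk or /usr/xpg4/bin/awk rather than /usr/bin/awk.
Good guess... I am indeed on Solaris. Since I had downvoted earlier, I am accepting your answer even though the other answers are also correct. Using nawk solved the problem.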

【Solution 2】:

You can do it like this, though with Perl there is always more than one way to do it:

my %hash; 
while(<>) { 
    my($letter, $int) = split(" "); 
    push @{ $hash{$letter} }, $int;
} 

for my $key (sort keys %hash) {
    print "$key " . join(" ", @{ $hash{$key} }) . "\n";
}

It should work like this:

$ cat data.txt | perl script.pl
A 200 100
B 300 600 700
C 400

【Discussion】:

One-liner: perl -i.bk -e 'my %rows; while (<>) { chomp; if ($_ ne ""){ my ($id, $value) = split(); print "$id $value\n"; push (@{$rows{$id}}, $value); }} foreach (keys %rows){print $_ . " @{$rows{$_}}\n"; }' data.txt

【Solution 3】:

Not language-specific. It is more like pseudocode, but the idea is this:

- Get all lines in an array
- Set a target dictionary of arrays

- Go through the array :
       - Split the string using ' '(space) as the delimiter, into array parts
       - Check whether there is already a dictionary entry for `parts[0]` (e.g. 'A').
         If not, create it.
       - Add `parts[1]` (e.g. 100) to `dictionary(parts[0])`

That's it! :-)

I would do it this way, probably in Python, but that is just a matter of personal preference.
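The steps above can also be sketched in awk, the tool the question asks about. POSIX awk has no nested arrays, so this sketch emulates the "dictionary of arrays" with a per-key counter plus entries stored under (key, index). The names group_as_lists, count, and item, and the sample input, are assumptions for illustration. Note that awk's split() fills parts starting at index 1, so parts[1] and parts[2] here correspond to parts[0] and parts[1] in the pseudocode.

```shell
# Hypothetical helper following the pseudocode step by step.
group_as_lists() {
  awk '
    # Skip blank or incomplete lines.
    NF < 2 { next }
    {
      # Split the line on whitespace into parts[1], parts[2], ...
      split($0, parts, " ")
      key = parts[1]
      # Incrementing the counter creates the entry if it is missing,
      # then the value is stored as the next element of that key.
      count[key]++
      item[key, count[key]] = parts[2]
    }
    END {
      for (key in count) {
        line = key
        for (i = 1; i <= count[key]; i++)
          line = line " " item[key, i]
        print line
      }
    }
  ' "$@"
}

# Sample input, assumed for illustration; sort fixes the key order.
printf '%s\n' 'A 200' 'B 300' 'C 400' 'A 100' 'B 600' 'B 700' | group_as_lists | sort
# prints:
# A 200 100
# B 300 600 700
# C 400
```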

【Discussion】:

【Solution 4】:

Using awk and sorting the output inside awk itself (asort() is a GNU awk extension):

awk '
  { data[$1] = (data[$1] ? data[$1] " " : "") $2 } 
  END { 
    for (i in data) { 
      idx[++j] = i 
    } 
    n = asort(idx); 
    for ( i=1; i<=n; i++ ) { 
      print idx[i] " " data[idx[i]] 
    } 
  }
' infile

Using the external program sort:

awk '
  { data[$1] = (data[$1] ? data[$1] " " : "") $2 } 
  END { 
    for (i in data) { 
      print i " " data[i] 
    } 
  }
' infile | sort

The output of both commands is:

A 200 100
B 300 600 700
C 400

【Discussion】:

【Solution 5】:

Using sed:

Contents of script.sed:

## First line. Newline will separate data, so add it after the content.
## Save it in 'hold space' and read next one.
1 {
    s/$/\n/
    h   
    b   
}

## Append content of 'hold space' to current line.
G

## Search if first char (\1) in line was saved in 'hold space' (\4) and add 
## the number (\2) after it.
s/^\(.\)\( *[0-9]\+\)\n\(.*\)\(\1[^\n]*\)/\3\4\2/

## If the last substitution succeeded, go to label 'a'.
ta

## Here last substitution failed, so it is the first appearance of the
## letter, add it at the end of the content.
s/^\([^\n]*\n\)\(.*\)$/\2\1/

## Label 'a'.
:a

## Save content to 'hold space'.
h

## In last line, get content of 'hold space', remove last newline and print.
$ {
    x   
    s/\n*$//
    p   
}

Run it like this:

sed -nf script.sed infile

Result:

A 200 100
B 300 600 700
C 400

【Discussion】:

【Solution 6】:

This might work for you:

sort -sk1,1 file | sed ':a;$!N;s/^\([^ ]*\)\( .*\)\n\1/\1\2/;ta;P;D'
A 200 100
B 300 600 700
C 400

【Discussion】: