uniq sort parsing

【Question title】: uniq sort parsing
【Posted】: 2018-09-29 23:30:59
【Question】:

I have a file whose fields are separated by ";", like this:

test;group;10.10.10.10;action2
test2;group;10.10.13.11;action1
test3;group3;10.10.10.10;action3
tes4;group;10.10.10.10;action4
test5;group2;10.10.10.12;action5
test6;group4;10.10.13.11;action8

I want to identify all lines whose IP address (column 3) is not unique. For the sample, the result should be:

test;group;10.10.10.10;action2
test3;group3;10.10.10.10;action3
tes4;group;10.10.10.10;action4
test2;group;10.10.13.11;action1
test6;group4;10.10.13.11;action8

sorted by IP address (column 3).

I would like to use simple commands such as cat, uniq, sort, awk (not Perl, not Python, only shell).

Any ideas?

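【Comments on the question】:

If you are using cat, uniq, sort, and awk, then you are not using "shell only". From the shell's point of view, Perl and Python are equivalent to Awk.
Except that awk is standard on every UNIX installation, whereas perl and python are not. In terms of availability, awk is closer to sed and grep than to perl or python.

Tags: shell sorting awk uniq

【Solution 1】: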

$ awk -F';' 'NR==FNR{a[$3]++;next}a[$3]>1' file file|sort -t";" -k3
test;group;10.10.10.10;action2
test3;group3;10.10.10.10;action3
tes4;group;10.10.10.10;action4
test2;group;10.10.13.11;action1
test6;group4;10.10.13.11;action8

awk selects every line whose IP ($3) occurs more than once
sort orders those lines by IP
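
The pipeline above can be unpacked as follows. This is a minimal sketch rather than the answerer's exact command: the function name `dup_by_ip` and the temporary sample file are illustrative assumptions, and `sort -s -k3,3` is swapped in for `-k3`. With `-k3` the sort key runs from field 3 to the end of the line, so the action column also influences the order; `-k3,3` restricts the key to the IP, and `-s` (a stable sort, available in GNU and BSD sort but not required by POSIX) keeps lines that share an IP in their original file order.

```shell
#!/bin/sh
# Sample data from the question, written to a temporary file.
file=$(mktemp)
cat > "$file" <<'EOF'
test;group;10.10.10.10;action2
test2;group;10.10.13.11;action1
test3;group3;10.10.10.10;action3
tes4;group;10.10.10.10;action4
test5;group2;10.10.10.12;action5
test6;group4;10.10.13.11;action8
EOF

# dup_by_ip is a hypothetical helper name, not part of the original answer.
# The file is passed to awk twice:
#   first pass  (NR == FNR): count how often each IP in field 3 occurs;
#   second pass: print only lines whose IP was counted more than once.
# sort then keys on field 3 alone; -s keeps input order among equal keys.
dup_by_ip() {
    awk -F';' 'NR == FNR { seen[$3]++; next } seen[$3] > 1' "$1" "$1" |
        sort -s -t';' -k3,3
}

dup_by_ip "$file"
```

Run against the sample, this should print the five lines requested in the question, grouped by IP.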

【Solution 2】:

You can also try this solution, which uses grep, cut, sort, and uniq with a process substitution in the middle.

grep -f <(cut -d ';' -f3 file | sort | uniq -d) file | sort -t ';' -k3

It is not very elegant (I actually prefer the awk answer given above), but I think it is worth sharing because it does what you want.
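
One caveat with `grep -f` used this way: each IP is treated as a regular expression and may match anywhere in a line, so a duplicated IP such as 10.10.10.1 would also select lines containing 10.10.10.10. A hedged variant is sketched below. It assumes the IP is always surrounded by `;` on both sides (true for the four-field sample) and that your grep accepts `-f -` to read patterns from standard input (GNU and BSD grep do). It wraps each duplicated IP in delimiters and matches it as a fixed string. The function name `dup_by_ip_grep` is illustrative only.

```shell
#!/bin/sh
# Sample data from the question, written to a temporary file.
file=$(mktemp)
cat > "$file" <<'EOF'
test;group;10.10.10.10;action2
test2;group;10.10.13.11;action1
test3;group3;10.10.10.10;action3
tes4;group;10.10.10.10;action4
test5;group2;10.10.10.12;action5
test6;group4;10.10.13.11;action8
EOF

# dup_by_ip_grep is a hypothetical helper name.
# 1. cut | sort | uniq -d : list the IPs that occur more than once.
# 2. sed                  : wrap each one as ";IP;" so it cannot match a
#                           longer IP that merely starts with the same text.
# 3. grep -F -f -         : read those patterns from stdin as fixed strings.
# 4. sort -s -k3,3        : order by IP, keeping file order within a group.
dup_by_ip_grep() {
    cut -d';' -f3 "$1" | sort | uniq -d | sed 's/.*/;&;/' |
        grep -F -f - "$1" |
        sort -s -t';' -k3,3
}

dup_by_ip_grep "$file"
```

Reading the patterns from a pipe also avoids process substitution, so the sketch does not depend on bash. A line whose second field happens to equal a duplicated IP would still be selected, which is why the awk answers remain the more precise choice.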

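【Solution 3】:

This is very similar to Kent's answer, but it makes only one pass through the file. The trade-off is memory: you need to store the lines you want to keep. It relies on GNU awk for the PROCINFO variable.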

awk -F';' '
    {count[$3]++; lines[$3] = lines[$3] $0 ORS} 
    END {
        PROCINFO["sorted_in"] = "@ind_str_asc"
        for (key in count) 
            if (count[key] > 1) 
                printf "%s", lines[key]
    }
' file
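
If GNU awk is not available, the same one-pass idea can be sketched with any POSIX awk by leaving the ordering to sort instead of PROCINFO. This is a variant under stated assumptions, not the answerer's code: `dup_by_ip_one_pass` is a made-up name, the sample file is created only for illustration, and `sort -s` (stable sort) is a common extension rather than a POSIX requirement.

```shell
#!/bin/sh
# Sample data from the question, written to a temporary file.
file=$(mktemp)
cat > "$file" <<'EOF'
test;group;10.10.10.10;action2
test2;group;10.10.13.11;action1
test3;group3;10.10.10.10;action3
tes4;group;10.10.10.10;action4
test5;group2;10.10.10.12;action5
test6;group4;10.10.13.11;action8
EOF

# dup_by_ip_one_pass is a hypothetical helper name.
# awk buffers every line under its IP (field 3) while counting occurrences,
# then emits only the groups seen more than once. The for-in order is
# unspecified in awk, so a stable sort on field 3 puts the groups in order
# while keeping the original line order inside each group.
dup_by_ip_one_pass() {
    awk -F';' '
        { count[$3]++; lines[$3] = lines[$3] $0 ORS }
        END {
            for (key in count)
                if (count[key] > 1)
                    printf "%s", lines[key]
        }
    ' "$1" | sort -s -t';' -k3,3
}

dup_by_ip_one_pass "$file"
```

The memory trade-off is unchanged: awk still holds every line until the end of input, and sort then buffers the selected lines again.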

The equivalent Perl:

perl -F';' -lane '
    $count{$F[2]}++; push @{$lines{$F[2]}}, $_
  } END {
    print join $/, @{$lines{$_}}
        for sort grep {$count{$_} > 1} keys %count
' file

【Solution 4】:

Here is another awk-assisted pipeline:

$ awk -F';' '{print $0 "\t" $3}' file | sort -sk2 | uniq -Df1 | cut -f1

test;group;10.10.10.10;action2
test3;group3;10.10.10.10;action3
tes4;group;10.10.10.10;action4
test2;group;10.10.13.11;action1
test6;group4;10.10.13.11;action8

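It reads the file in a single pass, so no special caching is needed; it also preserves the original order within each IP (stable sort). It assumes that tabs do not appear in the fields.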

【Solution 5】:

awk + sort + uniq + cut:

$ awk -F ';' '{print $0,$3}' file | sort -k2 | uniq -D -f1 | cut -d' ' -f1

sort + awk

$ sort -t';' -k3,3 file | awk -F ';' '($3==k){c++;b=b"\n"$0}($3!=k){if (c>1) print b;c=1;k=$3;b=$0}END{if(c>1)print b}'

awk

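$ awk -F ';' '{b[$3"_"++k[$3]]=$0; }
      END{for (i in k) if(k[i]>1) for(j=1;j<=k[i];j++) print b[i"_"j] }' file

This buffers the whole file (as sort does) and uses k to track how many times each key appears. At the end, for every key that appeared more than once, the complete set of its lines is printed. The order of the groups is not guaranteed, for example: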

test2;group;10.10.13.11;action1
test6;group4;10.10.13.11;action8
test;group;10.10.10.10;action2
test3;group3;10.10.10.10;action3
tes4;group;10.10.10.10;action4

If you want it sorted (asorti requires GNU awk):

$ awk -F ';' '{b[$3"_"++k[$3]]=$0; }
      END{ n=asorti(k,l);
      for (i=1;i<=n;i++) if(k[l[i]]>1) for(j=1;j<=k[l[i]];j++) print b[l[i]"_"j] }' file
