Count of unique lines based on first field in file

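【Question title】: Count of unique lines based on first field in file
【Posted】: 2012-11-29 03:10:33
【Question description】:

I am trying to output a count of the unique lines in a file, based on the first field. The input lines look like this: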

Forms.js     /forms/Forms.js     http://www.gumby.com/test.htm   404
Forms.js     /forms/Forms1.js    http://www.gumby.com/test.htm   404
Forms.js     /forms/Forms2.js    http://www.gumby.com/test.htm   404
Interpret.js     /forms/Interpret1.js    http://www.gumby.com/test.htm   404    
Interpret.js     /forms/Interpret2.js    http://www.gumby.com/test.htm   404
Interpret.js     /forms/Interpret3.js    http://www.gumby.com/test.htm   404

and I want to turn them into something like this:

3    Forms.js    /forms/Forms.js     http://www.gumby.com.mx/test.htm 404
3    Interpret.js    /forms/Interpret.js    http://www.gumby.com.mx/test.htm  404

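I have been trying various combinations of sort and uniq, but have not hit it yet. I can get distinct lines using the whole line, but I only want the first field. I am currently using cygwin. I am not awk-literate, but I suspect that is the way to go. Does anyone have a handy solution?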

【Question discussion】:

Tags: sed awk uniq

【Solution 1】:

Awk is the tool for this, but if you want to be clever with uniq:

$ column -t file | uniq -w12 -c
      3 Forms.js      /forms/Forms.js       http://www.gumby.com/test.htm  404
      3 Interpret.js  /forms/Interpret1.js  http://www.gumby.com/test.htm  404

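column -t aligns all the columns, so the first column gets a fixed width (12 characters here, the length of Interpret.js) that uniq -w can compare on.

If column isn't available, a hack is to append the first column to the end of each line with awk, use uniq -c -f4 to test uniqueness on that last column only, and then use awk again to print all but the last field.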

$ awk '{print $0, $1}' file | uniq -c -f4 | awk '{$NF=""; NF--; print}'
3 Forms.js /forms/Forms.js http://www.gumby.com/test.htm 404
3 Interpret.js /forms/Interpret1.js http://www.gumby.com/test.htm 404

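It would be nice if uniq -f worked like -f4,4 or -f1,1.

Or you could use rev to reverse each line so that uniq -c -f3 does the job, then rev back (but the count ends up at the end of the line, and if you don't have column you probably don't have rev either):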

$ rev file | uniq -c -f3 | rev
Forms.js /forms/Forms.js http://www.gumby.com/test.htm 404 3      
Interpret.js /forms/Interpret1.js http://www.gumby.com/test.htm 404 3
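
One caveat that applies to every uniq pipeline above: uniq only merges adjacent lines, so the input has to be grouped by its first field already, as the sample is. For ungrouped input, a minimal sketch would be to sort on field 1 first. This assumes a sort that supports the -s (stable) flag, as GNU sort does; the shortened sample data and the variable name grouped_counts are made up for illustration.

```shell
# Hypothetical ungrouped input (not from the question), written to a temp file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Interpret.js /forms/Interpret1.js http://www.gumby.com/test.htm 404
Forms.js /forms/Forms.js http://www.gumby.com/test.htm 404
Interpret.js /forms/Interpret2.js http://www.gumby.com/test.htm 404
Forms.js /forms/Forms1.js http://www.gumby.com/test.htm 404
EOF

# Sort on field 1 only; -s keeps the original order inside each group,
# so uniq still reports the first line seen for each key.
grouped_counts=$(LC_ALL=C sort -s -k1,1 "$sample" |
  awk '{print $0, $1}' | uniq -c -f4 | awk '{$NF=""; NF--; print}')

printf '%s\n' "$grouped_counts"
rm -f "$sample"
```

Without -s the counts are the same, but which line of each group gets printed is no longer guaranteed to be the first one in the file.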

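【Discussion】:

Unfortunately, cygwin doesn't seem to support "column"; otherwise, this looks like just what I need.
column is provided by the util-linux package in Cygwin :)
Added a hack for when column is not available.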

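【Solution 2】:

I'd just cut -f 1 | uniq -c. This won't give you the whole line, but if the lines differ, it doesn't make much sense to print any one of them anyway. It depends on what you are trying to achieve.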
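
One thing to check before relying on this: cut splits on tabs by default, so if the fields are separated by spaces, cut -f 1 passes each line through whole. A sketch that sidesteps the delimiter question by letting awk split on any run of whitespace; the shortened sample data and the variable name first_field_counts are illustrative assumptions.

```shell
# Hypothetical space-separated sample with uneven gaps between fields.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Forms.js     /forms/Forms.js     http://www.gumby.com/test.htm   404
Forms.js     /forms/Forms1.js    http://www.gumby.com/test.htm   404
Interpret.js     /forms/Interpret1.js    http://www.gumby.com/test.htm   404
EOF

# awk's default field splitting treats any run of blanks or tabs as one separator.
# The trailing awk only strips the left padding that uniq -c puts before the count.
first_field_counts=$(awk '{ print $1 }' "$sample" | uniq -c | awk '{ print $1, $2 }')

printf '%s\n' "$first_field_counts"
rm -f "$sample"
```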

【Discussion】:

【Solution 3】:

You can count the occurrences of the first field with cut, but what do you want printed after that field?

cut -d " " -f 1 file | uniq -c

【Discussion】:

【Solution 4】:

This:

<infile awk '{ h[$1]++ } END { for(k in h) print h[k], k }'

will get you:

3 Forms.js
3 Interpret.js

If you also want to keep the first hit for each key, use:

<infile awk '!h[$1] { g[$1]=$0 } { h[$1]++ } END { for(k in g) print h[k], g[k] }'

Output:

3 Forms.js /forms/Forms.js http://www.gumby.com/test.htm 404
3 Interpret.js /forms/Interpret1.js http://www.gumby.com/test.htm 404

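Tested with GNU awk.

Note that this does not require the input to be sorted. Also note that the order of the results is unspecified, because for (k in ...) visits the keys in arbitrary order.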
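
If the order matters, one option is to record each key the first time it appears and walk that list in END instead of using for (k in ...). A sketch under that assumption; the array names count, first and order, the variable ordered_counts and the shortened sample data are all arbitrary choices for illustration.

```shell
# Hypothetical ungrouped sample (not from the question).
sample=$(mktemp)
cat > "$sample" <<'EOF'
Interpret.js /forms/Interpret1.js http://www.gumby.com/test.htm 404
Forms.js /forms/Forms.js http://www.gumby.com/test.htm 404
Interpret.js /forms/Interpret2.js http://www.gumby.com/test.htm 404
EOF

ordered_counts=$(awk '
  # New key: remember its first line and its position in the input.
  !($1 in count) { first[$1] = $0; order[++n] = $1 }
  # Every line: bump the count for its key.
  { count[$1]++ }
  # Report the keys in first-appearance order.
  END { for (i = 1; i <= n; i++) print count[order[i]], first[order[i]] }
' "$sample")

printf '%s\n' "$ordered_counts"
rm -f "$sample"
```

Piping the original one-liner through sort is the other obvious route, at the cost of ordering by count or key rather than by appearance.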

【Discussion】:

【Solution 5】:

Assuming file.txt contains your sample input:

sort file.txt | awk -f counts.awk

returns:

3:Forms.js     /forms/Forms.js     http://www.gumby.com/test.htm   404
3:Interpret.js     /forms/Interpret1.js    http://www.gumby.com/test.htm   404

The awk script file:

cat counts.awk

#  output format is:
#+ TimesFirstFieldIsRepeated:FirstMatchingLineContents

BEGIN {

  plmatch="";
  outline="";
  n=0;

 }

{

 if($1 != plmatch || NR == 1)
  {
   # first field changed: flush the previous group, then start a new one
   if(NR != 1){
     print n ":" outline;
    }
   n=0;
   outline=$0;
  }

 n+=1;
 plmatch=$1;

}

END {
  # flush the last group (nothing to print if the input was empty)
  if(NR > 0){
    print n ":" outline;
   }
 }

【Discussion】:

【Solution 6】:

$ awk '!c[$1]++{v[$1]=$0} END{for (i in c) print c[i],v[i]}' file
3 Forms.js     /forms/Forms.js     http://www.gumby.com/test.htm   404
3 Interpret.js     /forms/Interpret1.js    http://www.gumby.com/test.htm   404

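The above uses the common awk idiom '!array[$n]++' to test whether a key ($n, where n is $0 or $1 or $4,$5 or ...) has been seen before; the expression is true only the first time a given key appears.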
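
To see the idiom on its own: seen[$1]++ evaluates to the old count, which is 0 (false) the first time a key shows up, so !seen[$1]++ is true exactly once per key. With no action attached, awk just prints the matching line, which gives a de-duplicate-by-first-field one-liner. A small sketch; the array name seen, the variable first_lines and the shortened sample lines are illustrative.

```shell
# Keep only the first line for each distinct first field.
first_lines=$(printf '%s\n' \
  'Forms.js /forms/Forms.js' \
  'Forms.js /forms/Forms1.js' \
  'Interpret.js /forms/Interpret1.js' |
  awk '!seen[$1]++')

printf '%s\n' "$first_lines"
```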

【Discussion】: