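How do I find the count of multiple words in a text file? (Answers)

【Question title】: How do I find the count of multiple words in a text file?
【Posted】: 2011-08-24 07:28:07
【Question】:

I can find the number of times a single word appears in a text file. On Linux, for example, we can use: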

cat filename|grep -c tom

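My question is: how do I find the count of multiple words (such as "tom" and "joe") in a text file?

【Comments】:

grep counts lines, not words. Does a line containing tomtom count as one or two?
What exactly do you want? Multiple counts, one for each word you specify? The sum of the counts of all the words you specify? And what is a "word"? As tchrist already mentioned, your example counts the lines that match a regular expression, not the number of words.

Tags: linux shell

【Solution 1】:

Since you have several names, a regular expression is the way to go here. At first I thought it would be as simple as a grep count on a regex for joe or tom, but that does not account for tom and joe (or tom and tom) appearing on the same line.

test.txt: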

tom is really really cool!  joe for the win!
tom is actually lame.


$ grep -c '\<\(tom\|joe\)\>' test.txt
2

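As the test.txt file shows, 2 is the wrong answer, so we need to account for names that share a line.

I then used grep -o, which shows only the part of a matching line that matches the pattern, so every match of tom or joe in the file is printed on its own line. Piping the result into wc gives the line count.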

$ grep -o '\(joe\|tom\)' test.txt|wc -l
       3

3... the correct answer! Hope this helps.
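If a separate count per name is wanted rather than one total, the same grep -o output can be tallied with sort and uniq -c. The following is only a sketch: `count_each` is a made-up helper name, and it assumes a grep that supports -o and -w (GNU and BSD grep do). Words that never occur are simply not listed.

```shell
#!/bin/sh
# count_each FILE WORD...
# Print "WORD COUNT" for each listed word that occurs at least once in FILE.
# count_each is a hypothetical helper name used only for this sketch.
count_each() {
    file=$1
    shift
    # Join the words into one extended-regex alternation, e.g. tom|joe
    pattern=$(printf '%s|' "$@")
    pattern=${pattern%|}
    # -o prints every match on its own line, -w keeps whole words only;
    # sort | uniq -c tallies identical lines, awk tidies the columns.
    grep -o -w -E -- "$pattern" "$file" | sort | uniq -c | awk '{print $2, $1}'
}
```

With the test.txt above, `count_each test.txt tom joe` should print `joe 1` and `tom 2`.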

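【Comments】:

I modified the regex slightly to handle the tomtom case. Nice test case... thanks for pointing it out.
The really difficult test cases involve overlapping matches of the original words. :) For example, if the words you want to count are cure, core, rely, lysis, island, land, and dish, then you should get 2 hits for things like insecurely and outlandish, and 3 hits for things like islandish and corelysis. A naive approach would count each of these as only one. It is no fun with a single regex, but it is very easy with N of them, one per word.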
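The "one pattern per word" idea from the comment above can be sketched as a loop that runs grep -o once for each word and adds up the results. `count_overlapping` is a hypothetical name, and the words are treated as fixed strings (-F), which is an assumption on my part; a word that overlaps with itself (such as "aa" in "aaa") would still be undercounted.

```shell
#!/bin/sh
# count_overlapping FILE WORD...
# Run one grep per word so that matches of different words may overlap.
# count_overlapping is a hypothetical helper name for this sketch.
count_overlapping() {
    file=$1
    shift
    total=0
    for w in "$@"; do
        # -o prints each match on its own line; -F treats the word literally
        n=$(grep -o -F -- "$w" "$file" | wc -l)
        n=$((n))                       # strip padding some wc versions add
        total=$((total + n))
        printf '%s %s\n' "$w" "$n"
    done
    printf 'total %s\n' "$total"
}
```

For a file containing insecurely, outlandish, islandish, and corelysis, the seven words from the comment give a total of 10, that is 2 + 2 + 3 + 3.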

【Solution 2】:

OK, so first split the file into words, then sort and uniq:

tr -cs '[:alnum:]' '\n' < testdata | sort | uniq -c

~~You use uniq:~~

sort filename | uniq -c

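【Comments】:

Oops. How about I read the question properly next time? *facepalm*
This (split into words, select, count) would be my choice too. When replacing everything that is not :alnum: with \n, you may need to watch out for language differences, e.g. cat Castilian/*.txt | tr A-Z a-z | tr -cs '[a-záóúíéñ]' '\n' | sort | uniq -c | sort -n

【Solution 3】:

Using awk: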

{for (i=1;i<=NF;i++)
    count[$i]++
}
END {
    for (i in count)
        print count[i], i
}

This produces a complete word-frequency count for the input. Pipe the output to grep to pick out the entries you want:

awk -f w.awk input | grep -E 'tom|joe'
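Note that filtering with grep -E 'tom|joe' also keeps entries such as "tomorrow". As a rough alternative, the awk program can be told which words to count so that only exact fields are tallied. This is a sketch, not the answerer's method: `count_wanted` is a hypothetical name, and because awk splits on whitespace, a field like "joe!" does not count as "joe".

```shell
#!/bin/sh
# count_wanted FILE WORD...
# Print "COUNT WORD" for each requested word, including words with zero hits.
# count_wanted is a hypothetical helper name for this sketch.
count_wanted() {
    file=$1
    shift
    awk -v wanted="$*" '
        BEGIN {
            # Turn the space-separated list into a lookup table
            n = split(wanted, w, " ")
            for (i = 1; i <= n; i++) want[w[i]] = 1
        }
        {
            # Count only fields that are exactly one of the wanted words
            for (i = 1; i <= NF; i++) if ($i in want) count[$i]++
        }
        END {
            # Print in the order the words were given; + 0 turns "" into 0
            for (i = 1; i <= n; i++) print count[w[i]] + 0, w[i]
        }
    ' "$file"
}
```

For the test.txt shown earlier, `count_wanted test.txt tom joe` should print `2 tom` and `1 joe`.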

By the way, you do not need cat in your example; most programs that act as filters can take a filename as an argument, so it is better to use

grep -c tom filename

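If you don't, there is a good chance people will start throwing the Useless Use of Cat Award at you ;-)

【Comments】:

"Most programs that act as filters can take a filename as an argument"... and even when they don't, you can still use input redirection (e.g. grep -c tom < filename).
grep -c does not look for whole words, so you have to search for the word as such.

【Solution 4】:

The example you gave does not search for the word "tom". It will also count "atom", "bottom", and so on.
Grep searches for regular expressions. A regular expression that matches the word "tom" or "joe" is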
```
\<\(tom\|joe\)\>
```
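The difference between a substring match and a word match is easy to check on a tiny sample. The two helpers below are hypothetical names for this sketch, and they assume a grep with -w (GNU and BSD grep both have it), which has much the same effect as wrapping the pattern in \< and \>. Both still count lines, not occurrences.

```shell
#!/bin/sh
# Hypothetical helpers for comparing substring and whole-word line counts.

# substring_lines PATTERN FILE: lines containing PATTERN anywhere
substring_lines() {
    grep -c -- "$1" "$2"
}

# word_lines PATTERN FILE: lines containing PATTERN as a whole word
word_lines() {
    grep -c -w -- "$1" "$2"
}
```

On a file with the three lines `atom`, `bottom`, and `tom and joe`, `substring_lines tom FILE` reports 3 while `word_lines tom FILE` reports 1.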

【Comments】:

【Solution 5】:

You can do it with a regular expression:

 cat filename |tr ' ' '\n' |grep -c -e "\(joe\|tom\)"

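【Comments】:

Your solution even handles joe and tom on the same line. Nice!
@Travis: However, it incorrectly counts tomtom only once, even though even my grandpa can see that there are two toms present.

【Solution 6】:

Here's one: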

cat txt | tr -s '[:punct:][:space:][:blank:]'| tr '[:punct:][:space:][:blank:]' '\n\n\n' | tr -s '\n' | sort | uniq -c

Update

A shell script solution:

#!/bin/bash

file_name="$2"
string="$1"

if [ $# -ne 2 ]
  then
   echo "Usage: $0 <pattern to search> <file_name>"
   exit 1
fi

if [ ! -f "$file_name" ]
 then
  echo "file \"$file_name\" does not exist, or is not a regular file"
  exit 2
fi

line_no_list=("")
curr_line_indx=1
line_no_indx=0
total_occurance=0

# line_no_list[k] holds a line number and line_no_list[k+1] holds the
# number of times the string occurs on that line
while IFS= read -r line
 do
  flag=0
  while [[ "$line" == *"$string"* ]]
   do
    flag=1
    line_no_list[line_no_indx]=$curr_line_indx
    line_no_list[line_no_indx+1]=$((line_no_list[line_no_indx+1]+1))
    total_occurance=$((total_occurance+1))
# remove the first occurrence of "$string" from the line and recheck
    line=${line/"$string"/}
  done
# if we have entered the while loop then increment the
# line index to access the next array pos in the next
# iteration
  if (( flag == 1 ))
   then
    line_no_indx=$((line_no_indx+2))
  fi
  curr_line_indx=$((curr_line_indx+1))
done < "$file_name"


echo -e "\nThe string \"$string\" occurs \"$total_occurance\" times"
echo -e "The string \"$string\" occurs in \"$((line_no_indx/2))\" lines"
echo "[Occurrence # : Line Number : No. of Occurrences in this line]: "

for ((i=0; i<line_no_indx; i=i+2))
 do
  echo "$((i/2+1)) : ${line_no_list[i]} : ${line_no_list[i+1]} "
done

echo

【Comments】:

【Solution 7】:

I completely forgot about grep -f:

cat filename | grep -c -f names

AWK solution:

Assuming the names are in a file called names:

cat filename | awk 'NR==FNR {h[NR] = $1;ct[NR] = 0; cnt=NR} NR !=FNR {for(i=1;i<=cnt;++i) if(match($0,h[i])!=0) ++ct[i] } END {for(i in h) print h[i], ct[i]}' names -

Note that your original grep does not search for words. For example

$ echo tomorrow | grep -c tom
1

You need grep -w

【Comments】:

【Solution 8】:

gawk -v RS='[^[:alpha:]]+' '{print}' | grep -cE '^(tom|joe|bob|sue)$'

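The gawk program sets the record separator to any run of non-alphabetic characters, so each word ends up on a line of its own. grep then counts the lines that exactly match one of the words you want.

We use gawk because POSIX awk does not allow a regular expression as the record separator.

For brevity, you can replace '{print}' with 1; either way it is an Awk program that simply prints every input record ("Is 1 true? It is? Then perform the default action, which is {print}.")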
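Where gawk is not available, the word-splitting step can probably be done with tr instead, as in Solution 2, and grep -x can stand in for the ^...$ anchors. This is a sketch under those assumptions; `count_exact_words` is a hypothetical name, and grep exits with status 1 when the count is 0.

```shell
#!/bin/sh
# count_exact_words FILE WORD...
# Print how many words in FILE are exactly one of the given WORDs.
# count_exact_words is a hypothetical helper name for this sketch.
count_exact_words() {
    file=$1
    shift
    # Join the words into an extended-regex alternation, e.g. tom|joe|bob|sue
    pattern=$(printf '%s|' "$@")
    pattern=${pattern%|}
    # tr puts each run of letters on its own line;
    # grep -x only accepts lines that match the pattern in full.
    tr -cs '[:alpha:]' '\n' < "$file" | grep -c -x -E -- "$pattern"
}
```

For the two-line test.txt used in Solution 1, `count_exact_words test.txt tom joe bob sue` should print 3; a word such as tomtom is one word here and is not counted.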

【Comments】:

【Solution 9】:

Find all matches on all lines

echo "tom is really really cool!  joe for the win!
tom is actually lame." | awk '{i+=gsub(/tom|joe/,"")} END {print i}'
3

This counts "tomtom" as 2 hits.
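The gsub return value can also give a per-word breakdown if each word is substituted on its own copy of the line. The sketch below assumes the words contain no regex metacharacters (awk treats them as dynamic regexes), and `count_substrings` is a hypothetical name.

```shell
#!/bin/sh
# count_substrings FILE WORD...
# Print "WORD COUNT" using the number of substitutions gsub reports per word.
# count_substrings is a hypothetical helper name for this sketch.
count_substrings() {
    file=$1
    shift
    awk -v words="$*" '
        BEGIN { n = split(words, w, " ") }
        {
            for (i = 1; i <= n; i++) {
                line = $0                           # work on a copy of the record
                hits[i] += gsub(w[i], "", line)     # gsub returns the match count
            }
        }
        END { for (i = 1; i <= n; i++) print w[i], hits[i] + 0 }
    ' "$file"
}
```

As with the one-liner above, a line containing tomtom contributes 2 to the count for tom.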

【Comments】: