Counting the number of occurrences of words in a text file: answers

【Question title】: Counting the number of occurrences of words in a textfile
【Posted】: 2008-12-25 22:34:51
【Question】:

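How would I go about keeping track of the number of times a word appears in a text file? I would like to do this for every word.

For example, if the input is something like:

"the man said hi to the boy."

Each of "man said hi to boy" would have an occurrence count of 1.

"the" would have an occurrence count of 2.

I was thinking of keeping a dictionary of word/occurrence pairs, but I'm not sure how to implement this in C. A link to any similar or related problem with a solution would be great.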

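EDIT: To avoid rolling my own hash table, I decided to learn how to use glib. Along the way I found an excellent tutorial that walks through a similar problem. http://bo.majewski.name/bluear/gnu/GLib/ch03s03.html

I am amazed by the number of different approaches, and in particular by the simplicity and elegance of the Ruby implementation.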

【Comments】:

When did this topic turn into "share your solution to this problem in the language of your choice"?

Tags: c algorithm nlp counting

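【Solution 1】:

Yes, a dictionary of word/occurrence pairs would work fine, and the usual way to implement such a dictionary is a hash table (or, sometimes, a binary search tree).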
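For the binary search tree variant, here is a minimal sketch in C. The names `bst_node`, `bst_add`, `bst_count`, `bst_print` and `bst_free` are made up for illustration, and the tree is not balanced, so input that arrives already sorted degrades it to a linked list.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* One node per distinct word, ordered by strcmp (hypothetical layout). */
struct bst_node {
    char *word;
    int count;
    struct bst_node *left, *right;
};

/* Counts one occurrence of word. Returns 0 on success, -1 if out of memory. */
int bst_add(struct bst_node **root, const char *word)
{
    /* Walk down until the word is found or an empty link is reached. */
    while (*root != NULL) {
        int cmp = strcmp(word, (*root)->word);
        if (cmp == 0) {
            (*root)->count++;
            return 0;
        }
        root = cmp < 0 ? &(*root)->left : &(*root)->right;
    }

    struct bst_node *node = malloc(sizeof *node);
    if (node == NULL)
        return -1;
    node->word = malloc(strlen(word) + 1);
    if (node->word == NULL) {
        free(node);
        return -1;
    }
    strcpy(node->word, word);
    node->count = 1;
    node->left = node->right = NULL;
    *root = node;   /* hook the new node into the empty link */
    return 0;
}

/* Returns how often word was added, 0 if it was never seen. */
int bst_count(const struct bst_node *root, const char *word)
{
    while (root != NULL) {
        int cmp = strcmp(word, root->word);
        if (cmp == 0)
            return root->count;
        root = cmp < 0 ? root->left : root->right;
    }
    return 0;
}

/* In-order traversal, so the words come out in strcmp order. */
void bst_print(const struct bst_node *root)
{
    if (root == NULL)
        return;
    bst_print(root->left);
    printf("%s %d\n", root->word, root->count);
    bst_print(root->right);
}

void bst_free(struct bst_node *root)
{
    if (root == NULL)
        return;
    bst_free(root->left);
    bst_free(root->right);
    free(root->word);
    free(root);
}
```

To use it, start with `struct bst_node *root = NULL;`, call `bst_add(&root, word)` for every word read from the file, then `bst_print(root)` and `bst_free(root)`.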

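You could also use a trie (or its compressed version, the "Patricia trie"/radix trie), whose complexity is asymptotically optimal for this problem, although I suspect that in practice it might be slower than a (good) hash table implementation.

[I really think whether hash tables or tries are better depends on the distribution of words in your input. For example, a hash table needs to store each word in its hash bucket (to guard against collisions), whereas in a trie, if you have a lot of words with common prefixes, those prefixes are shared and stored only once each, but there is still the overhead of all the pointers... If you do happen to try both, I'm curious to know how they compare.]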
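A rough sketch of the trie idea in C, under some loud assumptions: words are folded to lowercase, only the letters 'a' to 'z' are accepted, and the character set keeps those letters contiguous (as ASCII does). All names here (`trie_node`, `trie_new`, `trie_add`, `trie_count`, `trie_free`) are hypothetical, and this is the plain uncompressed trie, not a Patricia trie.

```c
#include <ctype.h>
#include <stdlib.h>

#define TRIE_FANOUT 26   /* assumption: only 'a'..'z' after case folding */

struct trie_node {
    struct trie_node *child[TRIE_FANOUT];
    int count;           /* times the word ending at this node was seen */
};

/* Returns an empty node, or NULL if out of memory. */
struct trie_node *trie_new(void)
{
    struct trie_node *node = malloc(sizeof *node);
    if (node == NULL)
        return NULL;
    for (int i = 0; i < TRIE_FANOUT; i++)
        node->child[i] = NULL;
    node->count = 0;
    return node;
}

/* Maps a letter to its child slot, or -1 for anything outside a..z. */
static int trie_index(char ch)
{
    int c = tolower((unsigned char)ch);
    return (c >= 'a' && c <= 'z') ? c - 'a' : -1;
}

/* Counts one occurrence of word.
   Returns 0 on success, -1 on a non-letter character or out of memory. */
int trie_add(struct trie_node *root, const char *word)
{
    struct trie_node *node = root;
    for (; *word != '\0'; word++) {
        int i = trie_index(*word);
        if (i < 0)
            return -1;
        if (node->child[i] == NULL) {
            node->child[i] = trie_new();
            if (node->child[i] == NULL)
                return -1;
        }
        node = node->child[i];   /* shared prefixes reuse the same nodes */
    }
    node->count++;
    return 0;
}

/* Returns how often word was added, 0 if it was never seen. */
int trie_count(const struct trie_node *root, const char *word)
{
    const struct trie_node *node = root;
    for (; *word != '\0' && node != NULL; word++) {
        int i = trie_index(*word);
        if (i < 0)
            return 0;
        node = node->child[i];
    }
    return node != NULL ? node->count : 0;
}

void trie_free(struct trie_node *node)
{
    if (node == NULL)
        return;
    for (int i = 0; i < TRIE_FANOUT; i++)
        trie_free(node->child[i]);
    free(node);
}
```

Each node carries 26 pointers whether or not they are used, which is the pointer overhead mentioned above; "the" and "then" share the three nodes for t, h and e.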

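【Discussion】:

【Solution 2】:

Just for the curious, here is a simple Ruby solution to the word count problem. The algorithm in C would be basically the same, only with a lot more code.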

h = Hash.new(0)
File.read("filename.txt").split.each do |w|
  h[w] += 1
end
p h

【Discussion】:

【Solution 3】:

Does this count?

#include <stdio.h>
#include <stdlib.h>
int main(int argc, char **argv)
{
    char buffer[2048];
    if (argc != 2)
    {
        fprintf(stderr, "Usage: %s file\n", argv[0]);
        exit(EXIT_FAILURE);
    }
    snprintf(buffer, sizeof(buffer), "tr -cs '[a-z][A-Z]' '[\\n*]' < %s |"
                                     " sort | uniq -c | sort -n", argv[1]);
    return(system(buffer));
}

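It basically wraps the canonical shell script that shows how to count words on Unix.

The 'tr' command translates non-alphabetic characters into newlines and squeezes out duplicates. The first 'sort' groups all occurrences of each word together. 'uniq -c' counts the consecutive occurrences of each word and prints the word with its count. The second 'sort' puts them in order of increasing count. You may need to adjust the options to 'tr'; it is not the most consistent command from system to system, and it regularly sends me back to the manual. On Solaris 10 using /usr/bin/tr, the code above produces (on its own source):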

   1
   1 A
   1 EXIT
   1 FAILURE
   1 Usage
   1 Z
   1 a
   1 c
   1 cs
   1 exit
   1 file
   1 fprintf
   1 if
   1 main
   1 return
   1 sizeof
   1 snprintf
   1 stderr
   1 stdio
   1 stdlib
   1 system
   1 tr
   1 uniq
   1 z
   2 argc
   2 char
   2 h
   2 include
   2 int
   2 s
   2 sort
   3 argv
   3 n
   4 buffer
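The same pipeline can be done inside a C program without a shell. The sketch below follows the four stages described above (split into runs of letters, sort, count equal neighbours, sort by count); `word_count` and `count_words_sorted` are names invented for this example, and like the 'tr' command it is case-sensitive. It modifies the text buffer in place, and the returned words point into that buffer.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

struct word_count {
    const char *word;   /* points into the caller's text buffer */
    int count;
};

/* qsort comparator for an array of char* */
static int cmp_str(const void *a, const void *b)
{
    return strcmp(*(char *const *)a, *(char *const *)b);
}

/* qsort comparator: ascending count, ties broken alphabetically */
static int cmp_count(const void *a, const void *b)
{
    const struct word_count *x = a, *y = b;
    if (x->count != y->count)
        return x->count < y->count ? -1 : 1;
    return strcmp(x->word, y->word);
}

/* Fills out[] with at most max_out distinct words and their counts.
   Returns the number of entries written, or (size_t)-1 if out of memory. */
size_t count_words_sorted(char *text, struct word_count *out, size_t max_out)
{
    size_t len = strlen(text), nwords = 0, nout = 0, i;
    /* A text of len bytes holds at most len/2 + 1 words. */
    char **words = malloc((len / 2 + 1) * sizeof *words);
    if (words == NULL)
        return (size_t)-1;

    /* Stage 1 (tr): cut the text into runs of letters, in place. */
    for (i = 0; i < len; ) {
        if (isalpha((unsigned char)text[i])) {
            words[nwords++] = &text[i];
            while (i < len && isalpha((unsigned char)text[i]))
                i++;
            if (i < len)
                text[i++] = '\0';
        } else {
            i++;
        }
    }

    /* Stage 2 (sort): bring equal words next to each other. */
    qsort(words, nwords, sizeof *words, cmp_str);

    /* Stage 3 (uniq -c): collapse each run of equal words into one entry. */
    for (i = 0; i < nwords && nout < max_out; ) {
        size_t j = i + 1;
        while (j < nwords && strcmp(words[i], words[j]) == 0)
            j++;
        out[nout].word = words[i];
        out[nout].count = (int)(j - i);
        nout++;
        i = j;
    }
    free(words);

    /* Stage 4 (sort -n): order the result by increasing count. */
    qsort(out, nout, sizeof *out, cmp_count);
    return nout;
}
```

Sorting costs O(n log n) comparisons, so a hash table is likely faster on large inputs, but this version needs no extra data structure beyond one array of pointers.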

【Discussion】:

This works nicely. I didn't know about the 'tr' command. Thanks for sharing, and for the great explanation :-)

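【Solution 4】:

You could use a hash table, and have every entry in the hash table point to a structure containing both the word and the number of times it has been found so far.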
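A minimal sketch of that layout in C, assuming a fixed number of buckets (no resizing) and linked-list chaining for words that land in the same bucket. The names `entry`, `word_table`, `table_add`, `table_count` and `table_free` are made up for this example, and the hash is the simple djb2 multiply-by-33 scheme, which is one reasonable choice among many.

```c
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 1024   /* assumption: fixed size, never resized */

/* The structure each hash table slot points to: a word and its count. */
struct entry {
    char *word;
    int count;
    struct entry *next;   /* next entry that hashed to the same bucket */
};

/* Must start out zeroed, e.g. declared static or with memset. */
struct word_table {
    struct entry *bucket[NBUCKETS];
};

/* djb2 string hash: h = h * 33 + c, starting from 5381 */
static unsigned long hash_word(const char *s)
{
    unsigned long h = 5381;
    while (*s != '\0')
        h = h * 33 + (unsigned char)*s++;
    return h;
}

/* Counts one occurrence of word. Returns 0 on success, -1 if out of memory. */
int table_add(struct word_table *t, const char *word)
{
    struct entry **slot = &t->bucket[hash_word(word) % NBUCKETS];
    struct entry *e;

    /* Compare the actual strings: different words can share a bucket. */
    for (e = *slot; e != NULL; e = e->next) {
        if (strcmp(e->word, word) == 0) {
            e->count++;
            return 0;
        }
    }

    e = malloc(sizeof *e);
    if (e == NULL)
        return -1;
    e->word = malloc(strlen(word) + 1);
    if (e->word == NULL) {
        free(e);
        return -1;
    }
    strcpy(e->word, word);
    e->count = 1;
    e->next = *slot;   /* push onto the front of the bucket's chain */
    *slot = e;
    return 0;
}

/* Returns how often word was added, 0 if it was never seen. */
int table_count(const struct word_table *t, const char *word)
{
    const struct entry *e = t->bucket[hash_word(word) % NBUCKETS];
    for (; e != NULL; e = e->next)
        if (strcmp(e->word, word) == 0)
            return e->count;
    return 0;
}

/* Releases every entry and leaves the table empty and reusable. */
void table_free(struct word_table *t)
{
    for (size_t i = 0; i < NBUCKETS; i++) {
        struct entry *e = t->bucket[i];
        while (e != NULL) {
            struct entry *next = e->next;
            free(e->word);
            free(e);
            e = next;
        }
        t->bucket[i] = NULL;
    }
}
```

Because the stored word is compared with `strcmp` on every lookup, two different words that hash to the same bucket never share a count, so no perfect hash function is needed.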

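【Discussion】:

Could there be a collision when two different words hash to the same entry? Would I have to do some checking on the entry, or is there a perfect hash function? I'm a little rusty, but I'll do my research. Thanks.
This is the usual approach. You need to handle collisions, usually by making each hash bucket a linked list of word-count structures. +1
Wasn't this at +1 the last time I looked? Why would anybody downvote a correct answer? :P +1 from me.

【Solution 5】:

For a single word there is no need to write a program at all, unless this is part of something larger: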

sed -e 's/[[:space:]]/\n/g' < file.txt | grep -cw WORD
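If the single-word count does end up inside a larger C program, a rough equivalent might look like the sketch below. `count_word` is a hypothetical helper; it assumes whitespace-separated tokens of at most 255 bytes and compares them exactly, so "boy." and "boy" are different tokens.

```c
#include <stdio.h>
#include <string.h>

/* Returns how many whitespace-separated tokens in fp are exactly word.
   Assumption: tokens longer than 255 bytes are split by fscanf. */
long count_word(FILE *fp, const char *word)
{
    char token[256];
    long n = 0;

    /* %255s skips leading whitespace and reads one token at a time. */
    while (fscanf(fp, "%255s", token) == 1) {
        if (strcmp(token, word) == 0)
            n++;
    }
    return n;
}
```

Call it with a stream opened by `fopen("file.txt", "r")`, or with `stdin` to count words arriving through a pipe.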

【Discussion】:

【Solution 6】:

In Perl:

my %wordcount = ();
while(<>){map {$wordcount{$_}++} (split /\s+/)}
print "$_ = $wordcount{$_}\n" foreach sort keys %wordcount;

And in Perl Golf (just for fun):

my%w;                       
map{$w{$_}++}split/\s+/while(<>); 
print"$_=$w{$_}\n"foreach keys%w;

【Discussion】:

【Solution 7】:

Warning, untested code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct LLNode
{
    struct LLNode* Next;
    char*          Word;
    int            Count;
} LLNode;

// Allocates a node holding its own copy of word; exits if out of memory.
static LLNode* NewNode(const char* word, int count)
{
    LLNode* node = malloc(sizeof *node);
    if (node != NULL)
        node->Word = malloc(strlen(word) + 1);
    if (node == NULL || node->Word == NULL)
    {
        fprintf(stderr, "out of memory\n");
        exit(EXIT_FAILURE);
    }
    strcpy(node->Word, word);
    node->Count = count;
    node->Next = NULL;
    return node;
}

void PushWord(LLNode** list, const char* word)
{
    if (*list == NULL)
        *list = NewNode("", 0); // dummy head node, never holds a real word

    LLNode* node = *list;
    while (node->Next != NULL) // yes we are skipping the first node
    {
        node = node->Next;
        if (!strcmp(node->Word, word))
        {
            node->Count++;
            return;
        }
    }
    // Reached the end of the list without a match: append a new node.
    node->Next = NewNode(word, 1);
}

void GetCounts(const LLNode* list)
{
    if (!list)
        return;
    // yes we are skipping the first node
    for (const LLNode* node = list->Next; node != NULL; node = node->Next)
    {
        printf("Word: %s, Count: %i\n", node->Word, node->Count);
    }
}

void PushWords(LLNode** list, const char* words)
{
    size_t len = strlen(words);
    // A word can never be longer than the whole input, so this cannot overflow.
    char* buff = malloc(len + 1);
    size_t index = 0;
    if (buff == NULL)
    {
        fprintf(stderr, "out of memory\n");
        exit(EXIT_FAILURE);
    }
    // i <= len so that the terminating '\0' flushes the last word too.
    for (size_t i = 0; i <= len; i++)
    {
        char ch = words[i];
        if (ch == ' ' || ch == '\0')
        {
            if (index > 0)
            {
                buff[index] = '\0';
                PushWord(list, buff);
                index = 0;
            }
        }
        else
        {
            buff[index++] = ch;
        }
    }
    free(buff);
}

void FreeList(LLNode* list)
{
    while (list != NULL)
    {
        LLNode* next = list->Next;
        free(list->Word);
        free(list);
        list = next;
    }
}

int main(void)
{
    LLNode* list = NULL;
    PushWords(&list, "Hello world this is a hello world test bla");
    GetCounts(list);
    FreeList(list); // release our memory here
    return 0;
}

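I wrote this just now, so it may not work as-is, but it shows the general idea.

Another rough draft, this time in C++ (note: std::map lookups take O(log n) time, which is quite good):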

#include <cstring>
#include <iostream>
#include <map>
#include <string>

using namespace std;

typedef map<string, int> CountMap;

void PushWords(CountMap& list, const char* words)
{
    size_t len = strlen(words);
    string str; // the word collected so far
    for (size_t i = 0; i < len; i++)
    {
        char ch = words[i];
        if (ch != ' ')
        {
            str += ch;
        }
        else if (!str.empty())
        {
            // A space ends the current word: count it and start a new one.
            ++list[str];
            str.clear();
        }
    }

    // The last word has no trailing space, so count it here.
    if (!str.empty())
    {
        ++list[str];
    }
}

void PrintCount(const CountMap& list)
{
    CountMap::const_iterator iter = list.begin(), end = list.end();
    for (; iter != end; ++iter)
    {
        cout << iter->first << " : " << iter->second << '\n';
    }
}


int main()
{
    CountMap map;
    PushWords(map, "Hello world this is a hello world test bla");
    PrintCount(map);
}

【Discussion】:

It will take me some time to study your code. Thanks for your help, by the way. You write code faster than I can type!

【Solution 8】:

#include <cstdlib>
#include <fstream>
#include <iomanip>
#include <iostream>

using namespace std;

struct stdt
{
       char name[20] ;
       int id ;

}; //std

int main()
{
      stdt boy = {} ;
      int a = 0 ;
      ofstream TextFile ;
      cout << "Begin File Creation \n" ;
      // The records are raw structs, so open the file in binary mode.
      TextFile.open("F:\\C++ Book Chapter Program\\Ch  7\\File.txt", ios::binary );
      if ( !TextFile)
      {
           cerr <<"Error 100 Opening File.txt" ;
           exit(100);
      }//end if
      while ( a < 3 )
      {
            cout << "\nEnter Name : " ;
            // setw keeps the input from overflowing the 20-byte name buffer.
            cin  >> setw(sizeof boy.name) >> boy.name;
            cout << "\nEnter ID : " ;
            cin  >> boy.id ;
            // Write the record only after it has been filled in.
            TextFile.write( (char*) &boy , sizeof (boy) ) ;
            a++;
      }//end while

      TextFile.close();
      cout << "\nEnd File Creation" ;

      ifstream TextFile1 ;
      TextFile1.open("F:\\C++ Book Chapter Program\\Ch  7\\File.txt", ios::binary );
      // Read the records back one struct at a time and print them.
      while ( TextFile1.read( (char*) &boy , sizeof (boy) ) )
      {
            cout << "\nName : " << boy.name;
            cout << "\nID : " << boy.id ;
      }// end While

      cout << endl ;
      return 0 ;
}//end main

【Discussion】: