How to read a file and print the number of different words in C (answers)

【Question title】: How to read a file and print the number of different words in C
【Posted】: 2018-04-24 00:31:26
【Question】:

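So the first part of this code reads an arbitrary text file and prints the total number of words in it, which I understand. For the second part (the ?????? part), the number of different words has to be printed. That is not the number of unique words, i.e. words that appear only once, but the number of distinct words, i.e. the unique words plus one of each repeated word.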

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define MAX 80

typedef char string[MAX+1];

int main(void)
{
    char file[MAX], s[MAX];
    int count = 0, i, j;
    FILE *inFile;

    printf("Input file name: ");
    scanf("%79s", file);   /* file is already a char *; the width keeps the read inside file[MAX] */
    inFile = fopen(file,"r");

    if (inFile == NULL)
    {
        printf("\n\nFile does not exist or cannot be opened.\n");
        exit(1);
    }
    while (fgets(s, MAX, inFile) != NULL)
    {
        for (i = 0; s[i] != '\0'; i++)
        {
            if (s[i] == ' ')
                count++;
        }
    }


    int total= count + 1;
    printf("The total number of words in the file is: %d\n", total);

    ?
    ?
    ?
    ?
    ?
    ?
    ?

    fclose(inFile);
    int different = ?
    printf("The total number of different words in the file is: %d\n", different);
    *
    *
    *

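How do I calculate and print this?

【Comments】:

You need to keep a record of every word you have read so far. Create a structure that stores each word together with a counter; when you read a new word, look up whether it has already been seen. If it has, increment its counter; otherwise add the word to the dictionary with a counter of 1. Easy peasy. Now that I have given you the idea, try to implement it yourself.
if (s[i] == ' ') count++; What happens if the file contains " one \n"? Or " one two \n"?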
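
The concern in the last comment (leading, trailing, or repeated blanks inflating the count) can be illustrated with a small state machine that counts word starts rather than spaces. This is only a sketch: the function name count_words_chunk and the in_word flag are made up for illustration and are not part of the question's code.

```c
#include <ctype.h>

/* Count the words that start in the chunk s.
 * *in_word carries state between calls (1 while the previous chunk ended
 * inside a word), so a word split across two fgets() buffers is counted once.
 * Initialise *in_word to 0 before the first call. */
long count_words_chunk(const char *s, int *in_word)
{
    long words = 0;

    for (; *s != '\0'; s++) {
        if (isspace((unsigned char)*s)) {
            *in_word = 0;               /* any whitespace ends the current word */
        } else if (!*in_word) {
            *in_word = 1;               /* first non-space character: a new word */
            words++;
        }
    }
    return words;
}
```

With the question's variables, the loop body would become something like count += count_words_chunk(s, &in_word); with int in_word = 0; declared before the while loop, and the final count + 1 adjustment would no longer be needed.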

Tags: c string file

【Solution 1】:

Use the concept of a HashSet.
Put each word you read into a HashSet, then count the number of values in the HashSet.
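
C has no built-in HashSet, so the idea has to be hand-rolled. The following is a minimal sketch under stated assumptions: a fixed-capacity table with linear probing, where the names WordSet, wordset_add, SET_SLOTS, and WORD_MAX are all hypothetical and the capacity is assumed to be larger than the number of distinct words in the file.

```c
#include <stddef.h>
#include <string.h>

#define SET_SLOTS 1024   /* assumed capacity, chosen arbitrarily for the sketch */
#define WORD_MAX  80     /* assumed maximum word length */

typedef struct {
    char slots[SET_SLOTS][WORD_MAX + 1];   /* an empty string marks a free slot */
    size_t count;                          /* number of distinct words stored */
} WordSet;

/* djb2-style string hash */
static unsigned long hash_word(const char *s)
{
    unsigned long h = 5381;
    while (*s != '\0')
        h = h * 33 + (unsigned char)*s++;
    return h;
}

/* Add word to the set. The set must start out zero-initialised.
 * Returns 1 if the word was new, 0 if it was already present,
 * and -1 if the word is empty, too long, or the table is full. */
int wordset_add(WordSet *set, const char *word)
{
    size_t len = strlen(word);
    if (len == 0 || len > WORD_MAX)
        return -1;

    size_t i = hash_word(word) % SET_SLOTS;
    for (size_t probes = 0; probes < SET_SLOTS; probes++) {
        if (set->slots[i][0] == '\0') {          /* free slot: word is new */
            strcpy(set->slots[i], word);
            set->count++;
            return 1;
        }
        if (strcmp(set->slots[i], word) == 0)    /* already stored */
            return 0;
        i = (i + 1) % SET_SLOTS;                 /* linear probing */
    }
    return -1;                                   /* every slot is occupied */
}
```

After every word in the file has been passed to wordset_add, set.count would hold the number of different words. Declaring the set as static WordSet set; (or WordSet set = {0};) gives the required zero-initialised state.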

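【Comments】:

This is at best a comment (which you can post once you reach 100 rep). It is not an answer to the question, and it proposes a hash table solution that is well beyond the scope of the assignment at hand. I won't ding you this time, but please avoid posting comments as answers.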

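【Solution 2】:

You have to read the file word by word. See this answer.

As you read the words, you need to store them in an array.

When you are done, the size of the array (the number of non-empty slots) is the word count.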
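
A rough sketch of that reading step, assuming whitespace-separated words of at most 80 characters and a caller-supplied array; the names read_words and WORD_LEN are invented here, and this is not necessarily how the linked answer does it:

```c
#include <stddef.h>
#include <stdio.h>

#define WORD_LEN 80   /* assumed maximum word length; must match the %80s width below */

/* Read whitespace-separated words from in into words[0..max_words-1].
 * Returns the number of words stored, which is also the word count
 * as long as the file holds no more than max_words words. */
size_t read_words(FILE *in, char words[][WORD_LEN + 1], size_t max_words)
{
    size_t n = 0;

    /* %80s skips leading whitespace and stores at most 80 characters plus '\0' */
    while (n < max_words && fscanf(in, "%80s", words[n]) == 1)
        n++;
    return n;
}
```

A token longer than 80 characters would be split into several entries, and punctuation stays attached to the words, so this is only suitable for simple input.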

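Now the number of unique words is a bit trickier: you have to iterate over the array with nested loops (one word at a time), use strcmp to compare each word with the other words in the array, and count how many times each word appears in the list. You also have to watch out for repeated words.

Example for the last part: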

char** words = ...; /* Assuming you have read the words into this */
int word, number_of_words = ...; /* Assuming you have number of words */
for (word = 0; word < number_of_words; word++) {
    int i = word + 1;
    unsigned wc = 0;
    while (i < number_of_words) {
        if (strcmp(words[i], words[word]) == 0) {
            wc++;
        }
        i++;
    }
    printf ("Count of \"%s\" is: %u\n", words[word], wc);
}

The example above does not account for duplicate counting, so you will have to handle that yourself.
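
One possible way to handle the duplicates, sketched under the same assumption that the words are already in an array: count a word only at its first occurrence, i.e. only when no earlier entry compares equal. The function name count_distinct is made up for this sketch.

```c
#include <stddef.h>
#include <string.h>

/* Return the number of different words in words[0..number_of_words-1].
 * A word contributes to the total only the first time it appears. */
size_t count_distinct(char **words, size_t number_of_words)
{
    size_t distinct = 0;

    for (size_t word = 0; word < number_of_words; word++) {
        int seen_before = 0;

        /* look only at the entries before the current one */
        for (size_t i = 0; i < word; i++) {
            if (strcmp(words[i], words[word]) == 0) {
                seen_before = 1;
                break;
            }
        }
        if (!seen_before)
            distinct++;
    }
    return distinct;
}
```

This is quadratic in the number of words, which is usually acceptable for a small assignment file; a hash table or a sorted array would scale better.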

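【Comments】:

The only problem with the "answer" you reference, which reads words with fscanf, is that you have to process each word further to avoid capturing possessives and plural possessives as words (e.g. "smac89's" answer). This tokenization solution will face a similar challenge.
@DavidC.Rankin I assume the OP's problem is less complex and will not have to deal with such input. I assume the file he is reading consists of simple words with no punctuation beyond basic commas or periods, and I expect the OP to be able to handle all of that. The problem does not seem complicated; it is exactly the kind of grunt work that most programming courses make students do.
I agree with you. Either way, whether it is fscanf per word, fgets plus strtok, or plain fgetc with a few test clauses, they all need some extra tweaking to handle the apostrophe-s case.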
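
As an illustration of the kind of extra tweaking discussed above, here is a hedged sketch of a per-word cleanup step. The name normalize_word and the exact rules (trim punctuation at both ends, then drop a trailing 's) are assumptions made for this example, not something the commenters specified.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Clean up word in place:
 *   1. trim punctuation from both ends (commas, periods, quotes, ...),
 *   2. drop a trailing possessive "'s".
 * A plural possessive such as "dogs'" loses its apostrophe in step 1. */
void normalize_word(char *word)
{
    size_t len = strlen(word);
    size_t start = 0;

    while (len > 0 && ispunct((unsigned char)word[len - 1]))
        len--;                                   /* trailing punctuation */
    while (start < len && ispunct((unsigned char)word[start]))
        start++;                                 /* leading punctuation */

    memmove(word, word + start, len - start);
    len -= start;
    word[len] = '\0';

    if (len >= 2 && word[len - 2] == '\'' &&
        (word[len - 1] == 's' || word[len - 1] == 'S'))
        word[len - 2] = '\0';                    /* "smac89's" becomes "smac89" */
}
```

Note that this simple rule also turns contractions such as "it's" into "it", which may or may not be what the assignment wants.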