Why is my program not outputting the right count of words? Answers

【Question title】: Why is my program not outputting the right count of words?
【Posted】: 2014-12-03 03:03:41
【Question description】:

#include <stdio.h>
#include <pthread.h>
#include <stdlib.h>

struct thread_data {
  FILE *fp;
  long int offset;
  int start;
  int blockSize;
  //struct word maybe?
};

int words = 0;

void *countFrequency(void* data) {
  struct thread_data* td = data;
  char *buffer = malloc(td->blockSize);

  int i, c;
  i = 0; c = 0;
  enum states { WHITESPACE, WORD };
  int state = WHITESPACE;

  fseek(td->fp, td->offset, td->start);

  char last = ' ';
  while ((fread(buffer, td->blockSize, 1, td->fp)) == 1) {
    if (buffer[0]== ' ' || buffer[0] == '\t') {
      state = WHITESPACE;
    } else if (buffer[0] == '\n') {
      //newLine++;
      state = WHITESPACE;
    } else {
      if (state == WHITESPACE) {
        words++;
      }
      state = WORD;
    }
    last = buffer[0];
  }

  free(buffer);
  pthread_exit(NULL);
  return NULL;
}

int main(int argc, char **argv) {

  int nthreads, x, id, blockSize, len;
  //void *state;
  FILE *fp;
  pthread_t *threads;

  fp = fopen("file1.txt", "r");

  printf("Enter the number of threads: ");
  scanf("%d", &nthreads);
  struct thread_data data[nthreads];
  threads = malloc(nthreads * sizeof(pthread_t));

  fseek(fp, 0, SEEK_END);
  len = ftell(fp);
  printf("len= %d\n", len);

  blockSize = (len + nthreads - 1) / nthreads;
  printf("size= %d\n", blockSize);

  for (id = 0; id < nthreads; id++) {
    data[id].fp = fp;
    data[id].offset = blockSize;
    data[id].start = id * blockSize + 1;
    //maybe data[id]. word struct
  }
  //LAST THREAD
  data[nthreads-1].start=(nthreads-1)*blockSize+1;

  for (id = 0; id < nthreads; id++)
    pthread_create(&threads[id], NULL, &countFrequency,&data[id]);

  for (id = 0; id < nthreads; id++)
    pthread_join(threads[id],NULL);

  fclose(fp);

  printf("%d\n",words);
  return 0;
}

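I fixed a segmentation fault in this program, but now when I run it I get 0 words, which is wrong: the text file contains roughly a million words.

Can anyone tell me why it gives me an incorrect word count?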

【Question comments】:

Tags: c struct pthreads

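【Solution 1】:

One problem you have is that every countFrequency thread uses the same file descriptor (the single shared `FILE *fp`). Each thread does one fseek and then tries to read in a loop, but all of them are moving the same file position. The last fseek wins.

This design flaw has to be fixed first.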
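One way to remove the shared file position is to let each thread open its own stream. The following is a minimal sketch under that assumption, not the asker's final code: `struct chunk`, `count_chunk` and `count_words_parallel` are hypothetical names introduced here. It also returns per-thread counts instead of incrementing a shared global, and it peeks at the byte before each range so a word that straddles a boundary is counted only once. Offsets are `long`, so a file over 2GB would need a 64-bit `long` or a switch to `fseeko`.

```c
#include <assert.h>
#include <ctype.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical per-thread work item (not the asker's struct). */
struct chunk {
  const char *path; /* each thread opens this file itself */
  long start;       /* first byte of the thread's range */
  long length;      /* number of bytes in the range */
  long words;       /* result: words that begin inside the range */
};

static void *count_chunk(void *arg) {
  struct chunk *ck = arg;
  ck->words = 0;

  FILE *fp = fopen(ck->path, "rb"); /* private stream, private position */
  if (fp == NULL)
    return NULL;

  /* Seek with SEEK_SET (absolute offset). Start one byte early when
     possible so we know whether the range begins in the middle of a word. */
  long seek_to = ck->start > 0 ? ck->start - 1 : 0;
  if (fseek(fp, seek_to, SEEK_SET) != 0) {
    fclose(fp);
    return NULL;
  }
  int in_word = 0;
  if (ck->start > 0) {
    int prev = fgetc(fp);
    in_word = (prev != EOF && !isspace(prev));
  }

  for (long i = 0; i < ck->length; i++) {
    int c = fgetc(fp);
    if (c == EOF)
      break; /* the last range may extend past the end of the file */
    if (isspace(c)) {
      in_word = 0;
    } else {
      if (!in_word)
        ck->words++;
      in_word = 1;
    }
  }
  fclose(fp);
  return NULL;
}

/* Returns the total word count, or -1 on error. */
long count_words_parallel(const char *path, int nthreads) {
  assert(nthreads > 0);
  FILE *fp = fopen(path, "rb");
  if (fp == NULL)
    return -1;
  fseek(fp, 0, SEEK_END);
  long len = ftell(fp);
  fclose(fp);
  if (len < 0)
    return -1;

  long block = (len + nthreads - 1) / nthreads; /* ceiling division */
  pthread_t *threads = malloc(nthreads * sizeof *threads);
  struct chunk *chunks = malloc(nthreads * sizeof *chunks);
  if (threads == NULL || chunks == NULL) {
    free(threads);
    free(chunks);
    return -1;
  }

  int started = 0;
  for (int i = 0; i < nthreads; i++) {
    chunks[i] = (struct chunk){path, i * block, block, 0};
    if (pthread_create(&threads[i], NULL, count_chunk, &chunks[i]) != 0)
      break;
    started++;
  }
  long total = 0;
  for (int i = 0; i < started; i++) {
    pthread_join(threads[i], NULL);
    total += chunks[i].words; /* summed after join, so no lock is needed */
  }
  if (started != nthreads)
    total = -1;
  free(threads);
  free(chunks);
  return total;
}
```

Calling `count_words_parallel("file1.txt", nthreads)` from `main` would replace the global `words` counter. Build with `-pthread`.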

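【Comments】:

@jgabb Your fseek in countFrequency() is incorrect... check the third parameter in the man page.
As for the file descriptor in countFrequency: it points to the same file, so shouldn't it be the same file descriptor?
@jgabb Imagine 3 threads... the first starts at offset 1, the second at offset 3001, and the third at offset 6001. Each thread reads 3000 bytes. Thread 1 does fseek() to offset 1, but before it can read, thread 2 takes over. Thread 2 does fseek() to offset 3001, but before it can read, thread 3 takes over. Thread 3 does fseek() to 6001... Now thread 1 gets control again and thinks it is at offset 1, when it is actually at offset 6001... Do you see the problem? You need to synchronize access, or read the whole file into memory and process it that way.
In other words, you are treating the file descriptor as a "per-thread" resource, when it is actually a "per-process" resource.
Ah, I was under the impression that each thread had its own instance of fseek (if that makes sense, I don't know how to word it) and that it wouldn't affect the others at all. I can't read the whole file into memory because it is supposed to be a huge file (> 2GB), so I was trying to read it in parts. I suppose I could set up a mutex. Would that work?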
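On the mutex question: a mutex can work as long as the seek and the read that follows it happen under the same lock, so no other thread can move the position in between. Here is a rough sketch of that idea, assuming one shared `FILE *` and fixed-size buffers (so the whole file never has to sit in memory). `read_at`, `struct range` and `count_range` are hypothetical names, not from the original post. The fseek call uses `SEEK_SET` as its third argument, which is what the man-page comment above is pointing at.

```c
#include <assert.h>
#include <ctype.h>
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t file_lock = PTHREAD_MUTEX_INITIALIZER;

/* Read up to cap bytes at absolute offset pos from the shared stream.
   fseek and fread run under one lock, so the pair is atomic with
   respect to the other threads. Returns the number of bytes read. */
static size_t read_at(FILE *fp, long pos, char *buf, size_t cap) {
  size_t got = 0;
  pthread_mutex_lock(&file_lock);
  if (fseek(fp, pos, SEEK_SET) == 0) /* third argument is the origin */
    got = fread(buf, 1, cap, fp);
  pthread_mutex_unlock(&file_lock);
  return got;
}

/* Hypothetical per-thread work item: count words in [start, end). */
struct range {
  FILE *fp;   /* shared stream, only touched through read_at */
  long start;
  long end;
  long words; /* result, private to this thread */
};

static void *count_range(void *arg) {
  struct range *r = arg;
  assert(r->start <= r->end);
  char buf[4096];
  int in_word = 0;
  r->words = 0;

  /* If the byte just before the range is part of a word, the range
     starts mid-word and that word belongs to the previous thread. */
  if (r->start > 0 && read_at(r->fp, r->start - 1, buf, 1) == 1)
    in_word = !isspace((unsigned char)buf[0]);

  long pos = r->start; /* each thread tracks its own position */
  while (pos < r->end) {
    size_t want = sizeof buf;
    if ((long)want > r->end - pos)
      want = (size_t)(r->end - pos);
    size_t got = read_at(r->fp, pos, buf, want);
    if (got == 0)
      break; /* end of file or read error */
    for (size_t i = 0; i < got; i++) {
      if (isspace((unsigned char)buf[i])) {
        in_word = 0;
      } else {
        if (!in_word)
          r->words++;
        in_word = 1;
      }
    }
    pos += (long)got;
  }
  return NULL;
}
```

The lock serializes all file access, so the threads only overlap while scanning their buffers. On POSIX systems `pread` is another option: it takes the offset as an argument and does not change the file position, so no lock is needed around it.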