【Question title】: How can I write the name of the text file before the frequency of each word?
【Posted】: 2018-11-16 06:25:57
【Question】:

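How can I include the text file name with each word's frequency, so that every entry shows the file name first and then the frequency of the word in that file? For example: { 'like': ['file1', 2, 'file2', 4] }. Here like is a word that both files contain, and I want file1 and file2 written before their frequencies. It should work for any number of files.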

Here is my code:

file_list = [open(file, 'r') for file in files]  # files is assumed to be a list of file paths
num_files = len(file_list)
wordFreq = {}
for i, f in enumerate(file_list):
    for line in f:
        for word in line.lower().split():
            if word not in wordFreq:
                # one counter slot per file, indexed by the file's position
                wordFreq[word] = [0 for _ in range(num_files)]
            wordFreq[word][i] += 1

【Comments】:

    Tags: python python-3.x dictionary frequency word-frequency


    【Answer 1】:

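    I know my code is not very pretty, and it is not exactly what you asked for, but it is a solution. I prefer using dictionaries rather than a list structure like ['file1', 2, 'file2', 4].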

    Let's define two files as an example:

    file1.txt:

    this is an example
    

    file2.txt:

    this is an example
    but multi line example
    

    The solution is as follows:

    from collections import Counter
    
    filenames = ["file1.txt", "file2.txt"]
    
    # First, find word frequencies in files
    file_dict = {}
    for filename in filenames:
        with open(filename) as f:
            text = f.read()
        words = text.split()
    
        # Counter tallies every word in a single pass
        file_dict[filename] = dict(Counter(words))
    
    print("file_dict: ", file_dict)
    
    # Then, collect the frequency in each file for every word
    word_dict = {}
    for filename, words in file_dict.items():
        for word, count in words.items():
            if word not in word_dict.keys():
                word_dict[word] = {filename: count}
            else:
                if filename not in word_dict[word].keys():
                    word_dict[word][filename] = count    
                else:
                    word_dict[word][filename] += count
    
    
    print("word_dict: ", word_dict)
    

    Output:

    file_dict:  {'file1.txt': {'this': 1, 'is': 1, 'an': 1, 'example': 1}, 'file2.txt': {'this': 1, 'is': 1, 'an': 1, 'example': 2, 'but': 1, 'multi': 1, 'line': 1}}
    word_dict:  {'this': {'file1.txt': 1, 'file2.txt': 1}, 'is': {'file1.txt': 1, 'file2.txt': 1}, 'an': {'file1.txt': 1, 'file2.txt': 1}, 'example': {'file1.txt': 1, 'file2.txt': 2}, 'but': {'file2.txt': 1}, 'multi': {'file2.txt': 1}, 'line': {'file2.txt': 1}}
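
    If the flat list layout from the question is still wanted, a nested dictionary shaped like word_dict can be converted afterwards. The following is only a sketch: flatten_counts and flat_dict are hypothetical names introduced here, and the sample data is a shortened copy of the output above.

```python
# Nested mapping in the shape Answer 1 produces: word -> {filename: count}.
word_dict = {
    "this": {"file1.txt": 1, "file2.txt": 1},
    "example": {"file1.txt": 1, "file2.txt": 2},
    "but": {"file2.txt": 1},
}


def flatten_counts(per_file):
    """Turn {filename: count} into [filename, count, filename, count, ...]."""
    flat = []
    for filename, count in per_file.items():
        flat.extend([filename, count])
    return flat


# hypothetical name: same keys as word_dict, but with flat lists as values
flat_dict = {word: flatten_counts(per_file) for word, per_file in word_dict.items()}

print(flat_dict["example"])  # ['file1.txt', 1, 'file2.txt', 2]
```

    The nested dictionary is usually easier to query (word_dict['example']['file2.txt']), so the flat list is probably best kept as a final presentation step.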
    

    【Comments】:

      【Answer 2】:

      This is a good use case for collections.Counter; I suggest making one counter per file.

      from collections import Counter
      
      def make_counter(filename):
          cnt = Counter()
      
          with open(filename) as f:
              for line in f:                # read line by line, which scales better for big files
                  cnt.update(line.split())  # split the line on whitespace and update the word counts
      
          print(filename, cnt)
          return cnt
      

      This function can be called for each file to build a dict that holds all of the counters:

      filename_list = ['f1.txt', 'f2.txt', 'f3.txt']
      counter_dict = {                      # this will hold a counter for each file
          fn: make_counter(fn)
          for fn in filename_list}
      

      Now a set can be used to collect all the distinct words that appear in the files:

      all_words = set(                      # this will hold all different words that appear
          word                              # in any of the files
          for cnt in counter_dict.values()
          for word in cnt.keys())
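
      As a possible shorthand for the generator expression, set().union accepts any number of iterables, and iterating over a Counter yields its keys. The sketch below builds the counters from in-memory strings instead of files, purely for illustration.

```python
from collections import Counter

# Stand-in counters; in the answer these come from make_counter(filename).
counter_dict = {
    "f1.txt": Counter("this is an example".split()),
    "f2.txt": Counter("but multi line example".split()),
}

# Iterating a Counter yields its keys, so union() collects every distinct word.
all_words = set().union(*counter_dict.values())

print(sorted(all_words))
# ['an', 'but', 'example', 'is', 'line', 'multi', 'this']
```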
      

      These lines print each word together with its count in each file:

      for word in sorted(all_words):
          print(word)
          for fn in filename_list:
              print('  {}: {}'.format(fn, counter_dict[fn][word]))
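
      The loop never checks whether a word actually occurs in a given file. That is safe because a Counter returns 0 for a missing key instead of raising KeyError, and the lookup does not add the key. A small illustration with made-up input:

```python
from collections import Counter

cnt = Counter("this is an example".split())

present = cnt["example"]      # word that occurs once
absent = cnt["missing"]       # word that never occurs: no KeyError, just 0
still_absent = "missing" in cnt  # the lookup above did not insert the key

print(present, absent, still_absent)  # 1 0 False
```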
      

      Obviously you can adapt the printing to your specific needs, but this approach should give you the flexibility you need.


      If you would rather have a dict with all words as keys and their per-file counts as values, you could try something like this:

      all_words = {}
      
      for fn, cnt in counter_dict.items():
          for word, n in cnt.items():
              all_words.setdefault(word, {}).setdefault(fn, 0)
              all_words[word][fn] += n      # add this file's count for the word
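
      To try the whole idea end to end without preparing files by hand, the sketch below writes two throwaway files into a temporary directory and builds the word -> {file: count} mapping in one function. The name word_file_counts is hypothetical, the sample texts mirror the ones from Answer 1, and words are lower-cased as in the question's code.

```python
import os
import tempfile
from collections import Counter


def word_file_counts(paths):
    """Return {word: {path: count}} for the given text files."""
    result = {}
    for path in paths:
        counts = Counter()
        with open(path) as f:
            for line in f:
                # lower-case so that 'Like' and 'like' count as the same word
                counts.update(line.lower().split())
        for word, n in counts.items():
            result.setdefault(word, {})[path] = n
    return result


# Demo with two throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    samples = {
        "file1.txt": "this is an example\n",
        "file2.txt": "this is an example\nbut multi line example\n",
    }
    paths = []
    for name, text in samples.items():
        path = os.path.join(tmp, name)
        with open(path, "w") as f:
            f.write(text)
        paths.append(path)

    counts = word_file_counts(paths)
    # strip the temporary directory from the keys for readable output
    by_name = {os.path.basename(p): n for p, n in counts["example"].items()}
    print(by_name)  # {'file1.txt': 1, 'file2.txt': 2}
```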
      

      【Comments】:
