【Question Title】: Extract items from n-line chunks in a file, count frequency of items for each chunk, Python
【Posted】: 2011-08-23 14:01:32
【Question Description】:

I have a text file made up of 5-line chunks of tab-delimited lines:

 1 \t DESCRIPTION \t SENTENCE \t ITEMS

 1 \t DESCRIPTION \t SENTENCE \t ITEMS

 1 \t DESCRIPTION \t SENTENCE \t ITEMS

 1 \t DESCRIPTION \t SENTENCE \t ITEMS

 1 \t DESCRIPTION \t SENTENCE \t ITEMS

 2 \t DESCRIPTION \t SENTENCE \t ITEMS

 2 \t DESCRIPTION \t SENTENCE \t ITEMS

 2 \t DESCRIPTION \t SENTENCE \t ITEMS

 2 \t DESCRIPTION \t SENTENCE \t ITEMS

 2 \t DESCRIPTION \t SENTENCE \t ITEMS

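and so on.

Within each chunk, the DESCRIPTION and SENTENCE columns are identical. The data of interest is in the ITEMS column, which differs for each line in the chunk and has the following format:

word1, word2, word3

...and so on.

For each 5-line chunk, I need to count how often word1, word2, etc. occur in ITEMS. For example, if the first 5-line chunk were as follows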

 1 \t DESCRIPTION \t SENTENCE \t word1, word2, word3

 1 \t DESCRIPTION \t SENTENCE \t word1, word2

 1 \t DESCRIPTION \t SENTENCE \t word4

 1 \t DESCRIPTION \t SENTENCE \t word1, word2, word3

 1 \t DESCRIPTION \t SENTENCE \t word1, word2

then the correct output for this 5-line chunk would be

1, SENTENCE, (word1: 4, word2: 4, word3: 2, word4: 1)

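That is, the chunk number followed by the sentence, then the frequency counts of the words.

I have some code that extracts the five-line chunks, and some code that counts word frequencies within a chunk once it has been extracted, but I'm stuck on the task of isolating each chunk, getting its word frequencies, moving on to the next one, and so on.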

from itertools import groupby 

def GetFrequencies(file):
    file_contents = open(file).readlines()  #file as list
    """use zip to get the entire file as list of 5-line chunk tuples""" 
    five_line_increments = zip(*[iter(file_contents)]*5) 
    for chunk in five_line_increments:  #for each 5-line chunk... 
        for sentence in chunk:          #...and for each sentence in that chunk
            words = sentence.split('\t')[3].split() #get the ITEMS column at index 3
            words_no_comma = [x.strip(',') for x in words]  #get rid of the commas
            words_no_ws = [x.strip(' ')for x in words_no_comma] #get rid of the whitespace resulting from the removed commas


       """STUCK HERE   The idea originally was to take the words lists for 
       each chunk and combine them to create a big list, 'collection,' and
       feed this into the for-loop below."""





    for key, group in groupby(collection): #collection is a big list containing all of the words in the ITEMS section of the chunk, e.g, ['word1', 'word2', word3', 'word1', 'word1', 'word2', etc.]
        print key,len(list(group)),    

【Question Discussion】:

    Tags: python text-processing


    【Solution 1】:

    Using Python 2.7:

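    #!/usr/bin/env python

    import collections

    chunks = {}

    with open('input') as fd:
        for line in fd:
            # split() splits on any whitespace, so this assumes that
            # DESCRIPTION and SENTENCE are single tokens
            line = line.split()
            if not line:
                continue  # skip blank lines
            # First line of a chunk: store the sentence as element 0
            if line[0] not in chunks:
                chunks[line[0]] = [line[2]]
            # Collect the items of every line, including the chunk's first line
            for i in line[3:]:
                chunks[line[0]].append(i.replace(',', ''))

    for k, v in chunks.iteritems():
        counter = collections.Counter(v[1:])  # v[0] is the sentence
        print k, v[0], counter


    Output:

    1 SENTENCE Counter({'word1': 4, 'word2': 4, 'word3': 2, 'word4': 1})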
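For comparison, here is a minimal Python 3 sketch of the same grouping idea that splits on tabs rather than on any whitespace, so a multi-word SENTENCE stays intact, and prints in the format requested in the question. The names `count_chunks` and `format_chunk` are hypothetical, and the column positions (chunk number at index 0, sentence at index 2, items at index 3) are assumed from the sample data.

```python
from collections import Counter, OrderedDict


def count_chunks(lines):
    """Group tab-delimited lines by chunk id (column 0) and count ITEMS words."""
    chunks = OrderedDict()  # chunk id -> (sentence, Counter), in file order
    for raw in lines:
        if not raw.strip():
            continue  # skip blank lines
        fields = [f.strip() for f in raw.rstrip("\n").split("\t")]
        chunk_id, sentence, items = fields[0], fields[2], fields[3]
        if chunk_id not in chunks:
            chunks[chunk_id] = (sentence, Counter())
        words = [w.strip() for w in items.split(",") if w.strip()]
        chunks[chunk_id][1].update(words)
    return chunks


def format_chunk(chunk_id, sentence, counter):
    """Render one chunk as 'id, sentence, (word: n, ...)', words sorted by name."""
    counts = ", ".join("%s: %d" % (w, counter[w]) for w in sorted(counter))
    return "%s, %s, (%s)" % (chunk_id, sentence, counts)


# Sample chunk taken from the question
sample = [
    " 1 \t DESCRIPTION \t SENTENCE \t word1, word2, word3\n",
    " 1 \t DESCRIPTION \t SENTENCE \t word1, word2\n",
    " 1 \t DESCRIPTION \t SENTENCE \t word4\n",
    " 1 \t DESCRIPTION \t SENTENCE \t word1, word2, word3\n",
    " 1 \t DESCRIPTION \t SENTENCE \t word1, word2\n",
]

for cid, (sent, counter) in count_chunks(sample).items():
    print(format_chunk(cid, sent, counter))
# prints: 1, SENTENCE, (word1: 4, word2: 4, word3: 2, word4: 1)
```

Because lines are grouped by the value in the first column, this does not depend on every chunk having exactly five lines.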
    

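    【Discussion】:

    • I can't upgrade to 2.7 because of time constraints, but this is a nice piece of code.
    【Solution 2】:

    The standard library has a csv parser that can handle splitting the input for you: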

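    import csv
    import collections

    def GetFrequencies(file_in):
        sentences = dict()  # chunk number -> Counter of its items
        # csv.reader is not a context manager, so wrap the file object instead
        with open(file_in, 'rb') as f:
            csv_file = csv.reader(f, delimiter='\t')
            for line in csv_file:
                if not line:
                    continue  # skip blank lines
                sentence = line[0]
                if sentence not in sentences:
                    sentences[sentence] = collections.Counter()
                sentences[sentence].update([x.strip(' ') for x in line[3].split(',')])
        return sentences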
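As a rough illustration of the same `csv` approach under Python 3, where the file would be opened in text mode with `newline=""` rather than `'rb'`, the sketch below reads from an in-memory `io.StringIO` instead of a real file. The function name `get_frequencies` and the sample rows are assumptions made for the example.

```python
import csv
import io
from collections import Counter


def get_frequencies(rows):
    """Map chunk id -> Counter of ITEMS words, given rows from csv.reader."""
    frequencies = {}
    for row in rows:
        if len(row) < 4:
            continue  # skip blank or malformed lines
        chunk_id = row[0].strip()
        words = [w.strip() for w in row[3].split(",") if w.strip()]
        frequencies.setdefault(chunk_id, Counter()).update(words)
    return frequencies


# In-memory stand-in for a file opened with open(path, newline="")
data = io.StringIO(
    "1\tDESCRIPTION\tSENTENCE\tword1, word2, word3\n"
    "1\tDESCRIPTION\tSENTENCE\tword1, word2\n"
    "2\tDESCRIPTION\tSENTENCE\tword4\n"
)
reader = csv.reader(data, delimiter="\t")
freqs = get_frequencies(reader)

for chunk_id, counter in freqs.items():
    print(chunk_id, dict(counter))
```

Passing the reader's rows into the function keeps the counting logic independent of where the lines come from, which makes it easy to try on a few lines before pointing it at the real file.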
    

    【Discussion】:

      【Solution 3】:

      I edited your code slightly; I think it does what you want:

      file_contents = open(file).readlines()  #file as list
      """use zip to get the entire file as list of 5-line chunk tuples""" 
      five_line_increments = zip(*[iter(file_contents)]*5) 
      for chunk in five_line_increments:  #for each 5-line chunk...
          word_freq = {} #word frequencies for each chunk
          for sentence in chunk:          #...and for each sentence in that chunk
              words = "".join(sentence.split('\t')[3]).strip('\n').split(', ') #get the ITEMS column at index 3 and put them in list
              for word in words:
                  if word not in word_freq:
                      word_freq[word] = 1
                  else:
                      word_freq[word] += 1
      
      
          print word_freq
      

      Output:

      {'word4': 1, 'word1': 4, 'word3': 2, 'word2': 4}
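The `zip(*[iter(file_contents)]*5)` idiom used above can be hard to read at first, so here is a small Python 3 sketch of how it groups items. The helper name `chunked` is hypothetical. Note that the idiom silently drops a trailing group shorter than the chunk size, and it depends on the input having no blank lines between rows.

```python
def chunked(items, size):
    """Return consecutive tuples of `size` items; a short trailing group is dropped."""
    iterator = iter(items)
    # The same iterator object is repeated `size` times, so zip pulls
    # `size` consecutive items from it for each output tuple.
    return zip(*[iterator] * size)


lines = ["a", "b", "c", "d", "e", "f", "g"]
print(list(chunked(lines, 3)))
# prints: [('a', 'b', 'c'), ('d', 'e', 'f')]
```

If an incomplete last chunk needs to be kept, `itertools.zip_longest` can be applied to the same list of repeated iterators, padding the final tuple with a fill value.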
      

      【Discussion】:

        【Solution 4】:

        To summarize: you want to append every "word" to collection unless it is "DESCRIPTION" or "SENTENCE"? Try this:

        for word in words_no_ws:
            if word not in ("DESCRIPTION", "SENTENCE"):
                collection.append(word)
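One thing to keep in mind if `collection` is then fed to the `groupby` loop from the question: `itertools.groupby` only groups adjacent equal items, so the list would need to be sorted first, and `collection` would need to be reset to an empty list at the start of each chunk. A minimal Python 3 sketch with made-up sample words:

```python
from itertools import groupby

collection = ["word1", "word2", "word3", "word1", "word2", "word4"]

# Without sorting, equal words that are not adjacent land in separate groups.
unsorted_counts = [(key, len(list(group))) for key, group in groupby(collection)]

# Sorting first makes equal words adjacent, so each word yields a single group.
sorted_counts = [(key, len(list(group))) for key, group in groupby(sorted(collection))]

print(unsorted_counts)
# prints: [('word1', 1), ('word2', 1), ('word3', 1), ('word1', 1), ('word2', 1), ('word4', 1)]
print(sorted_counts)
# prints: [('word1', 2), ('word2', 2), ('word3', 1), ('word4', 1)]
```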
        

        【Discussion】:
