[Posted]: 2011-08-23 14:01:32
[Question]:
I have a text file made up of chunks of 5 tab-delimited lines each:
1 \t DESCRIPTION \t SENTENCE \t ITEMS
1 \t DESCRIPTION \t SENTENCE \t ITEMS
1 \t DESCRIPTION \t SENTENCE \t ITEMS
1 \t DESCRIPTION \t SENTENCE \t ITEMS
1 \t DESCRIPTION \t SENTENCE \t ITEMS
2 \t DESCRIPTION \t SENTENCE \t ITEMS
2 \t DESCRIPTION \t SENTENCE \t ITEMS
2 \t DESCRIPTION \t SENTENCE \t ITEMS
2 \t DESCRIPTION \t SENTENCE \t ITEMS
2 \t DESCRIPTION \t SENTENCE \t ITEMS
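and so on.
Within each chunk, the DESCRIPTION and SENTENCE columns are identical. The data of interest is in the ITEMS column, which differs for every line in the chunk and has the following format:
word1, word2, word3
...and so on.
For each 5-line chunk, I need to count how often word1, word2, etc. occur in ITEMS. For example, if the first 5-line chunk is as follows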
1 \t DESCRIPTION \t SENTENCE \t word1, word2, word3
1 \t DESCRIPTION \t SENTENCE \t word1, word2
1 \t DESCRIPTION \t SENTENCE \t word4
1 \t DESCRIPTION \t SENTENCE \t word1, word2, word3
1 \t DESCRIPTION \t SENTENCE \t word1, word2
then the correct output for this 5-line chunk would be
1, SENTENCE, (word1: 4, word2: 4, word3: 2, word4: 1)
That is, the chunk number followed by the sentence, then the frequency counts of the words.
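The counting step for a single chunk can be sketched as follows. This is only a minimal illustration, assuming Python 3 and `collections.Counter`; the function names `chunk_summary` and `format_summary` are hypothetical, and words are printed in alphabetical order because the post does not say what order is required.

```python
from collections import Counter

def chunk_summary(chunk_lines):
    """Return (chunk number, sentence, Counter of ITEMS words) for one chunk."""
    counts = Counter()
    number = sentence = None
    for line in chunk_lines:
        # Split on tabs and strip the padding around each field.
        fields = [f.strip() for f in line.rstrip("\n").split("\t")]
        number, sentence = fields[0], fields[2]
        # ITEMS is a comma-separated list; ignore empty entries.
        counts.update(w.strip() for w in fields[3].split(",") if w.strip())
    return number, sentence, counts

def format_summary(number, sentence, counts):
    """Format as 'N, SENTENCE, (word: count, ...)' with words sorted."""
    pairs = ", ".join(f"{w}: {n}" for w, n in sorted(counts.items()))
    return f"{number}, {sentence}, ({pairs})"

example = [
    "1 \t DESCRIPTION \t SENTENCE \t word1, word2, word3",
    "1 \t DESCRIPTION \t SENTENCE \t word1, word2",
    "1 \t DESCRIPTION \t SENTENCE \t word4",
    "1 \t DESCRIPTION \t SENTENCE \t word1, word2, word3",
    "1 \t DESCRIPTION \t SENTENCE \t word1, word2",
]
print(format_summary(*chunk_summary(example)))
# prints: 1, SENTENCE, (word1: 4, word2: 4, word3: 2, word4: 1)
```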
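I have some code that extracts the five-line chunks, and code that counts the word frequencies in a chunk once it has been extracted, but I am stuck on putting the two together: isolating each chunk, getting its word frequencies, moving on to the next chunk, and so on.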
from itertools import groupby

def GetFrequencies(file):
    file_contents = open(file).readlines()  # file as list
    # use zip to get the entire file as a list of 5-line chunk tuples
    five_line_increments = zip(*[iter(file_contents)]*5)
    for chunk in five_line_increments:  # for each 5-line chunk...
        for sentence in chunk:  # ...and for each sentence in that chunk
            words = sentence.split('\t')[3].split()  # get the ITEMS column at index 3
            words_no_comma = [x.strip(',') for x in words]  # get rid of the commas
            words_no_ws = [x.strip(' ') for x in words_no_comma]  # get rid of the whitespace resulting from the removed commas
        # STUCK HERE. The idea originally was to take the words lists for
        # each chunk and combine them to create a big list, 'collection',
        # and feed this into the for-loop below.
        # collection is a big list containing all of the words in the ITEMS
        # section of the chunk, e.g. ['word1', 'word2', 'word3', 'word1', 'word1', 'word2', etc.]
        for key, group in groupby(collection):
            print key, len(list(group)),
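One way the missing step might be filled in, staying close to the `groupby` approach above, is sketched below. It is written for Python 3 rather than the Python 2 of the code above, and it rests on two assumptions: `collection` is rebuilt for every chunk, and it is sorted before grouping, because `itertools.groupby` only merges adjacent equal items. The name `get_frequencies` and the `chunk_size` parameter are hypothetical.

```python
from itertools import groupby

def get_frequencies(lines, chunk_size=5):
    """Return one list of (word, count) pairs per chunk of chunk_size lines."""
    results = []
    # zip over the same iterator chunk_size times yields consecutive chunks;
    # trailing lines that do not fill a whole chunk are silently dropped.
    for chunk in zip(*[iter(lines)] * chunk_size):
        collection = []  # all ITEMS words of this chunk only
        for line in chunk:
            items = line.rstrip("\n").split("\t")[3]  # ITEMS column at index 3
            collection.extend(w.strip() for w in items.split(",") if w.strip())
        collection.sort()  # groupby needs equal words to be adjacent
        results.append([(key, len(list(group)))
                        for key, group in groupby(collection)])
    return results
```

Since a file object iterates over its lines, something like `get_frequencies(open(path))` should work for a file on disk, with `path` standing in for the actual file name.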
[Discussion]: