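【Question title】: MemoryError with a large .txt file in Python when counting and ranking words
【Posted】: 2018-12-09 10:05:49
【Question】:

I am trying to create a CSV file with a ranked word list from a 500 MB text file containing Finnish text. The script does what I want on small files, but it fails on the 500 MB beast.

I am a complete beginner with Python, so apologies if this is sloppily written. From looking around, I think I may have to process the file line by line.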

with open(...) as f:
    for line in f:
        # Do something with 'line'

I would appreciate any pointers, cheers! My code is below:

#load text
filename = 'finnish_text.txt'
file = open(filename, 'r')
text = file.read()
file.close()

#lowercase and split words by white space
lowercase = text.lower()
words = lowercase.split()

# remove punctuation from each word
import string
table = str.maketrans('', '', string.punctuation)
stripped = [w.translate(table) for w in words]

# ranked word count specify return amount here
from collections import Counter
Counter = Counter(stripped)
most_occur = Counter.most_common(100)

# export csv file
import csv
with open('word_rank.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter=',')
    for x in most_occur:
        writer.writerow(x)

Edit: I ended up using the second solution @Bharel gave in his comment (what a legend). Because of encoding problems I had to change a couple of lines.

with open(filename, 'r', encoding='Latin-1', errors='replace') as file:

with open('word_rank.csv', 'w', newline='', errors='replace') as csvfile:
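
For anyone hitting the same encoding problem, here is a rough sketch of what the `encoding` and `errors` arguments do when reading. The sample string and the temporary file are made up for illustration, and whether Latin-1 is the right encoding for a particular file is an assumption that has to be checked against the data.

```python
import os
import tempfile

# Hypothetical sample: Finnish text stored on disk as Latin-1 bytes.
sample = "Hyvää päivää, äiti!\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "wb") as raw:
        raw.write(sample.encode("latin-1"))

    # Reading with the matching encoding recovers the original text.
    with open(path, "r", encoding="latin-1", errors="replace") as f:
        decoded = f.read()

    # Reading the same bytes as UTF-8 does not raise with errors='replace';
    # each undecodable byte is turned into U+FFFD instead.
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        lossy = f.read()

print(decoded == sample)
print("\ufffd" in lossy)
```

Note that `errors='replace'` hides problems rather than fixing them, so it is worth checking the output for replacement characters before trusting the counts.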

【Question comments】:

    Tags: python python-3.x text


    【Solution 1】:

    Switch everything to generators and it should work:

    #load text
    filename = 'finnish_text.txt'
    # Auto-close when done
    with open(filename, 'r') as file:
    
        #lowercase and split words by white space
        word_iterables = (line.lower().split() for line in file)
    
        # remove punctuation from each word
        import string
        table = str.maketrans('', '', string.punctuation)
    
        stripped = (w.translate(table) for it in word_iterables for w in it)
    
        # ranked word count specify return amount here
        from collections import Counter
        counter = Counter(stripped)
    
    most_occur = counter.most_common(100)
    
    # export csv file
    import csv
    with open('word_rank.csv', 'w', newline='') as csvfile:
        writer = csv.writer(csvfile, delimiter=',')
        for x in most_occur:
            writer.writerow(x)
    

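    By using generators (parentheses instead of square brackets), the words are processed lazily, one line at a time, instead of all being loaded into memory at once.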
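
To make the lazy version easier to try out, here is a minimal sketch that wraps the same idea in a function and runs it on an in-memory sample instead of the real file. The function name `count_words`, the sample text, and the filter that drops tokens left empty after punctuation removal are my own assumptions, not part of the code above.

```python
import io
import string
from collections import Counter

PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def count_words(lines):
    """Count words from any iterable of lines, consuming one line at a time."""
    words = (
        word.translate(PUNCT_TABLE)
        for line in lines
        for word in line.lower().split()
    )
    # Assumption: tokens that were pure punctuation become '' and are skipped.
    return Counter(w for w in words if w)

# Hypothetical stand-in for the 500 MB file; an open file object works the same way.
sample = io.StringIO("Kissa istuu.\nKoira ja kissa!\n- kissa\n")
counts = count_words(sample)
print(counts.most_common(2))
```

Only the `Counter` itself grows with the input, and it holds one entry per distinct word rather than one per occurrence.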


    If you want the most efficient approach, I wrote one as a self-challenge:

    import itertools
    import operator
    
    #load text
    filename = 'finnish_text.txt'
    # Auto-close when done
    with open(filename, 'r') as file:
    
        # Lowercase the lines
        lower_lines = map(str.lower, file)
    
        # Split the words in each line - will return [[word, word], [word, word]]
        word_iterables = map(str.split, lower_lines)
    
        # Combine the iterables:
        # i.e. [[word, word], [word, word]] -> [word, word, word, word]
        words = itertools.chain.from_iterable(word_iterables)
    
        import string
        table = str.maketrans('', '', string.punctuation)
    
        # remove punctuation from each word
        stripped = map(operator.methodcaller("translate", table), words)
    
        # ranked word count specify return amount here
        from collections import Counter
        counter = Counter(stripped)
    
    most_occur = counter.most_common(100)
    
    # export csv file
    import csv
    with open('word_rank.csv', 'w', newline='') as csvfile:
        writer = csv.writer(csvfile, delimiter=',')
        for x in most_occur:
            writer.writerow(x)
    

    It makes full use of lazy iterators implemented in C (map and itertools).
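
As a rough illustration of that laziness, the sketch below builds the same pipeline over a small in-memory sample and checks that nothing is read until `Counter` starts pulling items. The sample text and the use of `io.StringIO` in place of the real file are assumptions made for the demo.

```python
import io
import itertools
import operator
import string
from collections import Counter

# Hypothetical stand-in for the open text file.
lines = io.StringIO("Yksi kaksi\nKaksi, kolme!\n")
table = str.maketrans('', '', string.punctuation)

lower_lines = map(str.lower, lines)
word_lists = map(str.split, lower_lines)
words = itertools.chain.from_iterable(word_lists)
stripped = map(operator.methodcaller("translate", table), words)

# Building the pipeline has not consumed any input yet.
position_before = lines.tell()

# Counter drives the whole chain, pulling one word at a time.
counter = Counter(stripped)
position_after = lines.tell()

print(position_before, position_after)
print(counter.most_common(1))
```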

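    【Discussion】:

    • Please don't do Counter = Counter(stripped); it changes Counter from a class into an instance of that class, which confuses maintainers who expect it to be the class. PEP8 naming only requires calling the variable counter (assuming no more specific name exists).
    • @ShadowRanger Oh, I didn't notice. I took it from the OP's code. Thanks, Mr. Ranger :P
    • Why not rename counter to something more descriptive, such as count_dict? Granted, Counter and counter are different, but making it more obvious does no harm.
    • @jpp That is up to the OP. I don't want to mess with his variable names in case he uses them in later code.
    • @Bharel, fair point, though pointing out questionable practices is also good practice (IMO) :)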
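
A small sketch of the problem raised in the first comment, using a made-up word list: once the name `Counter` is rebound to an instance, it can no longer be used to build a second counter.

```python
from collections import Counter

words = ["kissa", "koira", "kissa"]  # hypothetical sample data

# Rebinding the class name to an instance, as in the question's code.
Counter = Counter(words)

try:
    Counter(["uusi", "lista"])  # the name now refers to an instance, which is not callable
    shadowing_error = None
except TypeError as exc:
    shadowing_error = exc

print(type(shadowing_error).__name__)

# Restoring the class and using a lowercase variable name avoids the clash.
from collections import Counter
counter = Counter(words)
second = Counter(["uusi", "lista"])
print(counter.most_common(1))
```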