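【Posted】: 2018-12-09 10:05:49
【Question】:
I am trying to create a sorted word-list CSV file from a 500 MB text file containing Finnish text. The script does what I want on small files, but it does not work on the 500 MB beast.
I am a complete beginner with Python, so apologies if this is sloppily written. From looking around, I think I may have to process the file line by line: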
with open(...) as f:
    for line in f:
        # Do something with 'line'
I would appreciate any pointers, cheers! My code is below:
#load text
filename = 'finnish_text.txt'
file = open(filename, 'r')
text = file.read()
file.close()
#lowercase and split words by white space
lowercase = text.lower()
words = lowercase.split()
# remove punctuation from each word
import string
table = str.maketrans('', '', string.punctuation)
stripped = [w.translate(table) for w in words]
# ranked word count specify return amount here
from collections import Counter
# use a separate name so the Counter class is not shadowed
word_counts = Counter(stripped)
most_occur = word_counts.most_common(100)
# export csv file
import csv
with open('word_rank.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter=',')
    for x in most_occur:
        writer.writerow(x)
Edit: I ended up using the second solution that @Bharel gave in his comment (what a legend). Because of encoding issues, I had to change a couple of lines:
with open(filename, 'r', encoding='Latin-1', errors='replace') as file:
with open('word_rank.csv', 'w', newline='', errors='replace') as csvfile:
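For anyone landing here with the same problem, the line-by-line idea from the question can be sketched roughly as follows. This is not necessarily the solution referred to above, which is not reproduced in this post; it is only a minimal sketch that assumes the same cleaning steps as the original script (lowercase, split on whitespace, strip ASCII punctuation) and the Latin-1 encoding from the edit. The names `PUNCT_TABLE`, `count_words` and `write_top_words` are made up for illustration. One small difference from the original: tokens that become empty after punctuation is stripped are skipped rather than counted.

```python
import csv
import string
from collections import Counter

# Translation table that deletes ASCII punctuation (same idea as the original script)
PUNCT_TABLE = str.maketrans('', '', string.punctuation)


def count_words(lines):
    """Count words from any iterable of text lines, one line at a time."""
    counts = Counter()
    for line in lines:
        # Lowercase, split on whitespace, then strip punctuation from each token
        words = (w.translate(PUNCT_TABLE) for w in line.lower().split())
        # Skip tokens that were pure punctuation and are now empty
        counts.update(w for w in words if w)
    return counts


def write_top_words(in_path, out_path, top_n=100, encoding='Latin-1'):
    """Stream in_path line by line and write the top_n (word, count) rows as CSV."""
    # Iterating over the file object reads one line at a time,
    # so the whole text is never held in memory at once
    with open(in_path, 'r', encoding=encoding, errors='replace') as infile:
        counts = count_words(infile)
    with open(out_path, 'w', newline='', errors='replace') as csvfile:
        csv.writer(csvfile).writerows(counts.most_common(top_n))
    return counts
```

With the file names from the question, this would be called as `write_top_words('finnish_text.txt', 'word_rank.csv')`. Only the counter of distinct words stays in memory, which is usually far smaller than the raw text.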
【Discussion】:
Tags: python python-3.x text