【Question title】: Python code taking more than 15 minutes to generate output
【Posted】: 2018-10-01 14:39:51
【Question description】:
import os,re
import math
from math import log10
import nltk.corpus
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from collections import defaultdict
python_file_root = './presidential_debates'

def getidf(token):
    document_occurance = 0
    for filename in os.listdir(python_file_root):
        file = open(os.path.join(python_file_root, filename), "r")
        for line in file:
            if re.search(r'\b' +token+ r'\b', line):
                document_occurance = document_occurance + 1
                break      
    if (document_occurance != 0):
        idf = log10(30 / document_occurance)                   
        return idf
    return -1

def normalize(filename,token):
    file = open(os.path.join(python_file_root, filename), "r")
    counts = dict()
    square = []
    count1 = 0
    for line in file:
        count1 = count1 + 1
        if line in counts:
            counts[line] += 1
        else:
            counts[line] = 1
    for key,value in counts.items():
        tf = 1 +log10(value)
        idf = getidf(key.rstrip())
        square.append((tf * idf)*(tf * idf))
    summ = sum(square)
    sqroot = math.sqrt(summ) 
    return sqroot

def getweight(filename,token):
    hit_count1 = 0
    final = 0
    file = open(os.path.join(python_file_root, filename), "r")
    idft = getidf(token)
    for line in file:
        if re.search(r'\b' +token+ r'\b', line):
            hit_count1 = hit_count1 + 1
    if (hit_count1 == 0):
        return 0
    else:    
        tf = 1 + log10(hit_count1)
    initial = idft * tf
    if(initial <= 0):
        final = 0
        return final
    else:
        normalize_fact = normalize(filename,token)
        final = initial / normalize_fact
        return final  

for filename in os.listdir(python_file_root):
    file = open(os.path.join(python_file_root, filename), "r")
    doc = file.read() 
    doc = doc.lower()
    stemmed = []
    tokenizer = RegexpTokenizer(r'[a-zA-Z]+')
    tokens = tokenizer.tokenize(doc)
    stoplist = stopwords.words('english')
    stop_removed = [word for word in tokens if word not in stoplist]
    with open(os.path.join(python_file_root, filename), "w") as f:
        for item in stop_removed:
            stemmer = PorterStemmer()
            stemmed = [stemmer.stem(item)]
            for items in stemmed:
                f.write("%s\n" % items)
print("\nIDF\n")
print("%.12f" % getidf("health"))
print("%.12f" % getidf("agenda"))
print("%.12f" % getidf("vector"))
print("%.12f" % getidf("reason"))
print("%.12f" % getidf("hispan"))
print("%.12f" % getidf("hispanic"))
print("\n")
print("%.12f" % getweight("2012-10-03.txt","health"))
print("%.12f" % getweight("1960-10-21.txt","reason"))
print("%.12f" % getweight("1976-10-22.txt","agenda"))
print("%.12f" % getweight("2012-10-16.txt","hispan"))
print("%.12f" % getweight("2012-10-16.txt","hispanic"))

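I have 30 txt files, and I have written a program to find the idf and the normalized tf-idf vector. I get the correct values, but the function getweight takes more than 15 minutes to generate output. Can anyone suggest ways to optimize it? I don't want to use any other non-standard Python packages.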

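【Comments】:

  • I believe this may be better suited for codereview.stackexchange.com
  • Not a problem as such, but you can change hit_count1 = hit_count1 + 1 to hit_count1 += 1
  • You first need to profile your script and determine where it spends most of its time. This can be done with the standard Python library. See How can you profile a script?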
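
The profiling suggested in the last comment can be sketched with the standard library's cProfile and pstats modules. This is a minimal illustration, not the asker's code: slow_lookup and profile_call are hypothetical names, and slow_lookup is only a stand-in for a function such as getidf or getweight that rescans the whole corpus on every call.

```python
import cProfile
import io
import pstats


def slow_lookup(words, corpus):
    """Hypothetical stand-in: rescans the whole corpus for every word."""
    return [sum(1 for doc in corpus if w in doc) for w in words]


def profile_call(func, *args):
    """Run func(*args) under cProfile and return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the most expensive call chains come first.
    stats.sort_stats("cumulative").print_stats(10)
    return result, buffer.getvalue()


corpus = [["health", "care"], ["agenda"], ["health"]]
result, report = profile_call(slow_lookup, ["health", "agenda", "vector"], corpus)
print(report)
```

For a whole script, running `python -m cProfile -s cumulative script.py` from the command line gives a similar report without changing the code.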

Tags: python performance optimization data-mining tf-idf


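【Solution 1】:

Why are you creating a PorterStemmer for every word?

Apart from that obvious issue, try profiling your code. NLTK has a reputation for being slow, so it may well not be your fault. If you profile, then you will know.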
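
Reading the posted code suggests one likely hot spot, which a profile would confirm or refute: normalize() calls getidf() once per distinct token, and every getidf() call reopens and regex-scans all the files. A possible stdlib-only sketch of reading each file once and answering every query from dictionaries follows. It assumes the files already hold one stemmed token per line, as they do after the question's preprocessing loop. The index and df arguments and the build_index and document_frequency helpers are introduced here for illustration and are not in the question; N is taken as the number of indexed files instead of the hard-coded 30.

```python
import math
import os
import tempfile
from collections import Counter


def build_index(root):
    """Read each file once; return {filename: Counter of its tokens}."""
    index = {}
    for filename in os.listdir(root):
        with open(os.path.join(root, filename), "r") as handle:
            # Assumption: each non-empty line holds exactly one stemmed token.
            index[filename] = Counter(
                line.strip() for line in handle if line.strip()
            )
    return index


def document_frequency(index):
    """Count, for each token, how many documents contain it."""
    df = Counter()
    for counts in index.values():
        df.update(counts.keys())
    return df


def getidf(df, n_docs, token):
    """Same formula as the question: log10(N / df), or -1 if the token is absent."""
    if df[token] == 0:
        return -1
    return math.log10(n_docs / df[token])


def getweight(index, df, filename, token):
    """Normalized tf-idf weight of token in filename, using only lookups."""
    counts = index[filename]
    n_docs = len(index)
    if counts[token] == 0:
        return 0

    def tfidf(term):
        return (1 + math.log10(counts[term])) * getidf(df, n_docs, term)

    # Euclidean length of the document's tf-idf vector.
    norm = math.sqrt(sum(tfidf(term) ** 2 for term in counts))
    return tfidf(token) / norm if norm > 0 else 0


# Tiny two-file corpus, written to a temporary directory for the demo.
with tempfile.TemporaryDirectory() as root:
    docs = {"a.txt": "health\nhealth\ncare\n", "b.txt": "agenda\ncare\n"}
    for name, text in docs.items():
        with open(os.path.join(root, name), "w") as handle:
            handle.write(text)
    index = build_index(root)
    df = document_frequency(index)
    idf_health = getidf(df, len(index), "health")
    weight_health = getweight(index, df, "a.txt", "health")
    print(idf_health, weight_health)
```

With the index built once, each getidf or getweight call costs dictionary lookups plus one pass over the distinct tokens of a single document, instead of rescanning all the files for every token.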

【Discussion】:
