How to implement Tf-idf without using libraries in Python? Answers

【Question title】: How to implement Tf-idf without using libraries in Python?
【Posted】: 2021-03-25 10:15:06
【Question description】:

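Sklearn makes a few tweaks in its implementation of the TFIDF vectorizer, so to reproduce its exact results you need to add the following to your custom tfidf vectorizer implementation: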

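Sklearn's vocabulary, generated from idf, is sorted in alphabetical order.
Sklearn's idf formula differs from the standard textbook formula. The constant "1" is added to the numerator and denominator of the idf, as if an extra document containing every term in the collection exactly once had been seen, which prevents division by zero: IDF(t) = 1 + log_e((1 + total number of documents in the collection) / (1 + number of documents containing term t)).
Sklearn applies L2 normalization to its output matrix.
The final output of sklearn's tfidf vectorizer is a sparse matrix.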
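
The two numeric tweaks above can be sketched with the standard library alone. This is a minimal illustration, not sklearn's own code; the helper names `smoothed_idf` and `l2_normalize` are hypothetical and chosen only for this example.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """IDF(t) = 1 + ln((1 + N) / (1 + df(t))), the smoothed form described above."""
    return 1.0 + math.log((1 + n_docs) / (1 + doc_freq))

def l2_normalize(row):
    """Scale a row so its Euclidean length is 1; an all-zero row is returned unchanged."""
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row] if norm else list(row)

# A term that appears in 2 of 4 documents.
print(smoothed_idf(4, 2))
# A 3-4-5 triangle makes the normalization easy to check by hand.
print(l2_normalize([3.0, 4.0]))  # [0.6, 0.8]
```

A term that occurs in every document gets an IDF of exactly 1 under this formula, so it is down-weighted but never zeroed out.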

I tried to implement it without using libraries, but I ran into an error that I cannot debug.

Code:

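import math
from collections import Counter
from scipy.sparse import csr_matrix          # used in transform()
from sklearn.preprocessing import normalize  # used in transform()

corpus = [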
         'this is the first document',
         'this document is the second document',
         'and this is the third one',
         'is this the first document',
         ]
  

def fit(dataset):    
    unique_words = set() # at first we will initialize an empty set
    # check if its list type or not
    if isinstance(dataset, (list)):
        for document in dataset: # for each review in the dataset
            for word in document.split(" "): # for each word in the review.#split method converts a string into list of words
                if len(word) < 2:
                    continue
                unique_words.add(word)
        unique_words = sorted(list(unique_words))
        vocab = {j:i for i,j in enumerate(unique_words)}
        
        return vocab
    else:
        print("you need to pass a list of sentences")

vocab=fit(corpus)
print(vocab)
Output: {'and': 0, 'document': 1, 'first': 2, 'is': 3, 'one': 4, 'second': 5, 'the': 6, 'third': 7, 'this': 8}

def idf(unique_words):
    idf_dict={}
    N=len(corpus)
    for i in unique_words:
        count=0
        for row in corpus:
            if i in row.split():
                count+=1

        idf_dict[i]=float(1+math.log((N+1)/(count+1)))

    return idf_dict

def transform(dataset,vocab):
    rows = []
    columns = []
    values = []
    if isinstance(dataset, (list,)):
        for idx, row in enumerate(dataset): # for each document in the dataset
            # it will return a dict type object where key is the word and values is its frequency {word:frequency}
            word_freq = dict(Counter(row.split()))
            # for every unique word in the document
            for word, freq in word_freq.items():  # for each unique word in the review.                
                if len(word) < 2:
                    continue
                # we will check if its there in the vocabulary that we build in fit() function
                # dict.get() will return the value; if the key doesn't exist it will return -1
                col_index = vocab.get(word, -1) # retrieving the dimension number of a word
                # if the word exists
                if col_index !=-1:
                    # we are storing the index of the document
                    rows.append(idx)
                    # we are storing the dimensions of the word
                    columns.append(col_index)
                    td = freq/float(len(rows)) # the number of times a word occured in a document
                    idf_ = 1+math.log((1+len(dataset))/float(1+idf(word)))
                    values.append((td) * (idf_))
                    
        return normalize(csr_matrix( ((values),(row,columns)), shape=(len(dataset),len(vocab))),norm='l2' )
    else:
        print("you need to pass list of strings")

print(transform(corpus,vocab))

Error:

 TypeError                                 Traceback (most recent call last)
    <ipython-input-20-8da73617fb69> in <module>()
    ----> 1 print(transform(corpus,vocab))
    
    
         22                     td = freq/float(len(rows)) # the number of times a word occured in a document
         23                     a=idf(word)
    ---> 24                     idf_ = 1+math.log((1+len(dataset))/float(1+idf(word)))
         25                     values.append((td) * (idf_))
         26 
    
    TypeError: unsupported operand type(s) for +: 'int' and 'dict_values'

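【Comments】:

Welcome to SO; it is a good idea to post your code and your error in separate snippets (edited).
Most likely you will eventually want to compare your results with those of sklearn's TfidfVectorizer, so you may want to use their default regex tokenizer instead of document.split(" ").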
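
As a rough sketch of what that comment suggests: TfidfVectorizer's default `token_pattern` is `(?u)\b\w\w+\b` and it lowercases input by default, so a comparable tokenizer can be written with the standard `re` module. The function name `tokenize` is only illustrative.

```python
import re

# Same pattern as TfidfVectorizer's default token_pattern:
# runs of two or more word characters, so single-character tokens are dropped.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(document):
    """Lowercase the text, then extract tokens with the regex above."""
    return TOKEN_PATTERN.findall(document.lower())

print(tokenize("This is a document, isn't it?"))
# ['this', 'is', 'document', 'isn', 'it']
```

Unlike `document.split(" ")`, this ignores punctuation, which matters as soon as the corpus contains commas or apostrophes.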

Tags: python python-3.x dictionary machine-learning nlp

【Solution 1】:

idf(word) -> dict

The function idf returns a dictionary. idf appears to accept the corpus, so call it earlier in the function and then look up the word you need.

tmp_dict = idf(corpus)

...
idf_ = 1+math.log((1+len(dataset))/float(1+tmp_dict[word]))
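
For reference, here is a minimal pure-Python sketch that combines this suggestion with the sklearn conventions listed in the question. It rests on several assumptions that are not in the original posts: the IDF table is computed once from the documents' words (the question's idf iterates over its argument as words, not as documents); the looked-up value is used directly, since it already has the form 1 + ln(...); term frequency is the raw count, because the L2 step cancels any per-document scaling; and the result is a dense list of lists instead of a sparse matrix. The names `build_idf` and `tfidf_rows` are hypothetical.

```python
import math
from collections import Counter

corpus = [
    'this is the first document',
    'this document is the second document',
    'and this is the third one',
    'is this the first document',
]

def build_idf(dataset):
    """Map each word of 2+ characters to its smoothed IDF, computed once."""
    n_docs = len(dataset)
    # Document frequency: each word counts at most once per document.
    doc_freq = Counter(
        word for doc in dataset for word in set(doc.split()) if len(word) >= 2
    )
    return {
        word: 1.0 + math.log((1 + n_docs) / (1 + df))
        for word, df in doc_freq.items()
    }

def tfidf_rows(dataset, idf_table):
    """Return (vocab, matrix): alphabetical column index and L2-normalized rows."""
    vocab = {word: i for i, word in enumerate(sorted(idf_table))}
    matrix = []
    for doc in dataset:
        row = [0.0] * len(vocab)
        for word, freq in Counter(doc.split()).items():
            if word in vocab:
                # Look the IDF up in the dict instead of calling idf() per word.
                row[vocab[word]] = freq * idf_table[word]
        norm = math.sqrt(sum(v * v for v in row))
        matrix.append([v / norm for v in row] if norm else row)
    return vocab, matrix

idf_table = build_idf(corpus)
vocab, matrix = tfidf_rows(corpus, idf_table)
print(vocab)
print([round(v, 4) for v in matrix[0]])
```

Building the IDF table once and indexing it by word avoids the original TypeError, which came from adding an integer to the return value of idf() rather than to a single number.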

【Comments】: