【Posted】: 2021-03-25 10:15:06
【Question】:
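Sklearn makes a few tweaks in its implementation of the TFIDF vectorizer, so to replicate its exact results you need to add the following to a custom tfidf vectorizer implementation:
-
Sklearn generates its vocabulary with the terms sorted in alphabetical order.
-
Sklearn's idf formula differs from the standard textbook formula. The constant "1" is added to the numerator and denominator of the idf, as if an extra document had been seen containing every term in the collection exactly once, which prevents zero divisions: IDF(t) = 1 + log_e((1 + total number of documents in the collection) / (1 + number of documents containing term t)).
-
Sklearn applies L2 normalization to its output matrix.
-
The final output of the sklearn tfidf vectorizer is a sparse matrix.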
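The smoothed idf formula above can be checked by hand on the four-sentence corpus used in this question. The following is a minimal sketch, not sklearn's own code; the helper name `smoothed_idf` is hypothetical and whitespace tokenization is assumed.

```python
import math

corpus = [
    'this is the first document',
    'this document is the second document',
    'and this is the third one',
    'is this the first document',
]

def smoothed_idf(term, documents):
    """IDF(t) = 1 + ln((1 + N) / (1 + df(t))), where df(t) counts documents containing t."""
    n_docs = len(documents)
    # document frequency: in how many documents the term occurs at least once
    doc_freq = sum(1 for doc in documents if term in doc.split())
    return 1 + math.log((1 + n_docs) / (1 + doc_freq))

print(smoothed_idf('this', corpus))    # 'this' occurs in all 4 documents -> 1.0
print(smoothed_idf('second', corpus))  # 'second' occurs in 1 document -> about 1.916
```

A term that appears in every document gets an idf of exactly 1 rather than 0, so it is down-weighted but never removed from the output.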
I tried to implement this without the library, but I ran into an error that I could not debug.
Code:
# imports assumed from the names used below; the original post did not show them
import math
from collections import Counter

from scipy.sparse import csr_matrix
from sklearn.preprocessing import normalize

corpus = [
    'this is the first document',
    'this document is the second document',
    'and this is the third one',
    'is this the first document',
]

def fit(dataset):
    unique_words = set()  # start with an empty set
    # check whether the input is a list
    if isinstance(dataset, list):
        for document in dataset:  # for each document in the dataset
            for word in document.split(" "):  # split() converts a string into a list of words
                if len(word) < 2:  # skip single-character tokens
                    continue
                unique_words.add(word)
        unique_words = sorted(unique_words)
        vocab = {j: i for i, j in enumerate(unique_words)}
        return vocab
    else:
        print("you need to pass a list of sentences")

vocab = fit(corpus)
print(vocab)
Output: {'and': 0, 'document': 1, 'first': 2, 'is': 3, 'one': 4, 'second': 5, 'the': 6, 'third': 7, 'this': 8}
def idf(unique_words):
    idf_dict = {}
    N = len(corpus)
    for i in unique_words:
        count = 0
        for row in corpus:
            if i in row.split():
                count += 1
        idf_dict[i] = float(1 + math.log((N + 1) / (count + 1)))
    return idf_dict
def transform(dataset,vocab):
    rows = []
    columns = []
    values = []
    if isinstance(dataset, (list,)):
        for idx, row in enumerate(dataset):  # for each document in the dataset
            # Counter returns a dict-like object where the key is the word and the value is its frequency {word: frequency}
            word_freq = dict(Counter(row.split()))
            for word, freq in word_freq.items():  # for each unique word in the document
                if len(word) < 2:
                    continue
                # check whether the word is in the vocabulary that we built in fit()
                # dict.get() returns the value; if the key doesn't exist it returns -1
                col_index = vocab.get(word, -1)  # retrieving the dimension number of a word
                # if the word exists
                if col_index !=-1:
                    # we are storing the index of the document
                    rows.append(idx)
                    # we are storing the dimensions of the word
                    columns.append(col_index)
                    td = freq/float(len(rows)) # the number of times a word occured in a document
                    idf_ = 1+math.log((1+len(dataset))/float(1+idf(word)))
                    values.append((td) * (idf_))
        return normalize(csr_matrix( ((values),(row,columns)), shape=(len(dataset),len(vocab))),norm='l2' )
    else:
        print("you need to pass list of strings")

print(transform(corpus,vocab))
Error:
TypeError Traceback (most recent call last)
<ipython-input-20-8da73617fb69> in <module>()
----> 1 print(transform(corpus,vocab))
22 td = freq/float(len(rows)) # the number of times a word occured in a document
23 a=idf(word)
---> 24 idf_ = 1+math.log((1+len(dataset))/float(1+idf(word)))
25 values.append((td) * (idf_))
26
TypeError: unsupported operand type(s) for +: 'int' and 'dict_values'
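The traceback points at `1+idf(word)`: as posted, `idf()` returns a collection of idf values for everything it is given, not a single number, so it cannot be added to an integer. One possible way around this, sketched below under stated assumptions, is to compute the idf dictionary once and look up one float per word. The names `fit_vocab`, `idf_values`, and `tfidf_rows` are hypothetical, and to stay dependency-free the sketch returns one `{column: weight}` dict per document instead of a scipy sparse matrix.

```python
import math
from collections import Counter

corpus = [
    'this is the first document',
    'this document is the second document',
    'and this is the third one',
    'is this the first document',
]

def fit_vocab(documents):
    # alphabetically sorted vocabulary, skipping single-character tokens
    words = {w for doc in documents for w in doc.split() if len(w) >= 2}
    return {word: idx for idx, word in enumerate(sorted(words))}

def idf_values(documents, vocab):
    # smoothed idf: 1 + ln((1 + N) / (1 + df)), computed once per vocabulary word
    n_docs = len(documents)
    idf = {}
    for word in vocab:
        doc_freq = sum(1 for doc in documents if word in doc.split())
        idf[word] = 1 + math.log((1 + n_docs) / (1 + doc_freq))
    return idf

def tfidf_rows(documents, vocab, idf):
    rows = []
    for doc in documents:
        tokens = doc.split()
        weights = {}
        for word, freq in Counter(tokens).items():
            if word in vocab:
                # dividing by the document length cancels out after L2 normalization
                tf = freq / len(tokens)
                weights[vocab[word]] = tf * idf[word]  # idf[word] is a single float
        # L2-normalize the row; an empty row is left as-is to avoid dividing by zero
        norm = math.sqrt(sum(v * v for v in weights.values()))
        rows.append({col: v / norm for col, v in weights.items()} if norm else weights)
    return rows

vocab = fit_vocab(corpus)
idf = idf_values(corpus, vocab)
rows = tfidf_rows(corpus, vocab, idf)
print(rows[0])
```

If a sparse matrix is needed, the same (row, column, value) entries could be collected into three lists and passed to `csr_matrix((values, (rows, columns)), shape=...)` as the original code attempts; note that the posted code passes `row` where `rows` seems intended.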
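【Comments】:
-
Welcome to SO; it is a good idea to post your code and your error in separate snippets (edited).
-
Most likely you will eventually want to compare your results with those of sklearn's TfidfVectorizer, so you may consider using its default regex tokenizer instead of document.split(" ").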
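As a rough illustration of the comment above: TfidfVectorizer's default `token_pattern` is `(?u)\b\w\w+\b` and it lowercases input by default, so a tokenizer along the following lines should split text in a similar way. This is a sketch, not sklearn's code; the helper name `tokenize` is hypothetical.

```python
import re

# same regular expression as TfidfVectorizer's default token_pattern:
# runs of two or more word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(document):
    # lowercase first, mirroring the vectorizer's default lowercase=True
    return TOKEN_PATTERN.findall(document.lower())

print(tokenize("This is a test, isn't it?"))
# ['this', 'is', 'test', 'isn', 'it']
```

Unlike `document.split(" ")`, this drops punctuation and single-character tokens in one step, which also makes the `len(word) < 2` check unnecessary.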
Tags: python python-3.x dictionary machine-learning nlp