【Question title】: Why is the following partial fit not working properly?
【Posted】: 2017-09-11 07:47:22
【Question description】:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

Hello, I have the following list of comments:

comments = ['I am very agry','this is not interesting','I am very happy']

These are the corresponding labels:

sents = ['angry','indiferent','happy']

I am using tfidf to vectorize these comments as follows:

tfidf_vectorizer = TfidfVectorizer(analyzer='word')
tfidf = tfidf_vectorizer.fit_transform(comments)
from sklearn import preprocessing

I am using a label encoder to encode the labels:

le = preprocessing.LabelEncoder()
le.fit(sents)
labels = le.transform(sents)
print(labels.shape)
import pickle
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.model_selection import train_test_split
with open('tfidf.pickle','wb') as idxf:
    pickle.dump(tfidf, idxf, pickle.HIGHEST_PROTOCOL)
with open('tfidf_vectorizer.pickle','wb') as idxf:
    pickle.dump(tfidf_vectorizer, idxf, pickle.HIGHEST_PROTOCOL)

Here I use the passive-aggressive classifier to fit the model:

clf2 = PassiveAggressiveClassifier()


with open('passive.pickle','wb') as idxf:
    pickle.dump(clf2, idxf, pickle.HIGHEST_PROTOCOL)

with open('passive.pickle', 'rb') as infile:
    clf2 = pickle.load(infile)

with open('tfidf_vectorizer.pickle', 'rb') as infile:
    tfidf_vectorizer = pickle.load(infile)
with open('tfidf.pickle', 'rb') as infile:
    tfidf = pickle.load(infile)

Here I try to test the use of partial fit with three new comments and their corresponding labels, as follows:

new_comments = ['I love the life','I hate you','this is not important']
new_labels = [1,0,2]
vec_new_comments = tfidf_vectorizer.transform(new_comments)

print(clf2.predict(vec_new_comments))
clf2.partial_fit(vec_new_comments, new_labels)

The problem is that I do not get the correct results after the partial fit:

print('AFTER THIS UPDATE THE RESULT SHOULD BE 1,0,2??')
print(clf2.predict(vec_new_comments))

However, I get this output:

[2 2 2]

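I would therefore really appreciate help in understanding why the model is not updated, given that I am testing it with the same examples it was just trained on. The desired output should be: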

[1,0,2]

I would also appreciate help with tuning the hyperparameters to get the desired output.

Here is the complete code showing the partial fit:

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
import pickle
import sys
from sklearn.metrics.pairwise import cosine_similarity
import random


comments = ['I am very agry','this is not interesting','I am very happy']
sents = ['angry','indiferent','happy']
tfidf_vectorizer = TfidfVectorizer(analyzer='word')
tfidf = tfidf_vectorizer.fit_transform(comments)
#print(tfidf.shape)
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(sents)
labels = le.transform(sents)

from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.model_selection import train_test_split
with open('tfidf.pickle','wb') as idxf:
    pickle.dump(tfidf, idxf, pickle.HIGHEST_PROTOCOL)
with open('tfidf_vectorizer.pickle','wb') as idxf:
    pickle.dump(tfidf_vectorizer, idxf, pickle.HIGHEST_PROTOCOL)

clf2 = PassiveAggressiveClassifier()

clf2.fit(tfidf, labels)


with open('passive.pickle','wb') as idxf:
    pickle.dump(clf2, idxf, pickle.HIGHEST_PROTOCOL)

with open('passive.pickle', 'rb') as infile:
    clf2 = pickle.load(infile)



with open('tfidf_vectorizer.pickle', 'rb') as infile:
    tfidf_vectorizer = pickle.load(infile)
with open('tfidf.pickle', 'rb') as infile:
    tfidf = pickle.load(infile)

new_comments = ['I love the life','I hate you','this is not important']
new_labels = [1,0,2]

vec_new_comments = tfidf_vectorizer.transform(new_comments)

clf2.partial_fit(vec_new_comments, new_labels)



print('AFTER THIS UPDATE THE RESULT SHOULD BE 1,0,2??')
print(clf2.predict(vec_new_comments))

However, I got:

AFTER THIS UPDATE THE RESULT SHOULD BE 1,0,2??
[2 2 2]

【Question discussion】:

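  • How are you fitting clf2? Please post the entire code as a single code snippet. Having to copy and paste it piece by piece is very annoying.
  • @VivekKumar I have updated the question and added the complete code to reproduce my problem. Thanks for the support.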

Tags: machine-learning scikit-learn


【Solution 1】:

There are multiple problems with your code. I will start with the obvious ones and move on to the more complex ones:

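  1. You pickle clf2 before it has learned anything (i.e. you pickle it as soon as it is defined, which serves no purpose). If you are just testing, that is fine. Otherwise, the model should be pickled after the fit() or equivalent call.
  2. You call clf2.fit() before clf2.partial_fit(). This defeats the whole purpose of partial_fit(). When you call fit(), you essentially fix the classes (labels) that the model will learn. In your case this is acceptable, because your subsequent call to partial_fit() supplies the same labels. But it is still not good practice.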

    See this for more details

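    In a partial_fit() scenario, never call fit(). Always call partial_fit(), both for your starting data and for any new data. But make sure that the first call to partial_fit() supplies, in the classes parameter, all the labels you want the model to learn.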

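  3. Now the last part, about your tfidf_vectorizer. You call fit_transform() (which is essentially fit() followed by transform()) on tfidf_vectorizer with the comments array. This means that on subsequent calls to transform() (as you did in transform(new_comments)), it will not learn new words from new_comments; it will only use the words it saw during the call to fit() (the words present in comments).

    The same goes for LabelEncoder and sents.

    This is again not preferable in an online learning scenario. You should fit on all the available data at once. But since you are trying to use partial_fit(), we assume that you have a very large dataset that may not fit in memory at once. So you would want to apply some sort of partial_fit to TfidfVectorizer as well. But TfidfVectorizer does not support partial_fit(). In fact, it is not designed for large data. So you need to change your approach. See the related questions on this topic for more details.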
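
One possible change of approach, offered here only as a sketch and not as what the answer itself prescribes, is to swap TfidfVectorizer for scikit-learn's HashingVectorizer. It is stateless, so it needs no fit() and words that first appear in a later batch still get features. Note that this is a different technique: hashing applies no IDF weighting. The names `vectorizer`, `clf` and `batches`, the `n_features` value, the hard-coded label integers and `random_state=0` are assumptions made for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier

comments = ['I am very agry', 'this is not interesting', 'I am very happy']
new_comments = ['I love the life', 'I hate you', 'this is not important']

# Assumed integer encoding: 0 = angry, 1 = happy, 2 = indiferent
all_classes = np.array([0, 1, 2])
batches = [
    (comments, [0, 2, 1]),
    (new_comments, [1, 0, 2]),
]

# Stateless vectorizer: transform() works without a prior fit()
vectorizer = HashingVectorizer(analyzer='word', n_features=2 ** 18)
clf = PassiveAggressiveClassifier(random_state=0)

for texts, y in batches:
    X = vectorizer.transform(texts)
    # Passing the same `classes` on every call is allowed; only the first call requires it
    clf.partial_fit(X, y, classes=all_classes)

X_new = vectorizer.transform(new_comments)
print(clf.predict(X_new))
```

Because the feature space is fixed by `n_features` rather than by a learned vocabulary, every batch is mapped into the same columns no matter when its words are first seen.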

Apart from that, if you change only the tfidf part so that it is fitted on the whole data (comments and new_comments together), you will get the desired results.
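
To see why the vocabulary matters, the following sketch counts how many non-zero features each new comment gets under a vectorizer fitted on comments only versus one fitted on comments + new_comments. The names `narrow`, `wide`, `narrow_nnz` and `wide_nnz` are introduced here for illustration only.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

comments = ['I am very agry', 'this is not interesting', 'I am very happy']
new_comments = ['I love the life', 'I hate you', 'this is not important']

# Vocabulary learned from the original comments only
narrow = TfidfVectorizer(analyzer='word').fit(comments)
# Vocabulary learned from all the available text
wide = TfidfVectorizer(analyzer='word').fit(comments + new_comments)

# Number of non-zero features per new comment under each vocabulary
narrow_nnz = np.count_nonzero(narrow.transform(new_comments).toarray(), axis=1)
wide_nnz = np.count_nonzero(wide.transform(new_comments).toarray(), axis=1)

print(narrow_nnz)  # [0 0 3]
print(wide_nnz)    # [3 2 4]
```

With the narrow vocabulary, the first two new comments become all-zero vectors (the default tokenizer also drops the single-letter "I"), so the classifier receives identical, empty inputs for two different labels and cannot tell them apart.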

See the code changes below (I have tidied it up a little and renamed vec_new_comments to new_tfidf, so please read it carefully):

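from sklearn import preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier

comments = ['I am very agry','this is not interesting','I am very happy']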
sents = ['angry','indiferent','happy']

new_comments = ['I love the life','I hate you','this is not important']
new_sents = ['happy','angry','indiferent']

tfidf_vectorizer = TfidfVectorizer(analyzer='word')
le = preprocessing.LabelEncoder()

# The lines below are important

# I have given the whole data to tfidf_vectorizer to fit
tfidf_vectorizer.fit(comments + new_comments)

# Same for `sents`, but since the labels don't change, it doesn't matter which you use; the result is the same
# le.fit(sents)
le.fit(sents + new_sents) 
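
As a small illustration of the LabelEncoder point, the sketch below shows that the encoder assigns integers in sorted order of the labels it was fitted on, and that transform() rejects a label it has never seen. The label 'sad' and the variable `unseen_error` are hypothetical, added only for this example.

```python
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(['angry', 'indiferent', 'happy'])

# Classes are stored sorted, so angry=0, happy=1, indiferent=2
print(le.classes_)
print(le.transform(['happy', 'angry', 'indiferent']))  # [1 0 2]

# A label that was not present during fit() cannot be encoded
unseen_error = None
try:
    le.transform(['sad'])
except ValueError as exc:
    unseen_error = str(exc)
print(unseen_error)
```

This is also why the hand-written new_labels = [1,0,2] in the question happens to line up with 'happy', 'angry', 'indiferent'.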

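The following is the less preferred code (the approach you are using, which I discussed in point 2), but the results will be fine as long as you make the changes above.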

tfidf = tfidf_vectorizer.transform(comments)
labels = le.transform(sents)

clf2 = PassiveAggressiveClassifier()
clf2.fit(tfidf, labels)
print(clf2.predict(tfidf))
# [0 2 1]

new_tfidf = tfidf_vectorizer.transform(new_comments)
new_labels = le.transform(new_sents)

clf2.partial_fit(new_tfidf, new_labels)
print(clf2.predict(new_tfidf))
# [1 0 2]     As you wanted

The correct way, or the way partial_fit() is meant to be used:

# Declare all the labels that you want the model to learn
# Using the classes learnt by the LabelEncoder for this
# In any call to `partial_fit()`, all labels should come from this array only

all_classes = le.transform(le.classes_)

# Start from a fresh, unfitted classifier so that fit() is never called on it
clf2 = PassiveAggressiveClassifier()

# Notice the classes parameter here
# It needs to be present in the first call
clf2.partial_fit(tfidf, labels, classes=all_classes)
print(clf2.predict(tfidf))
# [0 2 1]

# classes is not present here
clf2.partial_fit(new_tfidf, new_labels)
print(clf2.predict(new_tfidf))
# [1 0 2]
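
Putting the three points together, a self-contained sketch might look like the following: the vectorizer and label encoder are fitted on all the available data, the classifier only ever sees partial_fit(), and it is pickled after it has learned something. The temporary directory, the names `vec`, `clf`, `restored` and `model_path`, and `random_state=0` are assumptions made for illustration; no particular prediction is guaranteed on such a tiny dataset.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.preprocessing import LabelEncoder

comments = ['I am very agry', 'this is not interesting', 'I am very happy']
sents = ['angry', 'indiferent', 'happy']
new_comments = ['I love the life', 'I hate you', 'this is not important']
new_sents = ['happy', 'angry', 'indiferent']

# Fit the vectorizer and the label encoder on everything that is available
vec = TfidfVectorizer(analyzer='word').fit(comments + new_comments)
le = LabelEncoder().fit(sents + new_sents)
all_classes = np.arange(len(le.classes_))

# First batch: partial_fit() with the full list of classes, never fit()
clf = PassiveAggressiveClassifier(random_state=0)
clf.partial_fit(vec.transform(comments), le.transform(sents), classes=all_classes)

# Pickle only after the model has learned something
model_path = os.path.join(tempfile.mkdtemp(), 'passive.pickle')
with open(model_path, 'wb') as outfile:
    pickle.dump(clf, outfile, pickle.HIGHEST_PROTOCOL)

# Later: reload the model and continue training on the new batch
with open(model_path, 'rb') as infile:
    restored = pickle.load(infile)
restored.partial_fit(vec.transform(new_comments), le.transform(new_sents))

predicted = le.inverse_transform(restored.predict(vec.transform(new_comments)))
print(predicted)
```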

【Discussion】:

  • Thank you very much for the support, I finally got past this problem.