Why is the following partial fit not working properly? (Answer)

【Question title】: Why is the following partial fit not working properly?
【Posted】: 2017-09-11 07:47:22
【Question description】:

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

Hello, I have the following list of comments:

comments = ['I am very agry','this is not interesting','I am very happy']

These are the corresponding labels:

sents = ['angry','indiferent','happy']

I am vectorizing these comments with tf-idf as follows:

tfidf_vectorizer = TfidfVectorizer(analyzer='word')
tfidf = tfidf_vectorizer.fit_transform(comments)
from sklearn import preprocessing

I am encoding the labels with a label encoder:

le = preprocessing.LabelEncoder()
le.fit(sents)
labels = le.transform(sents)
print(labels.shape)
import pickle  # needed for the pickle.dump() calls below
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.model_selection import train_test_split
with open('tfidf.pickle','wb') as idxf:
    pickle.dump(tfidf, idxf, pickle.HIGHEST_PROTOCOL)
with open('tfidf_vectorizer.pickle','wb') as idxf:
    pickle.dump(tfidf_vectorizer, idxf, pickle.HIGHEST_PROTOCOL)

Here I use a passive-aggressive classifier to fit the model:

clf2 = PassiveAggressiveClassifier()


with open('passive.pickle','wb') as idxf:
    pickle.dump(clf2, idxf, pickle.HIGHEST_PROTOCOL)

with open('passive.pickle', 'rb') as infile:
    clf2 = pickle.load(infile)

with open('tfidf_vectorizer.pickle', 'rb') as infile:
    tfidf_vectorizer = pickle.load(infile)
with open('tfidf.pickle', 'rb') as infile:
    tfidf = pickle.load(infile)

Here I try out partial fit with three new comments and their corresponding labels, as follows:

new_comments = ['I love the life','I hate you','this is not important']
new_labels = [1,0,2]
vec_new_comments = tfidf_vectorizer.transform(new_comments)

print(clf2.predict(vec_new_comments))
clf2.partial_fit(vec_new_comments, new_labels)

The problem is that after the partial fit I do not get the correct results, as shown here:

print('AFTER THIS UPDATE THE RESULT SHOULD BE 1,0,2??')
print(clf2.predict(vec_new_comments))

But I get this output:

[2 2 2]

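So I would really appreciate your support. If I test the model on the same examples it has just been trained on, why has the model not been updated? The desired output is: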

[1,0,2]

I would also appreciate help with tuning the hyperparameters to get the desired output.

Here is the complete code showing the partial fit:

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
import pickle  # needed for the pickle.dump() / pickle.load() calls below
import sys
from sklearn.metrics.pairwise import cosine_similarity
import random


comments = ['I am very agry','this is not interesting','I am very happy']
sents = ['angry','indiferent','happy']
tfidf_vectorizer = TfidfVectorizer(analyzer='word')
tfidf = tfidf_vectorizer.fit_transform(comments)
#print(tfidf.shape)
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(sents)
labels = le.transform(sents)

from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.model_selection import train_test_split
with open('tfidf.pickle','wb') as idxf:
    pickle.dump(tfidf, idxf, pickle.HIGHEST_PROTOCOL)
with open('tfidf_vectorizer.pickle','wb') as idxf:
    pickle.dump(tfidf_vectorizer, idxf, pickle.HIGHEST_PROTOCOL)

clf2 = PassiveAggressiveClassifier()

clf2.fit(tfidf, labels)


with open('passive.pickle','wb') as idxf:
    pickle.dump(clf2, idxf, pickle.HIGHEST_PROTOCOL)

with open('passive.pickle', 'rb') as infile:
    clf2 = pickle.load(infile)



with open('tfidf_vectorizer.pickle', 'rb') as infile:
    tfidf_vectorizer = pickle.load(infile)
with open('tfidf.pickle', 'rb') as infile:
    tfidf = pickle.load(infile)

new_comments = ['I love the life','I hate you','this is not important']
new_labels = [1,0,2]

vec_new_comments = tfidf_vectorizer.transform(new_comments)

clf2.partial_fit(vec_new_comments, new_labels)



print('AFTER THIS UPDATE THE RESULT SHOULD BE 1,0,2??')
print(clf2.predict(vec_new_comments))

But I get:

AFTER THIS UPDATE THE RESULT SHOULD BE 1,0,2??
[2 2 2]

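【Question discussion】:

How are you fitting clf2? Please post the entire code as a single code snippet. Right now, copy-pasting it piece by piece is very annoying.
@VivekKumar I have updated the question and added the complete code to reproduce my problem. Thanks for the support.

Tags: machine-learning scikit-learn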

【Solution 1】:

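Your code has several problems. I will go from the most obvious to the more subtle ones:

1. You pickle clf2 before it has learned anything (i.e. you pickle it as soon as it is defined, which serves no purpose). If you are only testing, that is fine. Otherwise, the model should be pickled after fit() or an equivalent call.
2. You call clf2.fit() before clf2.partial_fit(). This defeats the whole purpose of partial_fit(). When you call fit(), you essentially fix the classes (labels) that the model will learn about. In your case this is acceptable, because your subsequent call to partial_fit() supplies the same labels. But it is still not good practice.

See this for more details

In a partial_fit() scenario, never call fit(). Always call partial_fit(), both for your starting data and for newly arriving data. But make sure that in the first call to partial_fit() you supply, via the classes parameter, all the labels you want the model to learn.
3. Now the last part, about your tfidf_vectorizer. You call fit_transform() (essentially fit() followed by transform()) on tfidf_vectorizer with the comments array. This means that on subsequent calls to transform() (as in transform(new_comments)), it will not learn new words from new_comments; it only uses the words it saw during the call to fit() (the words present in comments).

The same goes for LabelEncoder and sents.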
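
To make point 3 concrete, here is a minimal sketch using the data from the question. It assumes the default TfidfVectorizer tokenizer, which lowercases the text and drops single-character tokens such as "I". The label 'sad' is a made-up example that is not part of the question.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

comments = ['I am very agry', 'this is not interesting', 'I am very happy']
new_comments = ['I love the life', 'I hate you', 'this is not important']

# The vocabulary is frozen at fit() time
tfidf_vectorizer = TfidfVectorizer(analyzer='word')
tfidf_vectorizer.fit(comments)

# transform() silently ignores every word that is not in the vocabulary
vec_new_comments = tfidf_vectorizer.transform(new_comments)

# Count the non-zero features of each new comment
nonzero_per_row = vec_new_comments.getnnz(axis=1).tolist()
print(nonzero_per_row)
# [0, 0, 3]

# LabelEncoder behaves similarly, except that it raises on unseen labels
le = LabelEncoder().fit(['angry', 'indiferent', 'happy'])
try:
    le.transform(['sad'])  # 'sad' is a hypothetical unseen label
    unseen_label_rejected = False
except ValueError:
    unseen_label_rejected = True
print(unseen_label_rejected)
# True
```

The first two new comments share no word with the fitted vocabulary, so they become all-zero vectors. The classifier receives identical (empty) inputs for them and cannot tell them apart, whatever labels partial_fit() is given.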

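This, again, is not desirable in an online learning scenario. Ideally you should fit on all the available data at once. But since you are trying to use partial_fit(), we assume that you have a very large dataset that may not fit in memory at once. So you would want to apply some sort of partial_fit to TfidfVectorizer as well. But TfidfVectorizer does not support partial_fit(). In fact, it is not made for large data. So you need to change your approach. See the following questions for more details:
- Updating the feature names into scikit TFIdfVectorizer
- How can i reduce memory usage of Scikit-Learn Vectorizers?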
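
One possible change of approach, offered here only as a sketch and not taken from the linked questions, is to swap TfidfVectorizer for scikit-learn's HashingVectorizer. It is stateless (it hashes tokens into a fixed number of columns), so it needs no fit() and never "misses" a new word. Note that it does not apply IDF weighting, so the features differ from tf-idf. The value n_features=2**18 and random_state=0 are arbitrary choices for illustration.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.preprocessing import LabelEncoder

# Stateless vectorizer: no vocabulary is learned, so no fit() is required
vectorizer = HashingVectorizer(analyzer='word', n_features=2**18)

# All labels the model should ever learn are declared up front
le = LabelEncoder().fit(['angry', 'indiferent', 'happy'])
all_classes = le.transform(le.classes_)

clf = PassiveAggressiveClassifier(random_state=0)

# Two batches standing in for data that arrives over time
batches = [
    (['I am very agry', 'this is not interesting', 'I am very happy'],
     ['angry', 'indiferent', 'happy']),
    (['I love the life', 'I hate you', 'this is not important'],
     ['happy', 'angry', 'indiferent']),
]

for texts, sents in batches:
    X = vectorizer.transform(texts)   # works for words never seen before
    y = le.transform(sents)
    clf.partial_fit(X, y, classes=all_classes)

new_X = vectorizer.transform(batches[1][0])
predictions = le.inverse_transform(clf.predict(new_X))
print(predictions)
```

Unlike in the tf-idf case, every new comment now has non-zero features, so the classifier has something to learn from in each batch. Whether a single pass is enough to reproduce the labels exactly depends on the data and the classifier settings.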

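Apart from that, if you change only the tf-idf part so that the vectorizer is fitted on the whole data (comments and new_comments together), you will get the desired results.

See the code changes below (I have tidied things up a bit and renamed vec_new_comments to new_tfidf, so please read carefully):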

comments = ['I am very agry','this is not interesting','I am very happy']
sents = ['angry','indiferent','happy']

new_comments = ['I love the life','I hate you','this is not important']
new_sents = ['happy','angry','indiferent']

tfidf_vectorizer = TfidfVectorizer(analyzer='word')
le = preprocessing.LabelEncoder()

# The lines below are important

# Fit tfidf_vectorizer on the whole data
tfidf_vectorizer.fit(comments + new_comments)

# Same for `sents`; since the set of labels does not change, it does not matter which call you use, the result is the same
# le.fit(sents)
le.fit(sents + new_sents)

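The code below is the less preferred approach (the one you are using, which I discussed in point 2), but the results will be fine as long as you make the changes above.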

tfidf = tfidf_vectorizer.transform(comments)
labels = le.transform(sents)

clf2.fit(tfidf, labels)
print(clf2.predict(tfidf))
# [0 2 1]

new_tfidf = tfidf_vectorizer.transform(new_comments)
new_labels = le.transform(new_sents)

clf2.partial_fit(new_tfidf, new_labels)
print(clf2.predict(new_tfidf))
# [1 0 2]     As you wanted

The correct way, i.e. how partial_fit() is meant to be used:

# Declare all labels that you want the model to learn
# (using the classes learnt by the LabelEncoder for this)
# In any call to `partial_fit()`, all labels must come from this array only

all_classes = le.transform(le.classes_)

# Notice the `classes` parameter here
# It must be present in the first call
clf2.partial_fit(tfidf, labels, classes=all_classes)
print(clf2.predict(tfidf))
# [0 2 1]

# classes is not present here
clf2.partial_fit(new_tfidf, new_labels)
print(clf2.predict(new_tfidf))
# [1 0 2]
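
Coming back to point 1, the sketch below shows pickling the classifier after it has been trained instead of right after it is created. It is only an illustration: it uses pickle.dumps() / pickle.loads() in memory in place of the files from the question, hard-codes the encoded labels, and sets random_state=0, all of which are assumptions made here for brevity.

```python
import pickle

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier

comments = ['I am very agry', 'this is not interesting', 'I am very happy']
labels = np.array([0, 2, 1])          # angry=0, happy=1, indiferent=2
all_classes = np.array([0, 1, 2])

tfidf_vectorizer = TfidfVectorizer(analyzer='word')
tfidf = tfidf_vectorizer.fit_transform(comments)

clf2 = PassiveAggressiveClassifier(random_state=0)
clf2.partial_fit(tfidf, labels, classes=all_classes)

# Serialize only now, when the model actually holds learned weights
blob = pickle.dumps(clf2, pickle.HIGHEST_PROTOCOL)
restored = pickle.loads(blob)

# The restored model carries the same state as the trained one
same = bool((restored.predict(tfidf) == clf2.predict(tfidf)).all())
print(same)
# True

# It can keep learning from later batches through partial_fit()
restored.partial_fit(tfidf, labels)
```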

【Discussion】:

Thank you very much for the support. I finally got past this problem.