【Posted】: 2019-05-28 15:14:42
【Question】:
The following reproducible script computes the accuracy of a Word2Vec classifier using the W2VTransformer wrapper from gensim:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from gensim.sklearn_api import W2VTransformer
from gensim.utils import simple_preprocess
# Load synthetic data
data = pd.read_csv('https://pastebin.com/raw/EPCmabvN')
data = data.head(10)
# Set random seed
np.random.seed(0)
# Tokenize text
X_train = data.apply(lambda r: simple_preprocess(r['text'], min_len=2), axis=1)
# Get labels
y_train = data.label
train_input = [x[0] for x in X_train]
# Train W2V Model
model = W2VTransformer(size=10, min_count=1)
model.fit(X_train)
clf = LogisticRegression(penalty='l2', C=0.1)
clf.fit(model.transform(train_input), y_train)
text_w2v = Pipeline(
    [('features', model),
     ('classifier', clf)])
score = text_w2v.score(train_input, y_train)
score
0.80000000000000004
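The problem with this script is that it only works when train_input = [x[0] for x in X_train], which essentially keeps only the first word of each document.
Once that is changed to train_input = X_train (or train_input is simply replaced by X_train), the script returns:
ValueError: cannot reshape array of size 10 into shape (10,10)
How can I solve this, i.e. how can the classifier work with more than one input word?
Edit:
Apparently, unlike the D2V wrapper, the W2V wrapper cannot work with variable-length training inputs. Here is a working D2V version: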
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score, classification_report
from sklearn.pipeline import Pipeline
from gensim.utils import simple_preprocess, lemmatize
from gensim.sklearn_api import D2VTransformer
data = pd.read_csv('https://pastebin.com/raw/bSGWiBfs')
np.random.seed(0)
X_train = data.apply(lambda r: simple_preprocess(r['text'], min_len=2), axis=1)
y_train = data.label
model = D2VTransformer(dm=1, size=50, min_count=2, iter=10, seed=0)
model.fit(X_train)
clf = LogisticRegression(penalty='l2', C=0.1, random_state=0)
clf.fit(model.transform(X_train), y_train)
pipeline = Pipeline([
    ('vec', model),
    ('clf', clf)
])
y_pred = pipeline.predict(X_train)
score = accuracy_score(y_train,y_pred)
print(score)
【Comments】:
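- Where does the script raise the ValueError? (It is easier to see what is wrong if you show the full error stack, so you should edit the question to include that extra detail.)
- Re: your update: yes, W2VTransformer does not collapse a variable-length list of words into a single vector, because that is not something the wrapped Word2Vec model is automatically expected to do. Instead, it converts a variable-length list of words into a list of vectors of the same length, one vector per word. If you need them collapsed into a single vector for the following steps, you could implement that as a subsequent transformer, perhaps one that averages all the vectors together. (That is often a simple baseline approach, but depending on your data and goals, other weightings or algorithms may work better.)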
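To make the averaging idea from the comment above concrete, here is a minimal sketch of such a step. It is not part of gensim or scikit-learn: the class name MeanWordVectorizer and the toy word vectors are made up for illustration, and the code uses only the standard library so that it runs on its own. It assumes word_vectors is any mapping from a token to a fixed-length sequence of floats.

```python
from statistics import fmean  # available in Python 3.8+


class MeanWordVectorizer:
    """Hypothetical helper: turn each token list into one averaged vector.

    `word_vectors` is assumed to be a mapping from token to a fixed-length
    sequence of floats (a plain dict here, purely for illustration).
    """

    def __init__(self, word_vectors, size):
        self.word_vectors = word_vectors
        self.size = size

    def fit(self, X, y=None):
        # Nothing is learned here; the word vectors are assumed to exist already.
        return self

    def transform(self, X):
        rows = []
        for tokens in X:
            # Keep only the tokens that have a vector.
            known = [self.word_vectors[t] for t in tokens if t in self.word_vectors]
            if known:
                # zip(*known) groups the values dimension by dimension.
                rows.append([fmean(dim) for dim in zip(*known)])
            else:
                # No known token: fall back to a zero vector of the right length.
                rows.append([0.0] * self.size)
        return rows


# Toy 2-dimensional vectors, invented for this example.
toy_vectors = {"good": [1.0, 0.0], "bad": [0.0, 1.0], "movie": [0.5, 0.5]}
docs = [["good", "movie"], ["bad"], ["unseen"]]

features = MeanWordVectorizer(toy_vectors, size=2).fit(docs).transform(docs)
print(features)  # [[0.75, 0.25], [0.0, 1.0], [0.0, 0.0]]
```

Every document now yields exactly one row of length size, however many tokens it has, which is the shape a classifier such as LogisticRegression expects. With a trained gensim model you could presumably pass its word vectors (for the wrapper in the question, something like model.gensim_model.wv) instead of the toy dict, but I have not tested that here. To use the class as a real Pipeline step you would probably also want to subclass scikit-learn's BaseEstimator and TransformerMixin.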
Tags: scikit-learn gensim word2vec