【Posted】: 2019-01-03 08:44:16
【Question】:
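import numpy as np
from optparse import OptionParser
from sklearn import svm
from sklearn.manifold import TSNE

# UtteranceEmbedder is a custom class defined elsewhere (not shown)
def vec(utterance):
    embedder = UtteranceEmbedder(utterance)
    word2vec = embedder.as_word2vec()
    bow = embedder.as_bow_vec()
    ret = np.concatenate([word2vec, bow])
    # Zero-pad every vector to a fixed length of 500
    return np.pad(ret, [0, 500 - len(ret)], "constant")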
op = OptionParser()
op.add_option(
    "-f", "--file", help="path to file containing utterances to visualize",
    action="store", type="string", dest="path"
)
(opt, args) = op.parse_args()
# Reject a missing or empty path
if not opt.path:
    op.error("path to file containing newline separated utterances must be specified")
vectors = []
with open(opt.path) as f:
    content = f.readlines()
# Strip whitespace such as the trailing `\n` from each line
for utterance in [x.strip() for x in content]:
    vectors.append(vec(utterance))
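# Reduce to 2 dimensions with t-SNE, then fit the SVM on the reduced points
vectors_reduced = TSNE(n_components=2).fit_transform(np.array(vectors))
X = np.array(vectors_reduced)
# Labels: the first 6 utterances are class 0, the remaining 13 are class 1
y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
clf = svm.SVC(decision_function_shape='ovo', class_weight="balanced")
clf.fit(X, y)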
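Each utterance is a short phrase. I tokenize the utterance, look up the word2vec vectors from the Google 300B pretrained model, append a bag-of-words vector, and fit the data. Here is my training data, input.txt: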
yea
yeah
yaa
say
ok
okay
no
nope
not interested
dont
cant
cannot
not now
not really
not at the moment
no thank you
sorry no
sorry
not active
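As you can see, this is a simple case of two opposite classes (agreement vs. refusal), yet when I plot the points with matplotlib they look about as random as they could be, not linearly separable.
What could cause results this inaccurate?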
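To make the feature layout concrete, here is a minimal sketch of the "embedding, then bag-of-words, then zero-padding" construction described above. It rests on assumptions: `UtteranceEmbedder` is not shown, so `mean_embedding`, `bow_vector`, `TOY_VECTORS` and `VOCABULARY` are hypothetical stand-ins, the toy vectors are made up (not taken from any real word2vec model), averaging the token vectors is only one plausible way to get a single vector per utterance, and plain lists stand in for the NumPy arrays.

```python
from collections import Counter

PAD_LENGTH = 500  # fixed length used by np.pad in the question's vec()


def bow_vector(utterance, vocabulary):
    """Count how often each vocabulary word occurs in the utterance."""
    counts = Counter(utterance.lower().split())
    return [float(counts[word]) for word in vocabulary]


def mean_embedding(utterance, word_vectors, dim):
    """Average the vectors of known tokens (assumed pooling); unknown tokens are skipped."""
    tokens = [t for t in utterance.lower().split() if t in word_vectors]
    if not tokens:
        return [0.0] * dim
    return [sum(word_vectors[t][i] for t in tokens) / len(tokens) for i in range(dim)]


def vec(utterance, word_vectors, dim, vocabulary, length=PAD_LENGTH):
    """Concatenate embedding and bag-of-words parts, then zero-pad on the right."""
    features = mean_embedding(utterance, word_vectors, dim) + bow_vector(utterance, vocabulary)
    if len(features) > length:
        raise ValueError("feature vector is longer than the padded length")
    return features + [0.0] * (length - len(features))


# Toy data: 2-dimensional made-up vectors, purely illustrative.
TOY_VECTORS = {"no": [1.0, 0.0], "sorry": [0.5, 0.5], "ok": [0.0, 1.0]}
VOCABULARY = ["no", "ok", "sorry"]

example = vec("sorry no", TOY_VECTORS, 2, VOCABULARY, length=8)
print(example)  # [0.75, 0.25, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0]
```

In this layout an utterance with no known tokens (for example "yaa" against the toy table) ends up as an all-zero vector, so it may be worth checking how many of the 19 training phrases actually receive a non-zero embedding part.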
【Comments】:
Tags: machine-learning scikit-learn svm svc