SVM visualization really random and inaccurate

【Question title】: SVM visualization really random and inaccurate
【Posted】: 2019-01-03 08:44:16
【Question】:

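from optparse import OptionParser

import numpy as np
from sklearn import svm
from sklearn.manifold import TSNE

# UtteranceEmbedder is the asker's own class; its definition is not shown in the post.

def vec(utterance):
    embedder = UtteranceEmbedder(utterance)
    word2vec = embedder.as_word2vec()
    bow = embedder.as_bow_vec()
    ret = np.concatenate([word2vec, bow])
    # Zero-pad every feature vector to a fixed length of 500.
    return np.pad(ret, [0, 500-len(ret)], "constant")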

op = OptionParser()
op.add_option(
    "-f", "--file", help="path to file containing utterances to visualize",
    action="store", type="string", dest="path"
)

(opt, args) = op.parse_args()
if not opt.path:
    op.error("path to file containing newline separated utterances must be specified")

vectors = []
with open(opt.path) as f:
    content = f.readlines()
    # you may also want to remove whitespace characters like `\n` at the end of each line
    for utterance in [x.strip() for x in content]:
        vectors.append(vec(utterance))


vectors_reduced = TSNE(n_components=2).fit_transform(np.array(vectors))
X=np.array(vectors_reduced)
y=np.array([0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1])
clf = svm.SVC(decision_function_shape='ovo', class_weight="balanced")
clf.fit(X, y)

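An utterance is a short phrase. I tokenize the utterance, look up the word2vec vectors from the Google 300B trained model, append a bag-of-words vector, and fit the data. Below is my training data, input.txt: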

yea
yeah
yaa
say
ok
okay
no
nope
not interested
dont
cant
cannot
not now
not really
not at the moment
no thank you
sorry no
sorry
not active

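As you can see, this is a simple case of two opposite classes, yet when I plot the points with matplotlib they look as random as can be instead of linearly separable.

What could cause a result this inaccurate?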

【Comments】:

Tags: machine-learning scikit-learn svm svc

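【Solution 1】:

You are using SVC with its default parameters. In particular, your kernel is an RBF kernel (also called a Gaussian kernel). How an SVM works is a bit too involved to explain here, and kernels make it more complicated still. If you are interested, I can recommend this MIT lecture: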

https://www.youtube.com/watch?v=_PwhiWxHK8o

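In short, though: with your Gaussian kernel the separation is still linear, but in a higher-dimensional vector space. The kernel maps the data into that space and separates it there linearly with a hyperplane. If you later visualize the data in two dimensions together with the support vectors, the separation does not look linear, because it is linear only in your "kernel vector space".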
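
As a rough illustration of that point, the sketch below fits a linear-kernel and an RBF-kernel SVC on a toy 2-D dataset of two concentric rings. The dataset, the `compare_kernels` helper, and all parameter values are assumptions chosen for illustration, not the asker's data or setup. No straight line separates the rings in two dimensions, so the linear kernel should do poorly while the RBF kernel separates them.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC


def compare_kernels(random_state=0):
    """Fit a linear and an RBF SVC on two concentric rings.

    Returns the training accuracy of each model as (linear_acc, rbf_acc).
    """
    # Toy data (an assumption, not the asker's vectors): two noisy rings,
    # one inside the other, so the classes are not linearly separable in 2-D.
    X, y = make_circles(
        n_samples=200, factor=0.3, noise=0.05, random_state=random_state
    )

    # Same estimator, same data; only the kernel differs.
    linear_acc = SVC(kernel="linear").fit(X, y).score(X, y)
    rbf_acc = SVC(kernel="rbf").fit(X, y).score(X, y)
    return linear_acc, rbf_acc


if __name__ == "__main__":
    linear_acc, rbf_acc = compare_kernels()
    print(f"linear kernel training accuracy: {linear_acc:.2f}")
    print(f"rbf kernel training accuracy:    {rbf_acc:.2f}")
```

If a linear boundary in the 2-D plot is what you expect to see, passing `kernel="linear"` to `svm.SVC` is the setting to try; with the default RBF kernel a curved boundary in the plot is normal.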

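By the way, you should also consider standardization, i.e. rescaling each feature to zero mean and unit variance before fitting the SVM.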
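
One possible way to add standardization, sketched with scikit-learn's `StandardScaler` in a pipeline. The toy features `X_demo`, the labels `y_demo`, and the helper `build_scaled_svc` are hypothetical and only stand in for the asker's vectors; the SVC settings are likewise an assumption.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def build_scaled_svc():
    """Pipeline that standardizes each feature, then fits an RBF-kernel SVC."""
    return make_pipeline(StandardScaler(), SVC(kernel="rbf", class_weight="balanced"))


# Hypothetical toy features on very different scales (not the asker's data).
rng = np.random.default_rng(0)
X_demo = np.column_stack([
    rng.normal(0.0, 1.0, 40),      # small-scale feature
    rng.normal(0.0, 1000.0, 40),   # large-scale feature
])
y_demo = (X_demo[:, 0] > 0).astype(int)  # label depends only on the small-scale feature

# Standardizing on its own: every column ends up with mean 0 and standard deviation 1.
X_scaled = StandardScaler().fit_transform(X_demo)

# Inside the pipeline, the scaler is fitted on the training data and the same
# transformation is applied again automatically at prediction time.
model = build_scaled_svc().fit(X_demo, y_demo)
predictions = model.predict(X_demo)
```

Putting the scaler in the pipeline, rather than scaling by hand, keeps the training and prediction steps consistent.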

【Comments】: