Why is this simple Sklearn SVM off? Answers

【Question title】: Why is this simple Sklearn SVM off?
【Posted】: 2018-06-23 19:43:43
【Question description】:

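I've been studying the theory behind linear SVMs, and Python's Scikit-learn has an easy-to-use implementation... As a hypothetical example, say a cup of coffee on its own is disgusting, and so, naturally, is a cup of cream. Coffee with cream and sugar added seems to be popular, though, as every drive-through coffee shop attests. That naturally gives a simple plot with a line between (0,1) and (1,0) that separates off the good value (1,1)... yet the result for this simple example is inaccurate: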

from __future__ import division
# data points [coffee, cream]:
data = [[ 0,0 ], [ 0,1 ], [ 1,0 ], [ 1,1 ] ]

#Just last one is a positive experience
category = [ -1,  -1,  -1, 1 ]

import numpy
from sklearn.svm import SVC

clf = SVC(kernel='linear')
clf.fit(data, category)

#Get m coefficients:
coef = clf.coef_[0]
b = clf.intercept_[0]

print('This is the M*X+b=0 equation...')
print('M=%s' % (coef))
print('b=%s' % (b))
print('So the equation of the separating line in this 2d svm is:')
print('%f*x + %f*y + %f = 0' % (coef[0],coef[1],b))
print('The support vector limit lines are:')
print('%f*x + %f*y + %f = -1' % (coef[0],coef[1],b))
print('%f*x + %f*y + %f = 1' % (coef[0],coef[1],b))

vertmatrix = [[x] for x in coef]

good = 0
bad = 0
for i, d in enumerate(data):
    #i-th element, d in data:
    calculatedValue = numpy.dot(d, vertmatrix)[0] + b
    print( 'Mx+b for x=%s calculates to %s' % (d, calculatedValue) )
    if calculatedValue > 0 and category[i] > 0:
        good += 1
    elif calculatedValue < 0 and category[i] < 0:
        good += 1
    else:
        bad +=1 #they should have matched category.

print('accuracy=%f' % (good/(good+bad)) )
#The same as the builtin "score" accuracy:
print('accuracy=%f' % clf.score(data, category) )

【Question comments】:

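The data as given is imbalanced: one class accounts for 75% of it. So you need to tune the hyperparameters to compensate. Perhaps simply use class_weight like this: clf = SVC(kernel='linear',class_weight={-1:1, 1:2})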
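The comment's suggestion can be tried directly. The sketch below is only illustrative: it reuses the question's four points, and the weights `{-1: 1, 1: 2}` are the commenter's guess rather than tuned values. In scikit-learn, `class_weight` multiplies `C` per class, so the lone positive sample is penalised with an effective `C` of 2 here. Whether that is enough to classify it correctly is best read off the printed score rather than assumed.

```python
import numpy as np
from sklearn.svm import SVC

# data points [coffee, cream]; only the last one is a positive experience
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([-1, -1, -1, 1])

# Weights taken from the comment above; they are a starting guess, not tuned.
weighted = SVC(kernel='linear', class_weight={-1: 1, 1: 2})
weighted.fit(X, y)

# class_weight_ holds the per-class multipliers of C, ordered like classes_.
print('classes:', weighted.classes_)
print('C multipliers:', weighted.class_weight_)
print('decision values:', weighted.decision_function(X))
print('accuracy=%f' % weighted.score(X, y))
```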

Tags: python numpy scikit-learn svm

【Solution 1】:

Another approach is simply to add more data. Without changing anything in the algorithm, you get better results:

# data points [coffee, cream]:
data = [[ 0,0 ], [ 0,1 ], [ 1,0 ], [ 1,1 ] ] *5 # 5 times more data

#Just last one is a positive experience
category = [ -1,  -1,  -1, 1 ] * 5

The output will be:

This is the M*X+b=0 equation...
M=[ 2.  2.]
b=-3.0
So the equation of the separating line in this 2d svm is:
2.000000*x + 2.000000*y + -3.000000 = 0
The support vector limit lines are:
2.000000*x + 2.000000*y + -3.000000 = -1
2.000000*x + 2.000000*y + -3.000000 = 1
Mx+b for x=[0, 0] calculates to -3.0
Mx+b for x=[0, 1] calculates to -1.0
Mx+b for x=[1, 0] calculates to -1.0
Mx+b for x=[1, 1] calculates to 1.0
Mx+b for x=[0, 0] calculates to -3.0
Mx+b for x=[0, 1] calculates to -1.0
Mx+b for x=[1, 0] calculates to -1.0
Mx+b for x=[1, 1] calculates to 1.0
Mx+b for x=[0, 0] calculates to -3.0
Mx+b for x=[0, 1] calculates to -1.0
Mx+b for x=[1, 0] calculates to -1.0
Mx+b for x=[1, 1] calculates to 1.0
Mx+b for x=[0, 0] calculates to -3.0
Mx+b for x=[0, 1] calculates to -1.0
Mx+b for x=[1, 0] calculates to -1.0
Mx+b for x=[1, 1] calculates to 1.0
Mx+b for x=[0, 0] calculates to -3.0
Mx+b for x=[0, 1] calculates to -1.0
Mx+b for x=[1, 0] calculates to -1.0
Mx+b for x=[1, 1] calculates to 1.0
accuracy=1.000000
accuracy=1.000000

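【Comments】:

Why does this even work? The same data adds no extra discriminative value and yet makes a difference? Can you explain?
I noticed that adding at least one more positive, [1.01, 1.01], gives 100% accuracy. Can you hint at the math behind that?
Adding [1.01, 1.01] adds another, different sample. That is an entirely different thing.
An explanation of why this works: stats.stackexchange.com/questions/323191/…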

【Solution 2】:

You should tune the parameters a little (C in this case) rather than rely only on the default (C=1), since the default does not suit every problem:

C_values = [1e-4, 1e-3, 1e-2, 1e-1, 1, 10, 100, 1000]
for C in C_values:
    clf = SVC(kernel='linear', C=C)
    clf.fit(data, category)
    print('For C = {} results = {}'.format(C, clf.predict(data)))

You will see that it separates the data correctly beyond a certain point.

For C = 0.0001 results = [0 0 0 0]
For C = 0.001 results = [0 0 0 0]
For C = 0.01 results = [0 0 0 0]
For C = 0.1 results = [0 0 0 0]
For C = 1 results = [0 0 0 0]
For C = 10 results = [0 0 0 1]
For C = 100 results = [0 0 0 1]
For C = 1000 results = [0 0 0 1]
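To see why the prediction for the positive sample flips as C grows, it can help to look at the raw decision value w·x + b at [1, 1] instead of only the predicted label. The following is a minimal sketch using the question's data with its -1/1 labels; the list of C values and the names `positive_point` and `scores` are chosen here for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# data points [coffee, cream]; only the last one is a positive experience
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([-1, -1, -1, 1])

positive_point = X[[3]]  # the single positive sample, [1, 1], kept 2-D
scores = {}
for C in [0.1, 1, 10, 100]:
    clf = SVC(kernel='linear', C=C).fit(X, y)
    # A negative value means [1, 1] lies on the wrong side of the line.
    scores[C] = clf.decision_function(positive_point)[0]
    print('C = {}: w.x + b at [1, 1] = {:.3f}'.format(C, scores[C]))
```

A decision value below zero means the sample is assigned to the negative class, so the sign change between small and large C marks the point where the model stops sacrificing the positive sample.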

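Edit:

Regarding @AndreyF's answer: since I could not understand why it works either (as I said in the comments), I asked a question on Cross Validated here.

To summarise my understanding: the parameter C in the soft-margin formulation determines how much weight each sample is given. So when only a single sample is misclassified (as in the case above), it does not attract much attention (in other words, the penalty for that misclassification is very small). When the number of samples increases, the penalty increases too, which means they are taken into account more.

This is equivalent to increasing the parameter C, but I think manipulating C is more in line with the theory.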
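The claimed equivalence can be checked numerically. The sketch below, which assumes the question's four points, fits three models: one on five copies of the data with C=1, one on the original data with every sample weighted 5 through the `sample_weight` argument of `fit` (which rescales C per sample), and one on the original data with C=5. The variable names are illustrative. If the reasoning above holds, all three should arrive at essentially the same separating line.

```python
import numpy as np
from sklearn.svm import SVC

# data points [coffee, cream]; only the last one is a positive experience
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([-1, -1, -1, 1])

# (a) five copies of every sample, default C=1
dup = SVC(kernel='linear', C=1).fit(np.tile(X, (5, 1)), np.tile(y, 5))
# (b) original samples, each weighted 5 (scales C per sample)
wtd = SVC(kernel='linear', C=1).fit(X, y, sample_weight=[5, 5, 5, 5])
# (c) original samples, C raised to 5
big = SVC(kernel='linear', C=5).fit(X, y)

for name, model in [('duplicated x5', dup), ('sample_weight=5', wtd), ('C=5', big)]:
    print('%-16s M=%s b=%s predictions=%s' % (
        name, model.coef_[0], model.intercept_[0], model.predict(X)))
```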

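【Comments】:

"Setting C: C is 1 by default and it's a reasonable default choice. If you have a lot of noisy observations you should decrease it. It corresponds to regularize more the estimation." - so in this case the opposite applies, which makes sense... scikit-learn.org/stable/modules/svm.html#svm-classification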
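One way to make "regularize more" concrete is to look at the margin width 2/||w||: a smaller C favours a smaller ||w|| and therefore a wider margin, even at the cost of misclassifying a sample. The sketch below assumes the question's data and an arbitrary handful of C values; `widths` is an illustrative name.

```python
import numpy as np
from sklearn.svm import SVC

# data points [coffee, cream]; only the last one is a positive experience
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([-1, -1, -1, 1])

widths = {}
for C in [0.1, 1, 10]:
    clf = SVC(kernel='linear', C=C).fit(X, y)
    norm_w = np.linalg.norm(clf.coef_[0])
    # Distance between the lines M*X+b=-1 and M*X+b=1
    widths[C] = 2.0 / norm_w
    print('C = {}: ||M|| = {:.3f}, margin width = {:.3f}, accuracy = {:.2f}'.format(
        C, norm_w, widths[C], clf.score(X, y)))
```

With clean, separable data like this, there is no noise to regularize away, which is why increasing C rather than decreasing it is the appropriate direction here.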