We can start with the data you have:
library(mlbench)
library(caret)
data(BreastCancer)
BC = na.omit(BreastCancer[,-1])
str(BC)
'data.frame': 683 obs. of 10 variables:
$ Cl.thickness : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 5 5 3 6 4 8 1 2 2 4 ...
$ Cell.size : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 1 4 1 8 1 10 1 1 1 2 ...
$ Cell.shape : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 1 4 1 8 1 10 1 2 1 1 ...
$ Marg.adhesion : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 1 5 1 1 3 8 1 1 1 1 ...
$ Epith.c.size : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 2 7 2 3 2 7 2 2 2 2 ...
$ Bare.nuclei : Factor w/ 10 levels "1","2","3","4",..: 1 10 2 4 1 10 10 1 1 1 ...
$ Bl.cromatin : Factor w/ 10 levels "1","2","3","4",..: 3 3 3 3 3 9 3 3 1 2 ...
$ Normal.nucleoli: Factor w/ 10 levels "1","2","3","4",..: 1 2 1 7 1 7 1 1 1 1 ...
$ Mitoses : Factor w/ 9 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 5 1 ...
$ Class : Factor w/ 2 levels "benign","malignant": 1 1 1 1 1 2 1 1 1 1 ...
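BC is a data.frame, and you can see that all of the predictors are either categorical or ordinal. You are trying to fit svmRadial, meaning an SVM with a radial basis function kernel. Computing Euclidean distances between categorical features is not straightforward, and the problem becomes clearer if you look at the distribution of the categories: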
sapply(BC,table)
$Cl.thickness
1 2 3 4 5 6 7 8 9 10
139 50 104 79 128 33 23 44 14 69
$Cell.size
1 2 3 4 5 6 7 8 9 10
373 45 52 38 30 25 19 28 6 67
$Cell.shape
1 2 3 4 5 6 7 8 9 10
346 58 53 43 32 29 30 27 7 58
$Marg.adhesion
1 2 3 4 5 6 7 8 9 10
393 58 58 33 23 21 13 25 4 55
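When you train the model, the default resampling method is the bootstrap, so some of your training resamples will be missing the poorly represented levels, for example category 9 of Marg.adhesion in the table above. The variable for that level is then all zeros in that resample, which raises the error. It most likely has little effect on the overall result, because these levels are rare.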
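To illustrate how often this can happen, here is a minimal base-R sketch. It does not use caret or the real data: the factor `x` is simulated from the Marg.adhesion counts shown above, the number of resamples (1000) is an arbitrary choice, and a plain dummy encoding via model.matrix() is assumed for the last step.

```r
set.seed(1)

# Simulated stand-in for Marg.adhesion, built from the counts in the table above
counts <- c(393, 58, 58, 33, 23, 21, 13, 25, 4, 55)
x <- factor(rep(1:10, times = counts), levels = 1:10)

# Draw bootstrap resamples and record whether level "9" is absent from each
n_boot <- 1000  # arbitrary number of resamples for this illustration
missing9 <- replicate(n_boot, {
  idx <- sample(length(x), replace = TRUE)
  sum(x[idx] == "9") == 0
})

# Observed share of resamples without level 9, next to the theoretical value
mean(missing9)
(1 - 4 / length(x))^length(x)  # about 0.018

# A resample without any "9" still carries the level, so its dummy column is all zeros
boot <- x[x != "9"]
mm <- model.matrix(~ boot)
colSums(mm)[["boot9"]]
```

A chance of roughly 2% per resample looks small, but across the 25 bootstrap resamples that caret uses by default, and across several sparse levels, hitting at least one such resample becomes fairly likely.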
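One solution is to use cross-validation, where you are much less likely to end up with all of the rare observations in the held-out fold. Also note that when you have a data.frame containing factors and characters, you should never convert it to a matrix with as.matrix(). caret can handle the data.frame directly, like this: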
train(Class ~.,data=BC,method="svmRadial",trControl=trainControl(method="cv"))
Support Vector Machines with Radial Basis Function Kernel
683 samples
9 predictor
2 classes: 'benign', 'malignant'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 614, 615, 615, 615, 616, 615, ...
Resampling results across tuning parameters:
C Accuracy Kappa
0.25 0.9575654 0.9101995
0.50 0.9619346 0.9190284
1.00 0.9633838 0.9220161
Tuning parameter 'sigma' was held constant at a value of 0.01841092
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were sigma = 0.01841092 and C = 1.
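On the as.matrix() point above, a small sketch of what goes wrong. The data.frame `df` and its columns are made up for illustration and are not part of the question's data.

```r
# Toy data.frame with one numeric column and one factor column (hypothetical)
df <- data.frame(
  score = c(1.5, 2.0, 3.5),
  grade = factor(c("low", "high", "low"))
)

# as.matrix() needs a single type, so every column is coerced to character
m <- as.matrix(df)
is.character(m)

# model.matrix() instead expands the factor into numeric indicator columns
mm <- model.matrix(~ score + grade, data = df)
is.numeric(mm)
colnames(mm)
```

Passing the data.frame with a formula, as in the train() call above, leaves this encoding step to the modelling code rather than to a lossy character conversion.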
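If you want to keep using the bootstrap for resampling, another option is to drop the observations that fall in these rare levels, or to merge those levels with other ones.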
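Both options can be sketched in base R. This is only an illustration under assumptions: `marg` is simulated from the Marg.adhesion counts above rather than taken from BC, the threshold `min_n` is an arbitrary choice, and merging level 9 into its neighbour 10 is just one possible grouping.

```r
# Simulated stand-in for BC$Marg.adhesion, built from the counts shown earlier
counts <- c(393, 58, 58, 33, 23, 21, 13, 25, 4, 55)
marg <- factor(rep(1:10, times = counts), levels = 1:10)

# Levels with fewer observations than an (assumed) minimum count
min_n <- 10
rare <- names(which(table(marg) < min_n))
rare

# Option 1: merge the rare level with a neighbouring one.
# Assigning the same name to two levels collapses them into a single level.
merged <- marg
levels(merged)[levels(merged) %in% c("9", "10")] <- "9-10"
table(merged)

# Option 2: drop the rare observations and remove the now-empty level
keep <- !(marg %in% rare)
dropped <- droplevels(marg[keep])
table(dropped)
```

With the real data the same steps would be applied to each sparse column of BC before calling train(), and in the second option the row filter `keep` would be applied to the whole data.frame, not just to one column.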