R train, svmRadial "Cannot scale data": answers

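【Question title】: R train, svmRadial "Cannot scale data"
【Posted】: 2020-11-04 12:21:45
【Question】:

I am using R with the BreastCancer data frame. I want to use the train function from the caret package, but it fails with the error below. With a different data frame, the function works fine.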

library(mlbench)
library(caret)

data("breastCancer")
BC = na.omit(breastCancer[,-1])
a = train(Class~., data = as.matrix(BC), method = "svmRadial")

This is the error:

error : In .local(x, ...) : Variable(s) `' constant. Cannot scale data.

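【Comments】:

What is BC.train? You didn't mention it in the question.
Sorry. It is actually BC, not BC.train. I forgot to change it.
It seems the train() function is defined in the mlbench package. This is the error I get after running your code: Error in train(Class ~ ., data = as.matrix(BC), method = "svmRadial") : could not find function "train"
The train() function is not defined in the mlbench package but in the caren package, so to use it you have to install the caren package, although I didn't include that code in the question.
@nima I do get a warning with the same message when I run your code. An alternative to train from the caret package is svm from the e1071 package. It works fine for me, without warnings.

Tags: r svm r-caret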

【Solution 1】:

We can start with the data you have:

library(mlbench)
library(caret)

data(BreastCancer)
BC = na.omit(BreastCancer[,-1])

str(BC)

'data.frame':   683 obs. of  10 variables:
 $ Cl.thickness   : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 5 5 3 6 4 8 1 2 2 4 ...
 $ Cell.size      : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 1 4 1 8 1 10 1 1 1 2 ...
 $ Cell.shape     : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 1 4 1 8 1 10 1 2 1 1 ...
 $ Marg.adhesion  : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 1 5 1 1 3 8 1 1 1 1 ...
 $ Epith.c.size   : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 2 7 2 3 2 7 2 2 2 2 ...
 $ Bare.nuclei    : Factor w/ 10 levels "1","2","3","4",..: 1 10 2 4 1 10 10 1 1 1 ...
 $ Bl.cromatin    : Factor w/ 10 levels "1","2","3","4",..: 3 3 3 3 3 9 3 3 1 2 ...
 $ Normal.nucleoli: Factor w/ 10 levels "1","2","3","4",..: 1 2 1 7 1 7 1 1 1 1 ...
 $ Mitoses        : Factor w/ 9 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 5 1 ...
 $ Class          : Factor w/ 2 levels "benign","malignant": 1 1 1 1 1 2 1 1 1 1 ...

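BC is a data.frame, and you can see that all the predictors are categorical or ordinal. You are trying to fit svmRadial, that is, an SVM with a radial basis function kernel. Computing Euclidean distances between categorical features is not straightforward, and if you look at the distribution of the categories: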

sapply(BC,table)
$Cl.thickness

  1   2   3   4   5   6   7   8   9  10 
139  50 104  79 128  33  23  44  14  69 

$Cell.size

  1   2   3   4   5   6   7   8   9  10 
373  45  52  38  30  25  19  28   6  67 

$Cell.shape

  1   2   3   4   5   6   7   8   9  10 
346  58  53  43  32  29  30  27   7  58 

$Marg.adhesion

  1   2   3   4   5   6   7   8   9  10 
393  58  58  33  23  21  13  25   4  55

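When you train the model, the default resampling method is the bootstrap, so some of your training resamples will be missing the sparsely represented levels, for example category 9 of Marg.adhesion in the table above. The dummy variable for that level is then all zeros in that resample, which raises the error. It most likely does not affect the overall result much, because these levels are rare.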
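To make the mechanism concrete, here is a minimal base-R sketch. The toy data frame and the hand-picked resample indices are assumptions for illustration only, not the BreastCancer data, and the zero-standard-deviation check merely imitates the condition that makes scaling impossible. It shows that once the only row carrying a rare level is left out of a resample, that level's dummy column becomes constant.

```r
# Toy data (assumption): an unordered factor whose level "9" occurs only once
toy <- data.frame(
  adhesion = factor(c(rep("1", 6), rep("2", 3), "9")),
  y = factor(rep(c("benign", "malignant"), 5))
)

# Dummy-code the predictor on the full data and drop the intercept column
X <- model.matrix(y ~ adhesion, data = toy)[, -1, drop = FALSE]

# A hand-picked "unlucky" bootstrap resample: 10 rows drawn with replacement,
# none of which is row 10, the only row with level "9"
idx <- c(1, 1, 2, 3, 5, 6, 7, 8, 8, 9)
X_boot <- X[idx, , drop = FALSE]

# A column with zero standard deviation cannot be centred and scaled
col_sd <- apply(X_boot, 2, sd)
constant_cols <- names(col_sd)[col_sd == 0]
print(constant_cols)
```

In this toy resample the only constant column is the dummy for level "9", which mirrors what happens to a rare level in an unlucky bootstrap resample of the real data.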

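One solution is to use cross-validation (it is unlikely that all of the rare observations end up in the test fold). Note also that when you have a data.frame with factors and characters, you should never convert it to a matrix with as.matrix(). caret can handle the data.frame like this: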

train(Class ~.,data=BC,method="svmRadial",trControl=trainControl(method="cv"))
Support Vector Machines with Radial Basis Function Kernel 

683 samples
  9 predictor
  2 classes: 'benign', 'malignant' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 614, 615, 615, 615, 616, 615, ... 
Resampling results across tuning parameters:

  C     Accuracy   Kappa    
  0.25  0.9575654  0.9101995
  0.50  0.9619346  0.9190284
  1.00  0.9633838  0.9220161

Tuning parameter 'sigma' was held constant at a value of 0.01841092
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were sigma = 0.01841092 and C = 1.
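As a rough back-of-the-envelope check of why cross-validation is less exposed to this problem than the bootstrap, the sketch below estimates how often a level with only 4 observations (as for category 9 of Marg.adhesion) would be absent from the training part. It assumes rows are drawn independently and that fold membership is uniform at random, which ignores caret's stratification on the outcome, so treat the numbers as orders of magnitude only.

```r
n <- 683        # rows after na.omit
k <- 4          # observations in the rare level
n_boot <- 25    # caret's default number of bootstrap resamples
n_folds <- 10   # folds used with trainControl(method = "cv")

# Bootstrap: each of the n draws must miss all k rare rows
p_boot <- (1 - k / n)^n

# Chance that at least one of the 25 bootstrap resamples lacks the level
p_boot_any <- 1 - (1 - p_boot)^n_boot

# 10-fold CV: the level is missing from a training set only if all k rare
# rows fall into the same held-out fold (approximate, see assumptions above)
p_cv_any <- n_folds * (1 / n_folds)^k

round(c(bootstrap_single = p_boot,
        bootstrap_any    = p_boot_any,
        cv_any           = p_cv_any), 4)
```

Under these assumptions a single bootstrap resample misses the level about 1.8% of the time, which adds up to roughly a 37% chance over 25 resamples, against about 0.1% for 10-fold cross-validation.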

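If you want to keep the bootstrap for resampling, another option is either to drop the observations that fall in these sparse levels or to merge those levels with others.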
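Merging sparse levels could be sketched as follows in base R. The helper name lump_rare, the threshold min_n = 10 and the label "other" are hypothetical choices for illustration; the example vector only imitates some of the Marg.adhesion counts shown earlier.

```r
# Hypothetical helper: merge every level with fewer than min_n observations
# into a single level named `other`
lump_rare <- function(f, min_n = 10, other = "other") {
  counts <- table(f)
  rare <- names(counts)[counts < min_n]
  # Assigning the same name to several levels merges them
  levels(f)[levels(f) %in% rare] <- other
  f
}

# Illustrative factor imitating part of the Marg.adhesion distribution
x <- factor(c(rep("1", 393), rep("2", 58), rep("9", 4), rep("10", 55)))
lumped <- lump_rare(x)
table(lumped)
```

To apply this to the real data you would run the helper over the predictor columns only, not over Class. For the ordered factors, keep in mind that merging levels changes what the ordering means.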

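【Comments】:

【Solution 2】:

Your code contains a few typos: the package name is caret, not caren, and the dataset name is BreastCancer, not breastCancer. You can get rid of the error with the following code: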

library(mlbench)
library(caret)

data(BreastCancer)
BC = na.omit(BreastCancer[,-1])
a = train(Class~., data = as.matrix(BC), method = "svmRadial")

It returns:

#> Support Vector Machines with Radial Basis Function Kernel 
#> 
#> 683 samples
#>   9 predictor
#>   2 classes: 'benign', 'malignant' 
#> 
#> No pre-processing
#> Resampling: Bootstrapped (25 reps) 
#> Summary of sample sizes: 683, 683, 683, 683, 683, 683, ... 
#> Resampling results across tuning parameters:
#> 
#>   C     Accuracy   Kappa    
#>   0.25  0.9550137  0.9034390
#>   0.50  0.9585504  0.9107666
#>   1.00  0.9611485  0.9161541
#> 
#> Tuning parameter 'sigma' was held constant at a value of 0.02349173
#> Accuracy was used to select the optimal model using the largest value.
#> The final values used for the model were sigma = 0.02349173 and C = 1.

【Comments】: