PCA threshold tuning in caret

【Question title】: PCA threshold tuning in Caret
【Posted】: 2020-09-15 08:51:43
【Question】:

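I am trying to build a classifier from some data using caret. One approach I want to try is a simple LDA fitted on data preprocessed with PCA. I found out how to do this with caret: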

fitControl <- trainControl("repeatedcv", number=10, repeats = 10,
                                preProcOptions = list(thresh = 0.9))
ldaFit1 <- train(label ~ ., data = tab,
                method = "lda2",
                preProcess = c("center", "scale", "pca"),
                trControl = fitControl)

As expected, caret compares the accuracy of the LDA for different values of dimen:

Linear Discriminant Analysis

 158 samples
1955 predictors
   3 classes: '1', '2', '3'

Pre-processing: centered (1955), scaled (1955), principal component
 signal extraction (1955)
Resampling: Cross-Validated (10 fold, repeated 10 times)
Summary of sample sizes: 142, 142, 143, 142, 143, 142, ...
Resampling results across tuning parameters:

  dimen  Accuracy   Kappa
  1      0.5498987  0.1151681
  2      0.5451340  0.1298590

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was dimen = 1.

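What I would like to do is add the PCA threshold to the tuning parameters, but I cannot find a way to do so.

Is there a simple solution for this in caret? Or do I need to repeat the training step with different preprocessing options and pick the best value at the end?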

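【Comments】:

Check this answer: stackoverflow.com/questions/59452615/… It answers a more complex question, but it shows how to tune the number of retained PCA components with the mlr3 package. If you prefer caret, check this option, which uses recipes: r-bloggers.com/…

Tags: r machine-learning pca r-caret

【Solution 1】: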

Thanks to the link pointed out by misuse, I managed to integrate the variance-explained threshold for PCA into the parameter tuning:

library(caret)
library(recipes)
library(modelgrid)  # provides model_grid(), share_settings(), add_model()
library(MASS)

# Setting up a vector of thresholds to try out
pca_varex <- c(0.8, 0.9, 0.95, 0.97, 0.98, 0.99, 0.995, 0.999)

# Setting up recipe (`train` is the training data frame, presumably `tab` from the question)
initial_recipe <- recipe(train, formula = label ~ .) %>%
                    step_center(all_predictors()) %>%
                    step_scale(all_predictors())

# Define the modelgrid
models <- model_grid() %>%
            share_settings(data = train,
                            trControl = caret::trainControl(method = "repeatedcv",
                                                        number = 10,
                                                        repeats = 10),
                            method = "lda2") 

# Add models with different PCA thresholds
for (i in pca_varex) {
    models <- models %>% add_model(model_name = sprintf("varex_%s", i),
                                    x = initial_recipe %>%
                                        step_pca(all_predictors(), threshold = i))
}

# Train
models <- models %>% train(.)
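
To get a feel for what each value in pca_varex means, the following base R sketch counts how many principal components are needed to reach a given share of explained variance. It is only an illustration: n_components is a hypothetical helper (not part of caret, recipes or modelgrid), the built-in iris data stands in for the real predictors, and the exact selection rule used internally by step_pca() or caret may differ in edge cases.

```r
# Hypothetical helper: smallest number of principal components whose
# cumulative explained variance reaches `threshold`.
n_components <- function(x, threshold) {
  pca <- prcomp(x, center = TRUE, scale. = TRUE)   # centered, scaled PCA
  cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)  # cumulative variance share
  which(cum_var >= threshold)[1]
}

# Same thresholds as in the answer above
pca_varex <- c(0.8, 0.9, 0.95, 0.97, 0.98, 0.99, 0.995, 0.999)

# iris is used here only as stand-in data (an assumption for illustration)
n_comp <- sapply(pca_varex, n_components, x = iris[, 1:4])
print(data.frame(threshold = pca_varex, n_comp = n_comp))
```

Several thresholds can map to the same number of components, in which case the corresponding models in the grid are effectively identical.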

That said, having looked through the modelgrid and recipes documentation, the tidymodels packages are probably the most straightforward way to do this (https://www.tidymodels.org/).
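
For comparison, the manual alternative mentioned in the question (repeating the training step with different preProcOptions and picking the best value afterwards) could be sketched roughly as below. This is a minimal sketch under assumptions, not the approach used above: it uses the built-in iris data, plain "lda" instead of "lda2", 5-fold CV, and a shortened list of thresholds to keep it quick.

```r
library(caret)

# Thresholds to compare (a shortened, illustrative list)
thresholds <- c(0.8, 0.9, 0.95, 0.99)

results <- do.call(rbind, lapply(thresholds, function(t) {
  set.seed(1)  # same seed per run so the CV folds are comparable
  ctrl <- trainControl(method = "cv", number = 5,
                       preProcOptions = list(thresh = t))
  fit <- train(Species ~ ., data = iris,
               method = "lda",
               preProcess = c("center", "scale", "pca"),
               trControl = ctrl)
  data.frame(thresh   = t,
             n_comp   = fit$preProcess$numComp,  # components kept in the final model
             Accuracy = max(fit$results$Accuracy))
}))

# Pick the threshold with the highest resampled accuracy
best <- results[which.max(results$Accuracy), ]
print(results)
print(best)
```

With this approach the threshold is selected outside of train(), so the accuracy of the winning model is slightly optimistic unless it is checked on separate data.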

【Discussion】: