【Question title】: Loop in R trying different features in logistic regression fit to find best AUC score?
【Posted】: 2020-11-01 19:39:53
【Question description】:

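I am using the Iris dataset to fit a logistic regression. I want to fit every combination of features and see which one gives me the best AUC score.

For instance, I will be fitting 4 * 3 * 2 * 1 = 24 models. This is essentially permuting over each combination of features. I want to output the results to a table to see which combination gives me the best score.

First 3 rows of the dataset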

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa

Fitting one model

Here I fit just one of the models and get its AUC

## make it binary classification

library(ROCR)
library(tidyverse)
iris.small <- iris %>%
  filter(Species %in% c("virginica", "versicolor"))

is.na(iris.small$Species) <- iris.small$Species == "setosa"
iris.small$Species <- factor(iris.small$Species)

## 75% of the sample size
smp_size <- floor(0.75 * nrow(iris.small))

## set the seed to make your partition reproducible
set.seed(123)
train_ind <- sample(seq_len(nrow(iris.small)), size = smp_size)

train <- iris.small[train_ind, ]
test <- iris.small[-train_ind, ]

mod1 <- glm(Species ~ .,
  data = train,
  family = binomial(link = "logit")
)

pred_probs <- predict(mod1, newdata = test, type = "response")


pred_obj <- ROCR::prediction(pred_probs, test$Species)
perf_obj <- ROCR::performance(pred_obj, measure = "tpr", x.measure = "fpr")

auc <- performance(pred_obj, measure = "auc")
auc <- auc@y.values[[1]]

print(auc)

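Expected output: a table of AUC scores, one row per fit. It would have two columns: the features used in the fit and the AUC score.

Also, is this a good idea in general? Fitting 24 models may not seem ideal, but I am not sure how else to determine which combination of features is best.

Thanks for your help.

【Comments】:

Why 24 models? If you have 4 features, shouldn't it be model1: 4 features, model2: 3 features, model3: 2 features, and model4: 1 feature?
I would do a fit for each individual feature: that would be 4 models. Then fit every combination of two features, and so on.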

Tags: r loops machine-learning tidyverse logistic-regression

【Solution 1】:

You can try this. First, I list the combinations of column indices.

library(plyr)
#Index
a1 <- as.data.frame(t(combn(1:4,1)))
a2 <- as.data.frame(t(combn(1:4,2)))
a3 <- as.data.frame(t(combn(1:4,3)))
a4 <- as.data.frame(t(combn(1:4,4)))
#Cols contains the combinations
Cols <- do.call(rbind.fill,list(a1,a2,a3,a4))

This gives us a way to select the variables by index. Each row is one combination, padded with NA:

   V1 V2 V3 V4
1   1 NA NA NA
2   2 NA NA NA
3   3 NA NA NA
4   4 NA NA NA
5   1  2 NA NA
6   1  3 NA NA
7   1  4 NA NA
8   2  3 NA NA
9   2  4 NA NA
10  3  4 NA NA
11  1  2  3 NA
12  1  2  4 NA
13  1  3  4 NA
14  2  3  4 NA
15  1  2  3  4
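
The same 15 subsets can also be built without plyr, as a list of character vectors of column names. This is a minimal sketch using only base R; the names `features`, `subsets`, and `formulas` are illustrative and do not appear in the answer.

```r
# Names of the four predictor columns in iris
features <- setdiff(names(iris), "Species")

# For each subset size k = 1..4, list every combination of k feature names
subsets <- unlist(
  lapply(seq_along(features), function(k) combn(features, k, simplify = FALSE)),
  recursive = FALSE
)

# Number of non-empty subsets of 4 features: 2^4 - 1 = 15
length(subsets)

# Turn each subset into a model formula string
formulas <- vapply(
  subsets,
  function(v) paste("Species ~", paste(v, collapse = "+")),
  character(1)
)
head(formulas, 3)
```

Working with names rather than indices avoids the NA padding and the later lookup through `names(iris.small)`.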

The following code computes what you need:

## make it binary classification
library(ROCR)
library(tidyverse)
iris.small <- iris %>%
  filter(Species %in% c("virginica", "versicolor"))
is.na(iris.small$Species) <- iris.small$Species == "setosa"
iris.small$Species <- factor(iris.small$Species)
## 75% of the sample size
smp_size <- floor(0.75 * nrow(iris.small))
## set the seed to make your partition reproducible
set.seed(123)
train_ind <- sample(seq_len(nrow(iris.small)), size = smp_size)
train <- iris.small[train_ind, ]
test <- iris.small[-train_ind, ]
#Now create a list
List <- list()
#Iterate over all rows of Cols (one row per feature combination)
for(i in seq_len(nrow(Cols)))
{
  #Variables
  #Define target and covariates
  vals <- as.vector(na.omit(t(Cols[i,])))
  target <- "Species"
  vars <- names(iris.small)[vals]
  #Create the formula string, then convert it to a formula
  string <- paste(target,'~', paste(vars, collapse= "+"))
  fmla <- as.formula(string)
  #Model 
  mod1 <- glm(formula = fmla,
              data = train,
              family = binomial(link = "logit"))
  pred_probs <- predict(mod1, newdata = test, type = "response")
  
  
  pred_obj <- ROCR::prediction(pred_probs, test$Species)
  perf_obj <- ROCR::performance(pred_obj, measure = "tpr", x.measure = "fpr")
  
  auc <- performance(pred_obj, measure = "auc")
  auc <- auc@y.values[[1]]
  #Save results
  output <- data.frame(Model=string,auc=auc)
  #Feed into list
  List[[i]] <- output
}
#Format as dataframe
DFResult <- do.call(rbind,List)
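
The loop body above can also be wrapped in a function and applied over the feature subsets. The sketch below is self-contained and uses only base R. Instead of ROCR it computes AUC with the rank (Mann-Whitney) formula, which is a substitution made here and not the answer's method. The helper names `auc_rank` and `fit_auc` are hypothetical.

```r
# Two-class subset of iris with the unused factor level dropped
iris.small <- droplevels(subset(iris, Species %in% c("virginica", "versicolor")))

# 75/25 train/test split with a fixed seed
set.seed(123)
train_ind <- sample(seq_len(nrow(iris.small)), size = floor(0.75 * nrow(iris.small)))
train <- iris.small[train_ind, ]
test  <- iris.small[-train_ind, ]

# AUC via the rank formula: the probability that a randomly chosen
# positive case gets a higher score than a randomly chosen negative case
auc_rank <- function(scores, is_positive) {
  r <- rank(scores)                      # average ranks handle ties
  n_pos <- sum(is_positive)
  n_neg <- sum(!is_positive)
  (sum(r[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Fit one logistic regression on `vars` and return its test-set AUC
fit_auc <- function(vars) {
  fmla <- reformulate(vars, response = "Species")
  # Separation warnings from glm are silenced here; inspect them in real use
  mod <- suppressWarnings(glm(fmla, data = train, family = binomial(link = "logit")))
  probs <- predict(mod, newdata = test, type = "response")
  # glm treats the second factor level ("virginica") as the positive class
  auc_rank(probs, test$Species == levels(test$Species)[2])
}

# All non-empty subsets of the four feature names
features <- setdiff(names(iris.small), "Species")
subsets <- unlist(
  lapply(seq_along(features), function(k) combn(features, k, simplify = FALSE)),
  recursive = FALSE
)

# One row per subset: the features used and the resulting AUC
DFResult <- data.frame(
  Model = vapply(subsets, paste, character(1), collapse = "+"),
  auc   = vapply(subsets, fit_auc, numeric(1)),
  stringsAsFactors = FALSE
)
```

Using `vapply` removes the need to grow a list inside a loop and then bind it with `do.call(rbind, ...)`.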

You will get:

                                                         Model       auc
1                                       Species ~ Sepal.Length 0.7852564
2                                        Species ~ Sepal.Width 0.5128205
3                                       Species ~ Petal.Length 0.9903846
4                                        Species ~ Petal.Width 1.0000000
5                           Species ~ Sepal.Length+Sepal.Width 0.7403846
6                          Species ~ Sepal.Length+Petal.Length 1.0000000
7                           Species ~ Sepal.Length+Petal.Width 1.0000000
8                           Species ~ Sepal.Width+Petal.Length 0.9935897
9                            Species ~ Sepal.Width+Petal.Width 1.0000000
10                          Species ~ Petal.Length+Petal.Width 1.0000000
11             Species ~ Sepal.Length+Sepal.Width+Petal.Length 1.0000000
12              Species ~ Sepal.Length+Sepal.Width+Petal.Width 1.0000000
13             Species ~ Sepal.Length+Petal.Length+Petal.Width 1.0000000
14              Species ~ Sepal.Width+Petal.Length+Petal.Width 1.0000000
15 Species ~ Sepal.Length+Sepal.Width+Petal.Length+Petal.Width 1.0000000
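
To see which combinations score best, the table can be sorted by AUC. This is a small sketch that uses a few rows copied from the table above as a stand-in for the full `DFResult`. Breaking ties in favour of fewer features is an assumption added here, not part of the answer, and the `n_features` column is illustrative.

```r
# A few rows copied from the table above, standing in for the full DFResult
DFResult <- data.frame(
  Model = c("Species ~ Sepal.Length",
            "Species ~ Sepal.Width",
            "Species ~ Petal.Width",
            "Species ~ Sepal.Length+Petal.Length"),
  auc = c(0.7852564, 0.5128205, 1.0000000, 1.0000000),
  stringsAsFactors = FALSE
)

# Count features per model: one more than the number of "+" signs
DFResult$n_features <- lengths(strsplit(DFResult$Model, "+", fixed = TRUE))

# Highest AUC first; among ties, the model with fewer features first
ranked <- DFResult[order(-DFResult$auc, DFResult$n_features), ]
print(ranked, row.names = FALSE)
```

Many models tie at an AUC of 1 on this small test set, so some tie-breaking rule is needed before any single combination can be called the best.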

【Comments】: