【Question title】: Test quality of logistic regression model using confusion matrix and ROC curve
【Posted】: 2014-09-03 04:15:27
【Question】:

I am trying to do classification using a logistic regression model.

Here is what I have so far:

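library(ROCR)

data <- read.csv("c:/InsideNetwork.csv")
# Sample 3000 active and 6000 inactive rows for training; the rest is the test set
s1 <- sample(which(data$Active == 1), 3000)
s2 <- sample(which(data$Active == 0), 6000)
train <- data[c(s1, s2), ]
test  <- data[-c(s1, s2), ]

# Fit the logistic regression and score the test set with predicted probabilities
m <- glm(Active ~ Var1 + Var2 + Var3, data = train, family = binomial())
test$score <- predict(m, newdata = test, type = "response")

# ROC curve: true positive rate against false positive rate
pred <- prediction(test$score, test$Active)
perf <- performance(pred, "tpr", "fpr")
plot(perf, lty = 1)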

This gives me a nice ROC plot, but how do I create a confusion matrix?

【Question comments】:

Tags: r logistic-regression

【Answer 1】:

Use the helper function below:

pred_df <- data.frame(dep_var = test$Active, score = test$score)
confusion_matrix(pred_df, cutoff = 0.2)

For example,

confusion_matrix(data.frame(score = rank(iris$Sepal.Length)/nrow(iris),
                 dep_var = as.integer(iris$Species != 'setosa')), cutoff = 0.5)

#             score = 0 score = 1
# dep_var = 0        49         1
# dep_var = 1        24        76
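
As a cross-check, the same counts can be obtained with base R's table(). This is a minimal sketch and not part of the original answer; the names `actual`, `predicted`, and `cm` are illustrative.

```r
# Same iris example, cross-tabulated with base R instead of the helper.
score  <- rank(iris$Sepal.Length) / nrow(iris)
actual <- as.integer(iris$Species != "setosa")

# Predict 1 when the score is strictly above the cutoff (as the helper does).
predicted <- as.integer(score > 0.5)

# Rows are the true labels, columns the predicted labels.
cm <- table(dep_var = actual, score = predicted)
print(cm)
```

Unlike the helper, table() omits the row or column of a class that never occurs; wrapping the inputs in factor(..., levels = 0:1) keeps the 2x2 shape.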

Helper function:

#' Plot a confusion matrix for a given prediction set, and return the table.
#'
#' @param dataframe data.frame. Must contain \code{score} and \code{dep_var}
#'    columns. The confusion matrix will be calculated for these values.
#'    The mentioned columns must both be numeric.
#' @param cutoff numeric. Scores strictly greater than the cutoff are assigned
#'    a 1 for prediction purposes, and 0 otherwise. The default is 0.2.
#' @param plot.it logical. Whether or not to plot the confusion matrix as a
#'    four fold diagram. The default is \code{TRUE}.
#' @param xlab character. The labels for the rows (\code{dep_var}). The default
#'    is \code{c("dep_var = 0", "dep_var = 1")}.
#' @param ylab character. The labels for the columns (\code{score}). The default
#'    is \code{c("score = 0", "score = 1")}.
#' @param title character. The title for the fourfoldplot, if it is graphed.
#' @return a table. The confusion matrix table.
confusion_matrix <- function(dataframe, cutoff = 0.2, plot.it = TRUE,
                             xlab = c("dep_var = 0", "dep_var = 1"),
                             ylab = c("score = 0", "score = 1"), title = NULL) {
  stopifnot(is.data.frame(dataframe) &&
              all(c('score', 'dep_var') %in% colnames(dataframe)))
  stopifnot(is.numeric(dataframe$score) && is.numeric(dataframe$dep_var))


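  # Binarize the scores: 1 if strictly above the cutoff, 0 otherwise
  dataframe$score <- ifelse(dataframe$score <= cutoff, 0, 1)
  # Encode each (score, dep_var) pair as 0..3, then count the four cells;
  # column-wise fill puts dep_var on the rows and score on the columns
  categories <- dataframe$score * 2 + dataframe$dep_var
  confusion <- matrix(tabulate(1 + categories, 4), nrow = 2)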
  colnames(confusion) <- ylab
  rownames(confusion) <- xlab
  if (plot.it) fourfoldplot(confusion, color = c("#CC6666", "#99CC99"),
                            conf.level = 0, margin = 1, main = title)
  confusion

}
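
The cutoff controls the trade-off between the two kinds of error: raising it produces fewer false positives and more false negatives. The sketch below tabulates both across a grid of cutoffs. It runs on simulated data that has nothing to do with the question's dataset, and `actual`, `score`, and `sweep_result` are illustrative names.

```r
# Simulated labels and scores, purely for illustration (not the asker's data).
set.seed(1)
actual <- rbinom(1000, 1, 0.1)
score  <- plogis(rnorm(1000, mean = ifelse(actual == 1, 0, -2)))

# For each candidate cutoff, count false positives and false negatives,
# using the same rule as the helper: predict 1 when score > cutoff.
cutoffs <- seq(0.1, 0.9, by = 0.1)
sweep_result <- data.frame(
  cutoff    = cutoffs,
  false_pos = sapply(cutoffs, function(k) sum(score > k & actual == 0)),
  false_neg = sapply(cutoffs, function(k) sum(score <= k & actual == 1))
)
print(sweep_result)
```

Which row is preferable depends on the relative cost of the two error types in the application.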

【Comments】:

Thanks a lot, perfect solution. My result for confusion_matrix(pred_df, cutoff = 0.5) is: dep_var = 0: 245129 (score = 0), 16368 (score = 1); dep_var = 1: 626 (score = 0), 5130 (score = 1). Is that OK?
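It depends on your application. Your results seem to lean heavily toward Type 1 errors. Try moving the cutoff up and down until the bottom-left and top-right (red) regions are smaller in area than the other (green) regions.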
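The data is about a telecom network. I am trying to find users who are likely to leave the operator. I changed the cutoff to 0.5 and the results are: churn = 0: 250616 (score = 0), 10881 (score = 1); churn = 1: 729 (score = 0), 5027 (score = 1). I interpret this as follows: with this model I can find about 16k users classified as likely to leave the operator, of whom about 30% will actually do so. Compared with a random model at ~2%, I think this is a good result.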
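I've been spoiled by GBM, but your interpretation is correct, and that is not bad for logistic regression. :)
OK, I see. Could you recommend any reading material on better machine learning algorithms and models that could be used on my data? I can see you also have Polish roots; I'm from Poland :D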
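
The interpretation given in the comments above (roughly 16k flagged users, about 30% of whom actually churn, against a base rate of about 2%) can be checked directly from the quoted counts. A minimal sketch; the variable names, and the `lift` and `recall` figures, are added here for illustration and do not appear in the thread.

```r
# Counts quoted in the comment thread (rows: true churn, columns: prediction).
cm <- matrix(c(250616, 729, 10881, 5027), nrow = 2,
             dimnames = list(c("churn = 0", "churn = 1"),
                             c("score = 0", "score = 1")))

true_pos  <- cm["churn = 1", "score = 1"]
flagged   <- sum(cm[, "score = 1"])          # users predicted to churn
precision <- true_pos / flagged              # share of flagged users who churn
base_rate <- sum(cm["churn = 1", ]) / sum(cm)  # overall churn rate
lift      <- precision / base_rate           # improvement over random targeting
recall    <- true_pos / sum(cm["churn = 1", ])  # share of churners caught

print(round(c(flagged = flagged, precision = precision,
              base_rate = base_rate, lift = lift, recall = recall), 3))
```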