How can I find the c-statistic or AUROC using logistic regression in R? Answer

【Question title】: How can I find the c-statistic or AUROC using logistic regression in R?
【Posted】: 2020-09-26 00:25:21
【Question】:

I am running a logistic regression to understand how these factors/variables affect the outcome (neurologic complication).

How do I obtain the c-statistic, also known as the area under the receiver operating characteristic curve (AUROC)?

    NeuroLogit2 <- glm(`Neurologic Complication?` ~ HTN + stroke + Gender + Embol + Drain, data=Tevar.new, family=binomial)
    summary(NeuroLogit2)

【Comments】:

Tags: r database statistics medical statistical-test

【Solution 1】:

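Well, obviously I don't have your data, so let's make some up. Here, we'll suppose we are modelling the probability of people catching a cold in any given year based on their age and sex. Our outcome variable is simply 1 for "caught a cold" and 0 for "didn't catch a cold".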

set.seed(69)
outcome <- c(rbinom(1000, 1, seq(0.4, 0.6, length.out = 1000)),
             rbinom(1000, 1, seq(0.3, 0.5, length.out = 1000)))
sex     <- rep(c("M", "F"), each = 1000)
age     <- rep((601:1600)/20, 2)

df      <- data.frame(outcome, age, sex)

Now we'll create the model and have a look at it:

my_mod  <- glm(outcome ~ age + sex, data = df, family = binomial())

summary(my_mod)
#> 
#> Call:
#> glm(formula = outcome ~ age + sex, family = binomial(), data = df)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -1.3859  -1.0993  -0.8891   1.1847   1.5319  
#> 
#> Coefficients:
#>             Estimate Std. Error z value Pr(>|z|)    
#> (Intercept) -1.20917    0.18814  -6.427 1.30e-10 ***
#> age          0.01346    0.00317   4.246 2.18e-05 ***
#> sexM         0.61000    0.09122   6.687 2.28e-11 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 2760.1  on 1999  degrees of freedom
#> Residual deviance: 2697.1  on 1997  degrees of freedom
#> AIC: 2703.1
#> 
#> Number of Fisher Scoring iterations: 4

Looks good. Older people and men are more likely to catch colds.

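Now suppose we want to use this model to predict whether a person of a given age and sex will catch a cold in the next year. If we use the predict function with type = "response", we get a probability estimate for each person in our data frame, based on their age and sex.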

predictions <- predict(my_mod, type = "response")
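To see what type = "response" changes, here is a minimal sketch on separate made-up data (`toy` and `toy_mod` are hypothetical names, not objects from the example above). By default, predict returns values on the log-odds (link) scale; type = "response" returns the same values passed through the inverse logit, so they lie between 0 and 1.

```r
# Hypothetical toy data: one continuous predictor and a binary outcome
set.seed(1)
toy   <- data.frame(x = rnorm(100))
toy$y <- rbinom(100, 1, plogis(0.5 * toy$x))

toy_mod <- glm(y ~ x, data = toy, family = binomial())

log_odds <- predict(toy_mod)                     # default: type = "link" (log-odds)
probs    <- predict(toy_mod, type = "response")  # probabilities

# The probabilities are the inverse logit of the log-odds
all.equal(unname(plogis(log_odds)), unname(probs))

# Probabilities always fall strictly between 0 and 1
range(probs)
```

Either scale gives the same ROC curve, because the inverse logit is monotonic and the ROC depends only on how the predictions are ranked.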

We can use these probabilities to construct our ROC. Here I'll use the pROC package to help:

library(pROC)

roc(outcome, predictions)
#> Setting levels: control = 0, case = 1
#> Setting direction: controls < cases
#> 
#> Call:
#> roc.default(response = outcome, predictor = predictions)
#> 
#> Data: predictions in 1079 controls (outcome 0) < 921 cases (outcome 1).
#> Area under the curve: 0.6027

So the area under the ROC curve is 0.6027, or 60.27%. We can plot the ROC itself to see what it looks like:

library(ggplot2)

ggroc(roc(outcome, predictions)) +
  theme_minimal() + 
  ggtitle("My ROC curve") + 
  geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1), color="grey", linetype="dashed")
#> Setting levels: control = 0, case = 1
#> Setting direction: controls < cases

Created on 2020-06-07 by the reprex package (v0.3.0)
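As a cross-check on what this number means, here is a minimal base-R sketch of the c-statistic as a concordance probability: the chance that a randomly chosen case gets a higher predicted probability than a randomly chosen control, with ties counted as one half. `c_statistic` is a hypothetical helper written for illustration, not a pROC function; it uses the rank-based (Mann-Whitney) formula and assumes the outcome is coded 0/1 with no missing values.

```r
# Hypothetical helper: c-statistic (AUC) from the Mann-Whitney rank formula.
# Assumes `outcome` is coded 0/1 and neither vector contains NA.
c_statistic <- function(outcome, predicted) {
  stopifnot(length(outcome) == length(predicted))
  r  <- rank(predicted)      # average ranks, so ties count as 1/2
  n1 <- sum(outcome == 1)    # number of cases
  n0 <- sum(outcome == 0)    # number of controls
  (sum(r[outcome == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Tiny hand-checkable example: 3 of the 4 case/control pairs are concordant
toy_outcome <- c(0, 0, 1, 1)
toy_pred    <- c(0.1, 0.4, 0.35, 0.8)
c_statistic(toy_outcome, toy_pred)
#> [1] 0.75

# If every prediction is identical (e.g. all rounded to 0), every pair is tied
c_statistic(toy_outcome, rep(0, 4))
#> [1] 0.5
```

The second call shows why collapsing the predictions to a single value gives an AUC of exactly 0.5: the model no longer ranks anyone above anyone else.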

【Discussion】:

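For some reason, the round function in "round(predict(my_mod, type = "response"))" makes everything 0, and the AUC is 0.5. Is this supposed to happen? My variables are all categorical (1 and 0), not continuous like the "age" example you used. Maybe that changes things?
@bdg67 You don't need to make binary outcome predictions. If you just want the AUC of the model, you can remove the "round". I have updated my answer. Apologies if I led you astray.
That's much better! Now my AUC is 0.7951. Is this the same as the c-statistic? Also, I've run into another problem that you may be able to shed some light on. Using this for another variable (mortality): when I get to the roc(outcome, predictions) step, R gives me an error saying "Response and predictor must be vectors of the same length", even though they are both categorical. Any idea how to fix this?
@bdg67 Yes, the c-statistic is the same as the AUC. In your mortality data, you need a prediction for each patient and a measured outcome for that same patient, so the two vectors must be the same length: one prediction and one outcome per patient.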
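One common way to end up with vectors of different lengths is missing data: by default glm drops rows that have an NA in any model variable, so predict returns fewer values than the raw outcome column. The sketch below assumes that is the cause and uses made-up data (`toy_mort`, `died`, and `htn` are hypothetical names). It takes the outcome from the fitted model itself, so it lines up with the predictions.

```r
# Hypothetical data with two missing predictor values
set.seed(1)
toy_mort <- data.frame(died = rbinom(50, 1, 0.4),
                       htn  = rbinom(50, 1, 0.5))
toy_mort$htn[c(3, 10)] <- NA

fit <- glm(died ~ htn, data = toy_mort, family = binomial())

length(toy_mort$died)                     # 50 rows in the raw data
length(predict(fit, type = "response"))   # 48: rows with NA were dropped

# Take both vectors from the fitted model so they refer to the same patients
used_outcome <- fit$y         # outcomes for the rows actually used in the fit
used_pred    <- fitted(fit)   # fitted probabilities for those same rows

length(used_outcome) == length(used_pred)

# These can then be passed on, e.g. pROC::roc(used_outcome, used_pred)
```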
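Yes, I had to fix some missing values. One last thing: how can I find the 95% CI and odds ratios for the logistic regression you did? When I run "exp(cbind(OR = coef(NeuroLogit2), confint(NeuroLogit2)))" (for your data, NeuroLogit2 would be my_mod), I keep getting weird values with an e in them. Any ideas?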
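The values with an e in them are presumably R's scientific notation (for example, 1.5e-03 means 0.0015) rather than an error in the calculation. A minimal sketch on made-up data (`toy_or` and `toy_or_mod` are hypothetical names) that builds the same kind of odds-ratio table and prints it in fixed notation follows. To keep the sketch dependency-free it uses Wald intervals from confint.default; confint, as in the comment above, gives profile-likelihood intervals and can be swapped in.

```r
# Hypothetical data: one binary predictor and a binary outcome
set.seed(1)
toy_or   <- data.frame(x = rbinom(200, 1, 0.5))
toy_or$y <- rbinom(200, 1, plogis(-0.5 + 1 * toy_or$x))

toy_or_mod <- glm(y ~ x, data = toy_or, family = binomial())

# Odds ratios with 95% Wald confidence intervals (exponentiated coefficients)
or_table <- exp(cbind(OR = coef(toy_or_mod), confint.default(toy_or_mod)))

# Two ways to avoid scientific notation when displaying the table
round(or_table, 3)
format(or_table, scientific = FALSE, digits = 3)
```

Setting options(scipen = 999) has a similar effect for the rest of the session. If an odds ratio or interval bound is still enormous after reformatting, that may point to very sparse categories in the data, which is worth checking separately.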