【Title】: What is the true positive rate when the false positive rate is 0.5 in the model?
【Posted】: 2019-04-07 19:15:56
【Question】:

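I am trying to understand how to calculate the true positive rate when the false positive rate (FPR) of the model is 0.5, and then generate the ROC curve. But I am clearly having some trouble with the code...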

> library(nycflights13)
> late_arrival<- flights$arr_delay>50
> summary(late_arrival)
   Mode   FALSE    TRUE    NA's 
logical  275847   51499    9430 
> late_arrival.lr <- glm(late_arrival~carrier+dep_delay+month+year, data=flights, family='binomial')

Warning message: glm.fit: fitted probabilities numerically 0 or 1 occurred

> summary(late_arrival.lr)
Call:
glm(formula = late_arrival ~ carrier + dep_delay + month + year, 
    family = "binomial", data = flights)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-3.0972  -0.2445  -0.1920  -0.1570   3.9217  

Coefficients: (1 not defined because of singularities)
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -3.9122786  0.0430834 -90.807  < 2e-16 ***
carrierAA    0.2174443  0.0485813   4.476 7.61e-06 ***
carrierAS   -0.3549507  0.2540636  -1.397  0.16239    
carrierB6    0.5142442  0.0428985  11.987  < 2e-16 ***
carrierDL    0.2228855  0.0449833   4.955 7.24e-07 ***
carrierEV    0.3230899  0.0431394   7.489 6.92e-14 ***
carrierF9    1.1544420  0.1444764   7.991 1.34e-15 ***
carrierFL    0.7190162  0.0812251   8.852  < 2e-16 ***
carrierHA   -0.2276957  0.4115495  -0.553  0.58008    
carrierMQ    0.8086500  0.0475393  17.010  < 2e-16 ***
carrierOO    1.0138755  0.9037621   1.122  0.26193    
carrierUA    0.0919203  0.0431571   2.130  0.03318 *  
carrierUS    0.6063731  0.0525429  11.541  < 2e-16 ***
carrierVX   -0.0485832  0.0852892  -0.570  0.56893    
carrierWN   -0.1551747  0.0574042  -2.703  0.00687 ** 
carrierYV    0.5737826  0.1999578   2.870  0.00411 ** 
dep_delay    0.1000536  0.0004308 232.263  < 2e-16 ***
month        0.0009126  0.0024337   0.375  0.70767    
year                NA         NA      NA       NA    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 284924  on 327345  degrees of freedom
Residual deviance: 108708  on 327328  degrees of freedom
AIC: 108744

Number of Fisher Scoring iterations: 7

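It keeps showing me this warning: (Dispersion parameter for binomial family taken to be 1)

How do I actually get predictions from here? I know I somehow have to produce the predicted and actual values in order to get the true positive rate. Can anyone guide me? Thank you very much!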

【Comments】:

    Tags: r roc


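    【Solution 1】:

    Drop year from your model, since it has no variation, refit the model, and then pass flights as the newdata argument to the model's predict() method.

    For example, using the terminology and abbreviations from the Wikipedia page on ROC: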

    library(nycflights13)
    
    late_arrival<- flights$arr_delay>50
    late_arrival.lr <- glm(late_arrival~carrier+dep_delay+month, data=flights, family='binomial')
    #> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
    fit <- predict(late_arrival.lr, newdata = flights, type = "response")
    
    d <- data.frame(late_arrival, fit)
    
    # "Confusion matrix" of actual vs predicted outcomes
    # for a cutpoint of 0.5:
    xtabs(~ late_arrival + I(fit > 0.5), data = d)
    #>             I(fit > 0.5)
    #> late_arrival  FALSE   TRUE
    #>        FALSE 290637   3091
    #>        TRUE    7386  26232
    
    # Now do this for a range of cutpoints.
    # Sensitivity = true positive rate = TPR
    # Specificity = true negative rate = TNR
    # 1 - Specificity = false positive rate = FPR = 1 - TNR
    # The ROC plot is
    #     x = 1 - Specificity = FPR
    #     y = Sensitivity     = TPR
    
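    fun <- function(cutpoint) {
        # cm = "confusion matrix"; rows are actual, columns are predicted.
        # Rows with missing values are dropped by xtabs().
        cm <- xtabs(~ late_arrival + I(fit > cutpoint), data = d)
        # as.list() walks the 2x2 table column by column, hence this order.
        # This assumes both predicted classes occur at the given cutpoint.
        cm <- as.list(cm)
        names(cm) <- c("TN", "FN", "FP", "TP")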
        sens <- with(cm, TP / (TP + FN))
        spec <- with(cm, TN / (TN + FP))
        return(data.frame(cutpoint, sens, spec))
    }
    
    # Example output:
    fun(0.5)
    #>   cutpoint      sens      spec
    #> 1      0.5 0.7802963 0.9894767
    
    cutpoints <- seq(0.02, 0.98, by = 0.02)
    # This does
    # rbind(fun(cutpoints[1]), fun(cutpoints[2]), ...)
    roc <- do.call(rbind, lapply(cutpoints, fun))
    plot(1 - roc$spec, roc$sens, type = "b",
         xlab = "False positive rate (1 - specificity)", 
         ylab = "True positive rate (sensitivity)",
         xlim = c(0, 1),
         ylim = c(0, 1))
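
fun() above assumes that both predicted classes occur at every cutpoint, which can fail for extreme cutpoints. A minimal sketch of one way around that, using explicit factor levels so the table is always 2 x 2; `confusion_counts` is a hypothetical helper name, not part of any package:

```r
# Hypothetical helper: count TN/FN/FP/TP even when a class is absent.
confusion_counts <- function(actual, predicted) {
  # Fixing the levels keeps empty cells in the table as zeros.
  actual    <- factor(actual,    levels = c(FALSE, TRUE))
  predicted <- factor(predicted, levels = c(FALSE, TRUE))
  tab <- table(actual, predicted)   # NA values are dropped by default
  c(TN = tab["FALSE", "FALSE"], FN = tab["TRUE", "FALSE"],
    FP = tab["FALSE", "TRUE"],  TP = tab["TRUE", "TRUE"])
}

# Toy data: every prediction is FALSE, so FP and TP are zero.
cc <- confusion_counts(actual    = c(TRUE, TRUE, FALSE, FALSE),
                       predicted = c(FALSE, FALSE, FALSE, FALSE))
cc
```

With the objects from the answer, this could be called as confusion_counts(d$late_arrival, d$fit > cutpoint).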
    

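    Created on 2019-04-07 by the reprex package (v0.2.1.9000)

    Note that a few issues need to be addressed before answering your main question:

    The effect of year in your example is estimated as NA because that variable has no variation, so its effect cannot be estimated.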
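
For the specific question (the TPR at an FPR of 0.5), one possible approach is to use every distinct fitted value as a cutpoint and interpolate along the resulting ROC points. The following is only a sketch on simulated data; `roc_points` and `tpr_at_fpr` are hypothetical helper names, and an observation counts as predicted positive when its score is at or above the cutpoint:

```r
# Hypothetical helpers (not from any package).
roc_points <- function(score, label) {
  ok <- !is.na(score) & !is.na(label)        # drop incomplete rows
  score <- score[ok]
  label <- as.logical(label[ok])
  ord   <- order(score, decreasing = TRUE)   # highest score first
  score <- score[ord]
  label <- label[ord]
  tp <- cumsum(label)                        # true positives so far
  fp <- cumsum(!label)                       # false positives so far
  # Keep one point per distinct score so ties act as a single cutpoint.
  keep <- !duplicated(score, fromLast = TRUE)
  data.frame(fpr = c(0, fp[keep] / sum(!label)),
             tpr = c(0, tp[keep] / sum(label)))
}

tpr_at_fpr <- function(roc, target) {
  # Linear interpolation; where several points share an FPR, use the
  # highest TPR among them.
  approx(roc$fpr, roc$tpr, xout = target, ties = max)$y
}

# Simulated example (assumed data, not the flights model).
set.seed(1)
x   <- rnorm(1000)
y   <- rbinom(1000, 1, plogis(2 * x))
fit <- fitted(glm(y ~ x, family = binomial))
roc <- roc_points(fit, y)
tpr_at_fpr(roc, 0.5)
plot(roc$fpr, roc$tpr, type = "l",
     xlab = "False positive rate", ylab = "True positive rate")
```

With the objects from the answer, the equivalent call would be tpr_at_fpr(roc_points(d$fit, d$late_arrival), 0.5).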

    > unique(flights$year)
    [1] 2013
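
To spot such predictors before fitting, one option is to count the distinct non-missing values in each column. A minimal sketch on a toy data frame; `constant_cols` is a hypothetical helper name:

```r
# Hypothetical helper: names of columns with at most one distinct value.
constant_cols <- function(df) {
  n_distinct <- vapply(df,
                       function(col) length(unique(col[!is.na(col)])),
                       integer(1))
  names(df)[n_distinct <= 1]
}

# Toy data frame standing in for flights (assumed values).
toy <- data.frame(year    = rep(2013, 5),
                  month   = c(1, 2, 2, 3, 4),
                  carrier = c("AA", "UA", "AA", "DL", "UA"))
constant_cols(toy)
```

Any column reported here cannot have its effect estimated and can be left out of the model formula.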
    

    If you drop this predictor and refit, the output makes sense (meaning no NAs or huge standard errors):

    > late_arrival.lr <- glm(late_arrival~carrier+dep_delay+month, data=flights, family='binomial')
    Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
    > coef(summary(late_arrival.lr))
                    Estimate   Std. Error     z value     Pr(>|z|)
    (Intercept) -5.325540101 0.0564526220 -94.3364527 0.000000e+00
    carrierAA    0.335139676 0.0622536491   5.3834543 7.306979e-08
    carrierAS   -0.980666348 0.3701250164  -2.6495544 8.059801e-03
    carrierB6    0.524971196 0.0542918253   9.6694335 4.066226e-22
    carrierDL    0.406813418 0.0576767561   7.0533339 1.746810e-12
    carrierEV    0.350366432 0.0535144496   6.5471370 5.865056e-11
    carrierF9    0.776012126 0.2084826127   3.7221911 1.975015e-04
    carrierFL    0.773647203 0.1077982499   7.1768067 7.135846e-13
    carrierHA   -2.225896541 0.8684691013  -2.5630118 1.037685e-02
    carrierMQ    0.847415433 0.0601677914  14.0842037 4.749822e-45
    carrierOO    0.232324503 1.3043323784   0.1781176 8.586307e-01
    carrierUA    0.157191477 0.0549977051   2.8581461 4.261241e-03
    carrierUS    0.649304471 0.0697493204   9.3091154 1.289014e-20
    carrierVX    0.237994726 0.1131585684   2.1031967 3.544858e-02
    carrierWN    0.032542799 0.0736491439   0.4418626 6.585887e-01
    carrierYV    0.861814625 0.2373042135   3.6316870 2.815745e-04
    dep_delay    0.089655081 0.0004428296 202.4595603 0.000000e+00
    month        0.005089147 0.0032449949   1.5683066 1.168096e-01
    

    The warning fitted probabilities numerically 0 or 1 occurred usually means that the outcome can be perfectly predicted by one of your continuous predictors. For example:

    > x <- c(1, 2, 3)
    > y <- c(0, 0, 1)
    > coef(summary(glm(y ~ x, family="binomial")))
    Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
                  Estimate Std. Error       z value  Pr(>|z|)
    (Intercept) -115.57626  226884.08 -0.0005094067 0.9995936
    x             46.34447   94156.73  0.0004922056 0.9996073
    

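    Here, the best estimate is

    P(y = 1) = 0 if x < c, and 1 if x > c, for some threshold c with 2 < c < 3

    but this raises two numerical problems:

    • The usual sigmoid curve of P(y = 1) against x should now be a step function. That requires an infinitely steep sigmoid, so the "slope" with respect to x tends to infinity.
    • Any threshold between 2 and 3 works equally well, so the best estimate of the intercept cannot be determined.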
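
The first point can be illustrated numerically. In this sketch the threshold is fixed at 2.5 (an arbitrary choice; any value between 2 and 3 would behave the same way) and the log-likelihood is evaluated for increasingly steep slopes. It keeps improving and approaches 0 without ever reaching a maximum, which is why the fitting algorithm cannot settle on a finite slope:

```r
x <- c(1, 2, 3)
y <- c(0, 0, 1)

# Log-likelihood of a logistic model whose curve crosses 0.5 at `threshold`.
loglik <- function(slope, threshold = 2.5) {
  p <- plogis(slope * (x - threshold))
  sum(dbinom(y, size = 1, prob = p, log = TRUE))
}

slopes <- c(1, 5, 10, 20)
ll <- vapply(slopes, loglik, numeric(1))
data.frame(slope = slopes, loglik = ll)
```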

    In the case of flights, however, I think the warning means just what it says: some predictions are so certain that any nuance is lost in rounding error.



    To check whether late_arrival really can be perfectly predicted by a single x variable, I used the following code:

    # Make warnings print as they appear.
    # options() returns the previous settings, and we store it
    warn <- options(warn = 1)$warn
    for (i in c("carrier", "dep_delay", "month", "year")) {
      print(i)
      glm(late_arrival~flights[[i]], family='binomial')
    }
    # Restore the previous warning setting
    options(warn = warn)
    

    which printed

    [1] "carrier"
    [1] "dep_delay"
    Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
    [1] "month"
    [1] "year"
    

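    But plot(flights$dep_delay, late_arrival) (which takes a few seconds) shows that there is in fact no complete separation, that is, it is not the case that all late arrivals occur above some dep_delay threshold.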
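
The same check can be made without plotting by comparing the range of the predictor in the two outcome groups. A minimal sketch, assuming "complete separation" means a single threshold splits the groups; `is_separated` is a hypothetical helper name:

```r
# Hypothetical helper: TRUE if one threshold on x splits the two classes.
is_separated <- function(x, y) {
  ok <- !is.na(x) & !is.na(y)     # ignore incomplete pairs
  x  <- x[ok]
  y  <- as.logical(y[ok])
  max(x[!y]) < min(x[y]) || max(x[y]) < min(x[!y])
}

sep_toy     <- is_separated(c(1, 2, 3), c(0, 0, 1))        # x/y toy data
overlap_toy <- is_separated(c(1, 2, 3, 4), c(0, 1, 0, 1))  # overlapping groups
c(sep_toy, overlap_toy)
```

For the flights data the call would be is_separated(flights$dep_delay, late_arrival).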

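    【Comments】:

    • Thank you very much. But from this data I need to find the TPR when the FPR is 0.5 for the model late_arrival~carrier+dep_delay+month, and also to generate the ROC curve, not just plot the model. Could you give me a hint?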