Wrong output from linear model summary table

【Question title】: Wrong output from linear model summary table
【Posted】: 2021-03-06 08:22:05
【Question】:

Suppose I want to fit a linear regression model to the mtcars dataset:

library(ggplot2)
library(ggpmisc)

mtcars
linear_model = y~x

ggplot(mtcars, aes(disp, drat)) +
  geom_point() +
  geom_smooth(method = "lm",formula= linear_model) +
  scale_x_continuous(trans = "log10") +
  scale_y_continuous(trans = "log10") +
  theme_bw()+
  facet_wrap(~cyl) +
  stat_poly_eq(
    aes(label = paste(stat(adj.rr.label), stat(eq.label),sep = "*\",    \"*")),
    formula = linear_model, rr.digits = 2, parse = TRUE,size=3)

Now I want to summarize the fitted quantities in a table; I am particularly interested in the slope. I tried the following:

library(tidyverse)  # provides %>%, nest_by(), map_dbl(), unnest()

table_mtcars <- mtcars %>%
  nest_by(cyl) %>% 
  summarise(mdl = list(lm(log10(disp) ~ log10(drat), data)), .groups = "drop") %>% 
  mutate(adjrsquared = map_dbl(mdl, ~summary(.)$adj.r.squared ),
         mdl = map(mdl, broom::tidy)) %>% 
  unnest(mdl)%>%
  filter(term=="log10(drat)")

This works fine when the data are not log-transformed, but when they are, the estimates in the table are wrong. Does anyone know why?

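【Question comments】:

Please keep in mind the "minimal" part of the minimal reproducible example guidelines. If this question is not about plotting, we don't need 10 lines of plotting code; focus instead on the problem of computing the model coefficients, including the difference you describe between transformed and untransformed data.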

Tags: r tidyverse linear-regression

【Solution 1】:

The broom package and its tidy and glance functions may be useful here:

library(tidyverse)
library(broom)

dat = mtcars %>% 
  nest_by(cyl) %>%
  mutate(model = list(lm(log10(disp) ~ log10(drat), data)),
         coefficients = list(tidy(model)),
         statistics = list(glance(model)))

coefficients = dat %>% unnest(coefficients)
statistics = dat %>% unnest(statistics)

coefficients
#> # A tibble: 6 x 9
#> # Groups:   cyl [3]
#>     cyl        data model  term  estimate std.error statistic p.value statistics
#>   <dbl> <list<tbl_> <list> <chr>    <dbl>     <dbl>     <dbl>   <dbl> <list>    
#> 1     4   [11 × 10] <lm>   (Int…    2.97      0.524     5.66  3.10e-4 <tibble […
#> 2     4   [11 × 10] <lm>   log1…   -1.57      0.860    -1.83  1.01e-1 <tibble […
#> 3     6    [7 × 10] <lm>   (Int…    2.93      0.206    14.2   3.12e-5 <tibble […
#> 4     6    [7 × 10] <lm>   log1…   -1.22      0.372    -3.28  2.20e-2 <tibble […
#> 5     8   [14 × 10] <lm>   (Int…    2.59      0.255    10.2   3.00e-7 <tibble […
#> 6     8   [14 × 10] <lm>   log1…   -0.102     0.501    -0.203 8.43e-1 <tibble […

statistics
#> # A tibble: 3 x 16
#> # Groups:   cyl [3]
#>     cyl      data model coefficients r.squared adj.r.squared  sigma statistic
#>   <dbl> <list<tb> <lis> <list>           <dbl>         <dbl>  <dbl>     <dbl>
#> 1     4 [11 × 10] <lm>  <tibble [2 …   0.271          0.190  0.102     3.35  
#> 2     6  [7 × 10] <lm>  <tibble [2 …   0.682          0.619  0.0562   10.7   
#> 3     8 [14 × 10] <lm>  <tibble [2 …   0.00341       -0.0796 0.0846    0.0410
#> # … with 8 more variables: p.value <dbl>, df <dbl>, logLik <dbl>, AIC <dbl>,
#> #   BIC <dbl>, deviance <dbl>, df.residual <int>, nobs <int>

Slopes only:

coefficients %>% 
  filter(term == "log10(drat)") %>%
  select(cyl, term, estimate, p.value)
#> # A tibble: 3 x 4
#> # Groups:   cyl [3]
#>     cyl term        estimate p.value
#>   <dbl> <chr>          <dbl>   <dbl>
#> 1     4 log10(drat)   -1.57   0.101 
#> 2     6 log10(drat)   -1.22   0.0220
#> 3     8 log10(drat)   -0.102  0.843
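
As a cross-check that does not depend on the tidyverse, the same per-group slopes can be computed with base R. This is only a minimal sketch; the names by_cyl and slopes are illustrative and do not come from the answer.

```r
# Split mtcars into one data frame per number of cylinders.
by_cyl <- split(mtcars, mtcars$cyl)

# Fit the same log-log model in each group and keep only the slope,
# i.e. the second coefficient (the first one is the intercept).
slopes <- sapply(by_cyl, function(d) {
  unname(coef(lm(log10(disp) ~ log10(drat), data = d))[2])
})

round(slopes, 3)
```

The values should agree with the estimate column of the table above, which confirms that the table itself is computed correctly for the model it specifies.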

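Edit: Regarding your comments, I now see that your two code blocks are doing different things. In your ggplot2 code, you estimate a linear model and then change the axes of the plot. In the second part, you log the variables and then estimate the linear model. The first is a pure linear model where you only change the graphical representation. The second is a "lin-log model".

Hopefully this plot helps you see the difference: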

dat <- mtcars

mod_lin <- lm(mpg ~ hp, dat)
mod_log <- lm(mpg ~ log10(hp), dat)
dat$pred_lin <- predict(mod_lin)
dat$pred_log <- predict(mod_log)

par(mfrow=c(2,2))
with(dat, plot(hp, pred_lin,
     main="lin model; lin axis"))
with(dat, plot(hp, pred_lin, log="x",
     main="lin model; log axis"))
with(dat, plot(hp, pred_log,
     main="log model; lin axis"))
with(dat, plot(hp, pred_log, log="x",
     main="log model; log axis"))
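
The same distinction can also be checked numerically. The sketch below is only an illustration: it refits the two models from the code above, and the object names (d, lin_vs_hp and so on) are made up for this example. Predictions from the linear model fall on a straight line in hp, while predictions from the model with log10(hp) fall on a straight line only against log10(hp). Changing the axis of a base R plot alters neither set of predictions.

```r
d <- mtcars
mod_lin <- lm(mpg ~ hp, d)
mod_log <- lm(mpg ~ log10(hp), d)

pred_lin <- predict(mod_lin)
pred_log <- predict(mod_log)

# A correlation of exactly -1 or 1 means the points lie on a straight line.
lin_vs_hp    <- cor(d$hp, pred_lin)         # linear model against hp
log_vs_hp    <- cor(d$hp, pred_log)         # log model against hp
log_vs_loghp <- cor(log10(d$hp), pred_log)  # log model against log10(hp)

round(c(lin_vs_hp = lin_vs_hp,
        log_vs_hp = log_vs_hp,
        log_vs_loghp = log_vs_loghp), 4)
```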

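【Discussion】:

Thanks for your comment, @Vincent! Your code actually gives the same results as mine, but your solution is more elegant. I did phrase my question poorly; the problem is that the estimates (slope values) in the table (-1.57, -1.22, -0.10) do not match the slopes I get from the statistics shown in the plot: y = 0.954 - 0.172x, y = 1.81 - 0.56x, y = 0.592 - 0.0336x. I am not sure which slope values are correct.
Thanks for the updated answer, @Vincent! Suppose I use your code to extract the estimates from lm but without log10, e.g. ... mutate(model = list(lm(disp ~ drat, data)) ... This gives the following estimates in the coefficient table: -36.79921, -72.55967, -16.77288. If I draw the ggplot2 plot without the log transformation, I get the following equations: y = 4.79 - 0.00681x, y = 5.33 - 0.00952x, y = 3.41 - 0.000506x. Shouldn't they be the same?
The untransformed ggplot2 would be: linear_model = y~x ggplot(mtcars, aes(disp, drat)) + geom_point() + geom_smooth(method = "lm",formula= linear_model) + facet_wrap(~cyl) + stat_poly_eq( aes(label = paste(stat(adj.rr.label), stat(eq.label),sep = "*\", \"*")), formula = linear_model, rr.digits = 2, parse = TRUE,size=3)
The problem seems to be in the stat_poly_eq function from the ggpmisc package. Note that if you draw exactly the same plot but comment out the scale_*_continuous lines, you get different estimates. This is very surprising behaviour. Why would changing the scale of the graphical representation change the underlying model? My hunch is that this is a bug, but I don't know ggpmisc well enough to be sure.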
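
One way to find out which set of slopes is "right" is to refit the models by hand in both directions. The sketch below rests on two assumptions that nobody in the thread states. First, aes(disp, drat) maps disp to x and drat to y, so formula = y ~ x in the plot regresses drat on disp, whereas the tables regress disp on drat. Second, ggplot2 applies scale transformations such as scale_x_continuous(trans = "log10") to the data before the statistics are computed, so the model in the log-scaled plot is fitted to log10 values. The helper slopes_by_cyl() is hypothetical and exists only for this illustration.

```r
# Hypothetical helper: fit `formula` separately for each cyl group of
# mtcars and return the slope (second coefficient) of each fit.
slopes_by_cyl <- function(formula) {
  sapply(split(mtcars, mtcars$cyl), function(d) {
    unname(coef(lm(formula, data = d))[2])
  })
}

comparison <- data.frame(
  cyl       = sort(unique(mtcars$cyl)),
  # direction used in the summary tables: disp is the response
  table_log = slopes_by_cyl(log10(disp) ~ log10(drat)),
  table_raw = slopes_by_cyl(disp ~ drat),
  # direction implied by aes(disp, drat) with formula = y ~ x
  plot_log  = slopes_by_cyl(log10(drat) ~ log10(disp)),
  plot_raw  = slopes_by_cyl(drat ~ disp)
)

signif(comparison, 3)
```

If the assumptions hold, plot_log and plot_raw should come out close to the slopes printed on the two plots (-0.172, -0.56, -0.0336 and -0.00681, -0.00952, -0.000506), and table_log and table_raw close to the values in the coefficient tables (-1.57, -1.22, -0.102 and -36.8, -72.6, -16.8). That would suggest neither result is wrong: the plot and the table fit the regression in opposite directions, and swapping the two variables in lm() or in aes() should make them agree.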