使用效果编码重新调整因子和 glm答案

【问题标题】：Relevel factor and glm with effect coding使用效果编码重新调整因子和 glm
【发布时间】：2020-02-24 14:18:15
【问题描述】：

我无法理解 glm 的效果编码。举个例子：

data('mpg')
mpg$trans = as.factor(mpg$trans)
levels(mpg$trans)
[1] "auto(av)"   "auto(l3)"   "auto(l4)"   "auto(l5)"   "auto(l6)"   "auto(s4)"   "auto(s5)"   "auto(s6)"   "manual(m5)" "manual(m6)"

对于效果（或虚拟）编码，glm 将第一级作为参考级，因此在这种情况下它将是“auto(av)”。

mpg_glm = glm(hwy ~ trans, data = mpg, contrasts = list(trans = contr.sum))
summary(mpg_glm)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  24.4173     0.7318  33.366  < 2e-16 ***
trans1        3.3827     2.3592   1.434 0.153017    
trans2        2.5827     3.6210   0.713 0.476426    
trans3       -2.4534     0.9157  -2.679 0.007928 ** 
trans4       -3.6993     1.0865  -3.405 0.000784 ***
trans5       -4.4173     2.1743  -2.032 0.043375 *  
trans6        1.2494     2.9866   0.418 0.676105    
trans7        0.9160     2.9866   0.307 0.759341    
trans8        0.7702     1.4517   0.531 0.596262    
trans9        1.8758     0.9845   1.905 0.058011 .

所以我现在认为 trans1 实际上是第二级别（“auto(l3)”），因为第一个是参考级别。为了测试这一点，我重新调整了因子，以便看到第一级的系数（“auto(av)”）：

mpg$trans <- relevel(mpg$trans, ref="auto(l3)")
levels(mpg$trans)
"auto(l3)"   "auto(av)"   "auto(l4)"   "auto(l5)"   "auto(l6)"   "auto(s4)"   "auto(s5)"   "auto(s6)"   "manual(m5)" "manual(m6)"

现在我期待看到第一级的系数和第二级的系数不见了（因为现在是参考水平）：

mpg_glm = glm(hwy ~ trans, data = mpg, contrasts = list(trans = contr.sum))
summary(mpg_glm)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  24.4173     0.7318  33.366  < 2e-16 ***
trans1        2.5827     3.6210   0.713 0.476426    
trans2        3.3827     2.3592   1.434 0.153017    
trans3       -2.4534     0.9157  -2.679 0.007928 ** 
trans4       -3.6993     1.0865  -3.405 0.000784 ***
trans5       -4.4173     2.1743  -2.032 0.043375 *  
trans6        1.2494     2.9866   0.418 0.676105    
trans7        0.9160     2.9866   0.307 0.759341    
trans8        0.7702     1.4517   0.531 0.596262    
trans9        1.8758     0.9845   1.905 0.058011 .

我看到唯一改变的是前2个系数被切换了，那么以哪个级别作为参考？我在这里错过了什么？

【问题讨论】：

标签： r glm dummy-variable coefficients

【解决方案1】：

您正在使用contr.sum，其中所有级别都与最后一个级别进行比较，并且添加了所有系数（截距除外）总和为零的约束。

你可以在 mpg_glm 里面查看：

mpg_glm = glm(hwy ~ trans, data = mpg, contrasts = list(trans = contr.sum))
mpg_glm$contrasts

           [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
auto(av)      1    0    0    0    0    0    0    0
auto(l3)      0    1    0    0    0    0    0    0
auto(l4)      0    0    1    0    0    0    0    0
auto(l5)      0    0    0    1    0    0    0    0
auto(l6)      0    0    0    0    1    0    0    0
auto(s4)      0    0    0    0    0    1    0    0
auto(s5)      0    0    0    0    0    0    1    0
auto(s6)      0    0    0    0    0    0    0    1
manual(m5)    0    0    0    0    0    0    0    0
manual(m6)   -1   -1   -1   -1   -1   -1   -1   -1
           [,9]
auto(av)      0
auto(l3)      0
auto(l4)      0
auto(l5)      0
auto(l6)      0
auto(s4)      0
auto(s5)      0
auto(s6)      0
manual(m5)    1
manual(m6)   -1

这里有 9 列，它们是模型中的 9 个非截距系数。由于总和为零的约束，它们或多或少是每个级别减去最后一个级别的平均值。最后一层是多余的，这里没有展示：

var_means = with(mpg,tapply(hwy,trans,mean))
cbind(delta_mean = var_means[-length(var_means)]-var_means[length(var_means)],
coef = coef(mpg_glm)[-1])

           delta_mean       coef
auto(av)    3.5894737  3.3827066
auto(l3)    2.7894737  2.5827066
auto(l4)   -2.2466709 -2.4534380
auto(l5)   -3.4925776 -3.6993447
auto(l6)   -4.2105263 -4.4172934
auto(s4)    1.4561404  1.2493733
auto(s5)    1.1228070  0.9160399
auto(s6)    0.9769737  0.7702066
manual(m5)  2.0825771  1.8758101

因此，当您更改第一个级别时，您只会更改它们出现的顺序，但值不会更改。您可以轻松检查对比度：

mpg$trans <- relevel(mpg$trans, ref="auto(l3)")
new_glm = glm(hwy ~ trans, data = mpg, contrasts = list(trans = contr.sum))    
new_glm$contrasts
$trans
           [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
auto(l3)      1    0    0    0    0    0    0    0
auto(av)      0    1    0    0    0    0    0    0
auto(l4)      0    0    1    0    0    0    0    0
auto(l5)      0    0    0    1    0    0    0    0
auto(l6)      0    0    0    0    1    0    0    0
auto(s4)      0    0    0    0    0    1    0    0
auto(s5)      0    0    0    0    0    0    1    0
auto(s6)      0    0    0    0    0    0    0    1
manual(m5)    0    0    0    0    0    0    0    0
manual(m6)   -1   -1   -1   -1   -1   -1   -1   -1
           [,9]
auto(l3)      0
auto(av)      0
auto(l4)      0
auto(l5)      0
auto(l6)      0
auto(s4)      0
auto(s5)      0
auto(s6)      0
manual(m5)    1
manual(m6)   -1

对于您所描述的，作为更改参考并具有与此参考相关的其他级别，它应该是contr.treatment。

【讨论】：

感谢您的回答！我现在明白最后一个级别是参考级别。如何更改参考水平，以便我也可以看到最后一个水平的差异和这个因素的总平均值？
不太清楚你的意思。你指的是最后一个没有显示的系数吗？
是的，但我现在看到我必须更改级别的顺序，例如 mpg$trans = factor(mpg$trans, levels = c(""....etc))。然后我可以看到另一个系数:)
是的，这是一种方式。所以你设法用适当的水平得到你需要的东西？最后一个系数，如果你还记得的话，所有的非截距系数加起来为零。所以你做所有系数的总和，然后乘以 -1