对列中的每个值执行泊松回归答案

【问题标题】：Perform poisson regression for each value in column对列中的每个值执行泊松回归
【发布时间】：2015-11-21 20:44:48
【问题描述】：

我有一个长格式的数据框，我正在对它执行泊松回归。

 'data.frame':  20000 obs. of  6 variables:
 $ cal_y  : int  2008 2008 2008 2008 2008 ...
 $ age_y  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ gender : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ cal_m  : int  9 7 8 1 6 11 2 10 3 4 ...
 $ n_outcome: int  276 187 164 352 229 250 332 267 348 291 ...
 $ n_atrisk : int  4645 4645 4645 4645 4645 4645 4645 4645 4645 4645 ...

glm(n_outcome ~ factor(cal_y) + factor(cal_m) + gender + offset(log(n_atrisk)),
data = df, family =poisson)

我想知道每个 age_y 的结果 n_outcome 的曝光 cal_y 系数，并且最好能够将这些信息汇总到一个 df 中。

我已经尝试了几个错误的 lapply() 和 tapply() 版本。目前，我最好的解决方案是手动完成：

glm(n_outcome ~ factor(cal_y) + factor(cal_m) + gender + offset(log(n_atrisk)),
data = filter(df, age_y >= 0, age_y <1), family =poisson)

但这既乏味（范围（age_y）= 0 105），结果也不容易合并到新的df中，我不确定在执行回归之前对数据进行子集化在统计上是否正确。

感谢任何指针、cmets 或帮助。

【问题讨论】：

您究竟为什么要进行子集回归而不是单次回归？
@Alex 我正在尝试确定曝光对计数结果 (n_outcome) 的影响，在本例中为日历年 (cal_y)。我都有理由相信，效果会因年龄（age_y）而有很大差异，而且这种变化的程度对于我的研究方向非常有趣。更准确地说，我想知道每个age_y 的相对风险。如果有更好的方法，我非常愿意接受建议
是的，但是使用 subsets 的原因是什么？ IE- 你为什么不把age_y 作为一个自变量呢？
@Alex 如果我将 age_y 作为自变量包含在内，系数输出描述了 age_y = (1, 2, 3, ... , 100) 与 age_y = 0 相比的 n_outcome 差异。我会想知道每个age_y 的cal_y 的n_outcome 差异系数。实际上，我会将 age_y 聚合到 age_brackets 中，但为简单起见，我忽略了在问题词干中提及这一点。
@Alex 这是一种有缺陷的思维方式吗？

标签： r poisson

【解决方案1】：

您可以使用 dplyr 和我的 broom 包来做到这一点：

library(dplyr)
library(broom)

results <- df %>%
  group_by(age_y) %>%
  do(tidy(glm(n_outcome ~ factor(cal_y) + factor(cal_m) + gender + offset(log(n_atrisk)),
              data = ., family =poisson)))

之所以有效，是因为group_by 和do 对每个age_y 值执行回归，然后tidy 将每个回归转换为可以重新组合的数据框。

请参阅broom and dplyr 小插图了解更多信息。

【讨论】：

【解决方案2】：

您在上面的 cmets 中描述的是，如果您使用效果编码与均值模型会发生什么。两者是等价的，只是表示对参数施加的不同约束（留一与总和为 0）。但它似乎在这里扭曲了你的想法。如果您使用age_y 作为分类编码，您将获得适当的输出

通过使用子集回归，您不会在每个模型中包含所有可用信息，这是统计上无效的方法。这也增加了第一类错误率。您应该使用模型中的所有可用信息。因此，这是正确的规范：

# this is the default way that R handles things, leaving one level out.
glm(n_outcome ~ factor(age_y) + factor(cal_y) + factor(cal_m) + gender + offset(log(n_atrisk)),
data = df, family= poisson(link = "log"))

# In contrast, this will provide an estimate for each level of
# factor(age_y) where the test-statistic is if the coefficient
# is statistically different from zero.
glm(n_outcome ~ 0 + factor(age_y) + factor(cal_y) + factor(cal_m) + gender + offset(log(n_atrisk)),
data = df, family= poisson(link = "log"))

上述内容是正确的，除非您预计每个age_y 对cal_y、gender 和/或n_atrisk 有不同的影响，您需要估计这些变量与@987654327 的相互作用@（仍然可以在单个模型中指定）。

如果您想测试age_y 的级别是否彼此不同，您可以测试它们的对比。

也就是说，使用(... 0 + factor(age_y) ...)（听起来像是您的偏好）为您提供age_y 相对于原假设的每个级别的信息，但它不提供age_y:level1 vs age_y:level2 之间的统计检验。要进行此测试，您需要测试它们的对比。

【讨论】：

感谢您的帮助。我看到了你所说的智慧，并将在接下来的几天里阅读效果编码与手段模型。我现在还没有假装理解这意味着什么。但是，测试您给我的代码时，我显然使用了不正确的编码类型，因为输出与我在上面的评论中描述的一样，即系数输出描述了 age_y = (0, 1, .. . , 100) 与 age_y = 0 相比。通过交互，它显示相同，除了因子（cal_y）。
这是R中的默认值。如果您使用glm(n_outcome ~ 0 + factor(age_y) ... )，则为age_y 的每个级别报告的统计数据将测试该系数在统计上是否不同于0。由于我不确定您到底想要什么，所以我没有将其放入我的解决方案中。我会更新的。