How to summarise multiple variables/items across age groups at once: answers

【Question title】: How to summarise multiple variables/items across age groups at once
【Posted】: 2021-06-28 07:26:21
【Question】:

I have a data frame with more than 150,000 entries. A sample looks like this:

ID <-    1111, 1222, 3333, 4444, 1555, 6666
V1 <-    1,     0,    1,    0,    0,     0
V2 <-    1,     0,    0,    0,    0,     1
V3 <-    0,     1,    1,    0,    0,     1
V4 <-    1,     0,    1,    1,    0,     0
AgeGr <- 15-24,24-35,15-24,35-48, 48+, 35-48

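All variables (V1-V4 in the sample) are dichotomous items answered 0/1. I want to summarise how often 0 and 1 occur for each variable within each age group. I expect output like this: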

Variable       V1      V2      V3      V4    # Variable names
Answer        0  1    0  1    0  1    0  1   # answer levels (1/0)
15-24         0  2    1  1    1  1    0  2   # the frequency of "0" and "1" under this age group
24-35         1  0    1  0    0  1    1  0   
35-48         2  0    1  1    1  1    1  1
48+           1  0    1  0    1  0    1  0

I tried janitor::tabyl with tabyl(df,AgeGr, V1). However, that summarises only V1. When I tried tabyl(df,AgeGr, df[,V1:V4]), it failed. Can I combine tabyl() with a function such as apply()? Or should I switch to something else?

Any suggestions would be appreciated. Thank you in advance :)

【Discussion】:

Can you share a reproducible example?

Tags: r apply summary

【Solution 1】:

You can do the following:

ID <-    c(1111, 1222, 3333, 4444, 1555, 6666)
V1 <-    c(1,     0,    1,    0,    0,     0)
V2 <-    c(1,     0,    0,    0,    0,     1)
V3 <-    c(0,     1,    1,    0,    0,     1)
V4 <-    c(1,     0,    1,    1,    0,     0)
AgeGr <- c("15-24","24-35","15-24","35-48", "48+", "35-48")

df <- data.frame(ID=ID,V1=V1,V2=V2,V3=V3,V4=V4,AgeGr = AgeGr, stringsAsFactors = FALSE)

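# Split the answer columns into one data frame per age group
ageAnswerSplit <- split(df[,c("V1","V2","V3","V4")],df[["AgeGr"]])

# For each age group, count the 0s and 1s in every column and bind the counts side by side
summarized <- do.call("rbind",lapply(ageAnswerSplit, function(answerdf) {
  answertables <- lapply(names(answerdf), function(nam) {
    at <- table(answerdf[[nam]])
    # at["0"] or at["1"] is NA when that answer never occurs in the group
    setNames(data.frame(unname(at["0"]),unname(at["1"])),paste0(nam,":",c(0,1)))
  })
  do.call("cbind",answertables)
}))
# Answers that never occur in a group come out as NA; count them as 0
summarized[is.na(summarized)] <- 0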

which results in

> summarized
      V1:0 V1:1 V2:0 V2:1 V3:0 V3:1 V4:0 V4:1
15-24    0    2    1    1    1    1    0    2
24-35    1    0    1    0    0    1    1    0
35-48    2    0    1    1    1    1    1    1
48+      1    0    1    0    1    0    1    0
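
As a variation on the same idea, here is a minimal base R sketch that cross-tabulates AgeGr against each variable directly. It assumes every answer column is coded strictly 0/1. Fixing the factor levels to 0 and 1 makes table() report a zero for answers that never occur, so no NA replacement is needed afterwards. The names vars, per_var and wide are illustrative and do not come from the answer above.

```r
# Sample data from the question
df <- data.frame(
  ID    = c(1111, 1222, 3333, 4444, 1555, 6666),
  V1    = c(1, 0, 1, 0, 0, 0),
  V2    = c(1, 0, 0, 0, 0, 1),
  V3    = c(0, 1, 1, 0, 0, 1),
  V4    = c(1, 0, 1, 1, 0, 0),
  AgeGr = c("15-24", "24-35", "15-24", "35-48", "48+", "35-48"),
  stringsAsFactors = FALSE
)

vars <- c("V1", "V2", "V3", "V4")  # assumption: the 0/1 answer columns

# One AgeGr x answer count matrix per variable
per_var <- lapply(vars, function(v) {
  # Fixed levels keep a column for 0 and for 1 even if one never occurs
  tab <- table(df$AgeGr, factor(df[[v]], levels = c(0, 1)))
  matrix(tab, nrow = nrow(tab),
         dimnames = list(rownames(tab), paste0(v, ":", colnames(tab))))
})

# Bind the per-variable matrices side by side: one row per age group
wide <- do.call(cbind, per_var)
wide
```

The result is an integer matrix rather than a data frame; wrap it in as.data.frame(wide) if a data frame is needed.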

【Discussion】:

Thank you very much, this looks like a nice solution :) It works well on my df.

【Solution 2】:

Here is a tidyverse option:

library(dplyr)
library(tidyr)

df %>%
  pivot_longer(cols = starts_with('V')) %>%
  count(AgeGr, name, value) %>%
  unite(col, name, value) %>%
  arrange(col) %>%
  pivot_wider(names_from = col, values_from = n, values_fill = 0)

#  AgeGr  V1_0  V1_1  V2_0  V2_1  V3_0  V3_1  V4_0  V4_1
#  <chr> <int> <int> <int> <int> <int> <int> <int> <int>
#1 24-35     1     0     1     0     0     1     1     0
#2 35-48     2     0     1     1     1     1     1     1
#3 48+       1     0     1     0     1     0     1     0
#4 15-24     0     2     1     1     1     1     0     2
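
In the output above, 15-24 ends up in the last row because the rows follow the order in which each age group first appears after arrange(col), and 15-24 has no V1_0 entry. If the rows should be ordered by age group, one possible variation (a sketch, assuming tidyr 1.1.0 or later for the names_sort argument) is to let pivot_wider() build and sort the column names and then sort the rows explicitly. The name wide is illustrative.

```r
library(dplyr)
library(tidyr)

# Sample data from the question
df <- data.frame(
  ID    = c(1111, 1222, 3333, 4444, 1555, 6666),
  V1    = c(1, 0, 1, 0, 0, 0),
  V2    = c(1, 0, 0, 0, 0, 1),
  V3    = c(0, 1, 1, 0, 0, 1),
  V4    = c(1, 0, 1, 1, 0, 0),
  AgeGr = c("15-24", "24-35", "15-24", "35-48", "48+", "35-48"),
  stringsAsFactors = FALSE
)

wide <- df %>%
  pivot_longer(cols = starts_with("V")) %>%   # long format: one row per ID and variable
  count(AgeGr, name, value) %>%               # frequency of each answer per group
  pivot_wider(
    names_from  = c(name, value),             # columns become V1_0, V1_1, ...
    values_from = n,
    values_fill = 0,                          # combinations that never occur count as 0
    names_sort  = TRUE                        # sort columns by variable, then answer
  ) %>%
  arrange(AgeGr)                              # rows in age-group order

wide
```

Note that a column such as V1_1 is only created if that answer occurs in at least one age group.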

【Discussion】: