Summarize by year and create a new variable containing a vector / list of unique values for each row

【Question title】: Summarize by year and create a new variable containing a vector / list of unique values for each row
【Posted】: 2021-11-15 16:26:40
【Question】:

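I have a dataset containing all natural disasters that occurred during a given period. I want to summarize them by year and state. When summarizing, I want to create a variable (d_disasters; it appears as u_disastertype in the expected result below) that shows the unique types of natural disaster. For Texas, for example, I want it to show only "Hurricane".

I am currently using dplyr::group_by and dplyr::summarize to summarize my data by year and state, and dplyr::mutate with purrr::map_int to create new variables holding the total number of natural disasters per year (n_disasters, using length) and the number of distinct natural disaster types (n_distinct, using n_distinct()).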
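The question does not show the code for this pipeline, so the following is only a guess at what it might look like. It assumes that `disastertype` is first collected into a list-column and that the counts are then derived with `purrr::map_int`. The data frame name `dat` and the `.groups = "drop"` argument are choices made for this sketch, not taken from the question.

```r
library(dplyr)
library(purrr)

# Toy data copied from the question.
dat <- data.frame(
  year = c(1998, 1998, 1998, 1998, 1998),
  country = c("US", "US", "US", "US", "US"),
  state = c("Texas", "Texas", "California", "New York", "New York"),
  deaths = c(12, 5, 9, 10, 18),
  injured = c(3, 1, 3, 5, 9),
  disastertype = c("Hurricane", "Hurricane", "Wild fire", "Flood", "Epidemic")
)

res <- dat %>%
  group_by(year, state) %>%
  summarise(
    # list-column: one character vector of disaster types per group
    disastertype = list(disastertype),
    deaths = sum(deaths),
    injured = sum(injured),
    .groups = "drop"
  ) %>%
  mutate(
    # total number of disasters per group
    n_disasters = map_int(disastertype, length),
    # number of distinct disaster types per group
    n_distinct = map_int(disastertype, dplyr::n_distinct),
    # list-column holding only the unique types
    u_disastertype = map(disastertype, unique)
  )
```

With a list-column, each row keeps a real vector of types rather than a pasted string, which matches the "vector / list" wording in the title.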

Starting dataset:

structure(list(year = c(1998, 1998, 1998, 1998, 1998), country = c("US", 
"US", "US", "US", "US"), state = c("Texas", "Texas", "California", 
"New York", "New York"), deaths = c(12, 5, 9, 10, 18), injured = c(3, 
1, 3, 5, 9), disastertype = c("Hurricane", "Hurricane", "Wild fire", 
"Flood", "Epidemic")), class = "data.frame", row.names = c(NA, 
-5L))

Resulting dataset:

structure(list(year = c(1998, 1998, 1998), state = c("California", 
"New York", "Texas"), u_disastertype = c("Wild fire", "Flood, Epidemic", 
"Hurricane"), disastertype = c("Wild fire", "Flood, Epidemic", 
"Hurricane, Hurricane"), deaths = c(9, 28, 17), injured = c(3, 
14, 4), n_distinct = c(1L, 2L, 1L), n_disasters = c(1L, 2L, 2L
)), class = c("grouped_df", "tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-3L), groups = structure(list(year = 1998, .rows = structure(list(
    1:3), ptype = integer(0), class = c("vctrs_list_of", "vctrs_vctr", 
"list"))), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-1L), .drop = TRUE))

Edit: edited for clarity.

【Comments】:

Tags: r dplyr

【Solution 1】:

Try aggregate. This takes the output of three aggregations and puts them together.

# dat is the starting dataset from the question.
# list2 returns the unique values followed by the number of distinct values.
list2 <- function(x){ c(unique(x), length(table(x))) }

# Grouping variables
lt <- list(year=dat$year, country=dat$country, state=dat$state )

# Combine three aggregations: column sums, unique types (plus their count), and row counts
data.frame( aggregate( dat[,c(4,5)], lt, sum ), 
  setNames( aggregate( dat$disastertype, lt, list2 )[,4, drop=FALSE], colnames(dat)[6] ), 
  setNames( aggregate( dat$disastertype, lt, length )[,4, drop=FALSE], "n_disasters") )

  year country      state deaths injured       disastertype n_disasters
1 1998      US California      9       3       Wild fire, 1           1
2 1998      US   New York     28      14 Flood, Epidemic, 2           2
3 1998      US      Texas     17       4       Hurricane, 1           2

Not sure whether you want to keep the n_... columns...

Edit: added "n_disasters"

EDIT2: added a suggestion for including the "distinct disasters"
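As a variation on the same idea, the unique types and their count can be kept in separate columns instead of being mixed into one vector. This is a minimal sketch rather than the answerer's code: the helper names (`sums`, `u_types`, `n_dist`, `n_total`, `res`) are made up for illustration, and the output column names follow the expected result in the question.

```r
# Toy data copied from the question.
dat <- data.frame(
  year = c(1998, 1998, 1998, 1998, 1998),
  country = c("US", "US", "US", "US", "US"),
  state = c("Texas", "Texas", "California", "New York", "New York"),
  deaths = c(12, 5, 9, 10, 18),
  injured = c(3, 1, 3, 5, 9),
  disastertype = c("Hurricane", "Hurricane", "Wild fire", "Flood", "Epidemic")
)

by_cols <- dat[c("year", "state")]

# Sum the numeric columns per year/state.
sums <- aggregate(dat[c("deaths", "injured")], by = by_cols, FUN = sum)

# One aggregate call per derived column; each FUN returns a single value per group.
u_types <- aggregate(dat["disastertype"], by = by_cols,
                     FUN = function(x) toString(unique(x)))
n_dist  <- aggregate(dat["disastertype"], by = by_cols,
                     FUN = function(x) length(unique(x)))
n_total <- aggregate(dat["disastertype"], by = by_cols, FUN = length)

names(u_types)[3] <- "u_disastertype"
names(n_dist)[3]  <- "n_distinct"
names(n_total)[3] <- "n_disasters"

# merge() joins on the columns the pieces share (year and state).
res <- Reduce(merge, list(sums, u_types, n_dist, n_total))
res
```

Because every aggregation returns exactly one value per group, each result is an ordinary atomic column, which makes the final table easier to filter or export.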

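【Discussion】:

Yes, I want to keep the n_ columns, see my answer! I stumbled upon it myself yesterday and am using it for now so that I can keep working with the dplyr package. I would be curious to know whether there is a solution using aggregate.
Do you also need the unique ones? That seems redundant since they are all 1. Where does distinct come from?
There was a mistake in my code that made n_distinct equal to 1 for every row. New York's n_distinct should actually be 2, because two different types of disaster occurred that year (flood and epidemic), whereas Texas had two disasters but only one distinct type (hurricane). Apologies for the coding error. I have corrected it in the question.

【Solution 2】:

A solution using dplyr with group_by and summarize. The key part is to run u_disastertype = toString(unique(disastertype)) before disastertype = paste(disastertype, collapse = ', '). summarise evaluates its expressions in order, so once disastertype has been overwritten with the collapsed string, any later expression sees that string instead of the original column.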

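library(dplyr)

# naturaldisaster is the starting dataset from the question
naturaldisaster2 <- naturaldisaster %>%
  group_by(year, state) %>%
  summarise(
    # compute the unique types first, while disastertype is still the raw column
    u_disastertype = toString(unique(disastertype)),
    # this overwrites disastertype with one collapsed string per group
    disastertype = paste(disastertype, collapse = ', '),
    deaths = sum(deaths),
    injured = sum(injured)
  )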
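To see why the order of the two expressions matters, here is a small sketch on a two-row toy data frame. The names `df`, `wrong` and `right` are invented for this illustration.

```r
library(dplyr)

# Two Texas hurricanes, as in the question's data.
df <- data.frame(
  state = c("Texas", "Texas"),
  disastertype = c("Hurricane", "Hurricane")
)

# Collapsing first: unique() then sees the already-collapsed string.
wrong <- df %>%
  group_by(state) %>%
  summarise(
    disastertype = paste(disastertype, collapse = ", "),
    u_disastertype = toString(unique(disastertype))
  )

# Unique first: unique() still sees the original two-element column.
right <- df %>%
  group_by(state) %>%
  summarise(
    u_disastertype = toString(unique(disastertype)),
    disastertype = paste(disastertype, collapse = ", ")
  )

wrong$u_disastertype  # "Hurricane, Hurricane"
right$u_disastertype  # "Hurricane"
```

An alternative that avoids the ordering trap altogether would be to give the collapsed string a different name, so that the original column is never overwritten inside summarise.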

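The answer is based on a Stack Overflow answer to a similar question, where only one operation was run on the column, whereas I run two operations on the same column: https://stackoverflow.com/a/46367425/11045110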
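The two count columns mentioned in the question (n_distinct and n_disasters) could be produced in the same summarise call. This is a hedged sketch, not part of the original answer: it uses `dplyr::n_distinct()` and `dplyr::n()`, and `.groups = "drop"` is an assumption made here to return an ungrouped tibble, whereas the expected result in the question is still grouped by year.

```r
library(dplyr)

# Toy data copied from the question.
naturaldisaster <- data.frame(
  year = c(1998, 1998, 1998, 1998, 1998),
  country = c("US", "US", "US", "US", "US"),
  state = c("Texas", "Texas", "California", "New York", "New York"),
  deaths = c(12, 5, 9, 10, 18),
  injured = c(3, 1, 3, 5, 9),
  disastertype = c("Hurricane", "Hurricane", "Wild fire", "Flood", "Epidemic")
)

naturaldisaster2 <- naturaldisaster %>%
  group_by(year, state) %>%
  summarise(
    # everything that needs the raw column comes before it is overwritten
    u_disastertype = toString(unique(disastertype)),
    n_distinct = n_distinct(disastertype),  # number of distinct types
    n_disasters = n(),                      # number of rows in the group
    disastertype = paste(disastertype, collapse = ", "),
    deaths = sum(deaths),
    injured = sum(injured),
    .groups = "drop"
  )

naturaldisaster2
```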
