How do I recategorize values and aggregate rows of a dataset in R? Answers

【Question title】: How do I recategorize values and aggregate rows of a dataset in R?
【Posted】: 2020-11-15 16:35:52
【Question】:

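I need to aggregate rows of my dataset to collapse age ranges. My dataset currently has age ranges in 5-year increments. I am trying to combine these ranges into broader categories while summing some variables (Population, X1, X2, X3, and X4) and keeping the variable "Category" the same for every row within a given ID.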

My dataset looks like this:

ID    Age.Range    Population   X1   X2   X3   X4   Category
1     05-09 years  10           1    0    0    1    a
1     10-14 years  20           0    0    1    0    a
1     30-34 years  10           0    0    1    0    a
1     40-44 years  15           2    0    0    1    a
2     05-09 years  15           1    1    0    2    b
2     25-29 years  10           0    0    0    0    b
3     10-14 years  15           0    1    2    0    a
3     15-19 years  10           1    0    0    1    a
3     20-24 years  15           0    0    1    3    a
3     30-34 years  20           0    0    1    0    a
3     35-39 years  10           0    1    0    0    a

I am trying to produce a new data frame that combines the ages so that my new age ranges are 05-29 years, 30-39 years, and 40-49 years. It should look like this:

ID    Age.Range    Population   X1   X2   X3   X4   Category
1     05-29 years  30           1    0    1    1    a
1     30-39 years  10           0    0    1    0    a
1     40-49 years  15           2    0    0    1    a
2     05-29 years  25           1    1    0    2    a
3     05-29 years  40           1    1    3    4    a
3     30-39 years  30           0    1    1    0    a

I have tried to do this with dplyr, but without success. Any help would be greatly appreciated!

【Comments】:

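In order for us to help you, please provide a reproducible example. For example, to produce a minimal data set, you can use head(), subset(), or indexing. Then use dput() to give us something that can be put into R immediately. Also, please make sure you know what to do when someone answers your question. More information can be found in Stack Overflow's help center. Thank you!
You could extract the minimum and maximum age of each current age group into two new columns, then recategorize them into the new groups you want.
I provided a solution, but I don't understand why the Category value for ID 2 at 05-29 years is a in your expected output. Shouldn't it be b? Either way, b is what my solution returns.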
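
The suggestion to pull the lower and upper ages into their own columns can be sketched in base R roughly as follows. This is only an illustration under stated assumptions: the sample rows are copied from the question (with X2 to X4 left out for brevity), the band boundaries 5, 30, 40, 50 come from the desired output, and the names Age.Min, Age.Max, New.Age.Range, and to_band are made up for this example.

```r
# Sample rows from the question (X2-X4 omitted to keep the sketch short)
df <- data.frame(
  ID = c(1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3),
  Age.Range = c("05-09 years", "10-14 years", "30-34 years", "40-44 years",
                "05-09 years", "25-29 years", "10-14 years", "15-19 years",
                "20-24 years", "30-34 years", "35-39 years"),
  Population = c(10, 20, 10, 15, 15, 10, 15, 10, 15, 20, 10),
  X1 = c(1, 0, 0, 2, 1, 0, 0, 1, 0, 0, 0),
  Category = c("a", "a", "a", "a", "b", "b", "a", "a", "a", "a", "a"),
  stringsAsFactors = FALSE
)

# Pull the lower and upper bound of each range into numeric columns
pattern <- "^([0-9]+)-([0-9]+).*$"
df$Age.Min <- as.numeric(sub(pattern, "\\1", df$Age.Range))
df$Age.Max <- as.numeric(sub(pattern, "\\2", df$Age.Range))

# Hypothetical helper: map an age onto the wider bands [5,30), [30,40), [40,50)
to_band <- function(age) {
  as.character(cut(
    age,
    breaks = c(5, 30, 40, 50),
    labels = c("05-29 years", "30-39 years", "40-49 years"),
    right = FALSE
  ))
}
df$New.Age.Range <- to_band(df$Age.Min)

# A source range that straddled two bands would disagree between its two ends
stopifnot(identical(df$New.Age.Range, to_band(df$Age.Max)))

# Sum the numeric columns within each ID / band / Category combination
result <- aggregate(
  cbind(Population, X1) ~ ID + New.Age.Range + Category,
  data = df,
  FUN = sum
)
result <- result[order(result$ID, result$New.Age.Range), ]
print(result)
```

Ages outside 5 to 49 would come back as NA from cut() here, so the breaks and labels would need extending for a wider dataset.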

Tags: r dplyr aggregate

【Solution 1】:

This should work:

library(dplyr)
library(stringr)

your_data %>%
  mutate(
    # lower bound of each 5-year range, e.g. "05-09 years" -> 5
    First.Age.In.Range = as.numeric(str_extract(Age.Range, "^[0-9]+")),
    New.Age.Range = case_when(
      First.Age.In.Range < 30 ~ "05-29 years",
      First.Age.In.Range < 40 ~ "30-39 years",
      First.Age.In.Range < 50 ~ "40-49 years",
      First.Age.In.Range < 60 ~ "50-59 years",
      ## not sure how high you need to go
      ## catch-all for the last category
      TRUE ~ "90-99 years"
    )
  ) %>%
  # Population is summed, so it must not be a grouping variable
  group_by(ID, New.Age.Range, Category) %>%
  summarize(across(c(Population, starts_with("X")), sum))
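
A quick way to sanity-check a regrouping of this kind is to confirm that the per-ID population totals are the same before and after collapsing. The sketch below assumes dplyr (1.0.0 or later) and stringr are installed, uses the sample rows from the question, and groups by ID, the new range, and Category only, so that Population is summed rather than treated as a grouping key. The object names collapsed, before, and after are made up for this example.

```r
library(dplyr)
library(stringr)

# Sample rows from the question
your_data <- data.frame(
  ID = c(1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3),
  Age.Range = c("05-09 years", "10-14 years", "30-34 years", "40-44 years",
                "05-09 years", "25-29 years", "10-14 years", "15-19 years",
                "20-24 years", "30-34 years", "35-39 years"),
  Population = c(10, 20, 10, 15, 15, 10, 15, 10, 15, 20, 10),
  X1 = c(1, 0, 0, 2, 1, 0, 0, 1, 0, 0, 0),
  X2 = c(0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1),
  X3 = c(0, 1, 1, 0, 0, 0, 2, 0, 1, 1, 0),
  X4 = c(1, 0, 0, 1, 2, 0, 0, 1, 3, 0, 0),
  Category = c("a", "a", "a", "a", "b", "b", "a", "a", "a", "a", "a"),
  stringsAsFactors = FALSE
)

collapsed <- your_data %>%
  mutate(
    First.Age.In.Range = as.numeric(str_extract(Age.Range, "^[0-9]+")),
    New.Age.Range = case_when(
      First.Age.In.Range < 30 ~ "05-29 years",
      First.Age.In.Range < 40 ~ "30-39 years",
      TRUE ~ "40-49 years"  # the sample data stops below 50
    )
  ) %>%
  group_by(ID, New.Age.Range, Category) %>%
  summarize(across(c(Population, starts_with("X")), sum), .groups = "drop")

# Per-ID totals should be unchanged by the regrouping
before <- your_data %>%
  group_by(ID) %>%
  summarize(Population = sum(Population), .groups = "drop")
after <- collapsed %>%
  group_by(ID) %>%
  summarize(Population = sum(Population), .groups = "drop")
stopifnot(isTRUE(all.equal(before$Population, after$Population)))

print(collapsed)
```

If the totals differ, the usual suspects are a summed column left in group_by() or an age range that fell through to the wrong case_when() branch.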

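【Discussion】:

【Solution 2】:

Here is a solution using the tidyr, stringr, and dplyr packages. It is similar to the one provided by Gregor Thomas. It also gives others a chance to work with a reproducible example while we wait for the question to be edited.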

df <- structure(list(ID = c(1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3), Age.Range = c("05-09 years", 
"10-14 years", "30-34 years", "40-44 years", "05-09 years", "25-29 years", 
"10-14 years", "15-19 years", "20-24 years", "30-34 years", "35-39 years"
), Population = c(10L, 20L, 10L, 15L, 15L, 10L, 15L, 10L, 15L, 
20L, 10L), X1 = c(1L, 0L, 0L, 2L, 1L, 0L, 0L, 1L, 0L, 0L, 0L), 
    X2 = c(0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 1L), X3 = c(0L, 
    1L, 1L, 0L, 0L, 0L, 2L, 0L, 1L, 1L, 0L), X4 = c(1L, 0L, 0L, 
    1L, 2L, 0L, 0L, 1L, 3L, 0L, 0L), Category = c("a", "a", "a", 
    "a", "b", "b", "a", "a", "a", "a", "a")), class = "data.frame", row.names = c(NA, 
-11L))


library(stringr)
library(dplyr)
library(tidyr)

df %>% 
  group_by(ID) %>% 
  separate(col = Age.Range, into = c("Age_1", "Age_2"), sep = "-") %>%
  # You will have to add ifelse statements if you have ages that are >49 in your dataset. 
  mutate(
    Age_2 = as.numeric(str_remove(Age_2, " years")),  # compare ages as numbers, not strings
    Age_1 = ifelse(Age_2 <= 29, "05-29 years", Age_1),
    Age_1 = ifelse(Age_2 > 29 & Age_2 <= 39, "30-39 years", Age_1),
    Age_1 = ifelse(Age_2 > 39 & Age_2 <= 49, "40-49 years", Age_1)
  ) %>%
  rename(Age.Range = Age_1) %>% 
  group_by(ID, Category, Age.Range) %>% 
  summarise(across(
    .cols = Population:X4, sum
  )) %>% 
  select(ID, Age.Range, Population, X1, X2, X3, X4, Category)


#> # A tibble: 6 x 8
#> # Groups:   ID, Category [3]
#>      ID Age.Range   Population    X1    X2    X3    X4 Category
#>   <dbl> <chr>            <int> <int> <int> <int> <int> <chr>   
#> 1     1 05-29 years         30     1     0     1     1 a       
#> 2     1 30-39 years         10     0     0     1     0 a       
#> 3     1 40-49 years         15     2     0     0     1 a       
#> 4     2 05-29 years         25     1     1     0     2 b       
#> 5     3 05-29 years         40     1     1     3     4 a       
#> 6     3 30-39 years         30     0     1     1     0 a

Created on 2020-11-15 by the reprex package (v0.3.0)

【Discussion】: