【Question title】: How to aggregate R dataframe of two columns based on values of another
【Posted】: 2021-07-24 01:21:12
【Question】:

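My data frame looks like the one below, where gender=="1" refers to men and gender=="2" refers to women, occupations run from A to U, and years run from 2010 to 2018 (this is a small example):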

Gender   Occupation    Year
1            A         2010
1            A         2010
2            A         2010
1            B         2010
2            B         2010
1            A         2011
2            A         2011
1            C         2011
2            C         2011

I want an output that counts the rows for each combination of year and occupation, split by gender, like this:

Year | Occupation | Men | Woman
2010 |      A     |  2  |   1
2010 |      B     |  1  |   1
2011 |      A     |  1  |   1
2011 |      C     |  1  |   1

I tried the following:

Nr_gender_occupation <- data %>%
   group_by(year, occupation) %>%
   summarise(
      Men = aggregate(gender=="1" ~ occupation, FUN= count),
      Women = aggregate(gender=="2" ~ occupation, FUN=count)
)

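【Comments】:

  • Just curious (for the sake of assumptions and possible data loss): do you care about non-binary gender values?
  • No, I only have binary gender values. Thanks!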

Tags: r dataframe aggregate summarize


【Solution 1】:

A data.table option using dcast

library(data.table)
dcast(setDT(df), Year + Occupation ~ c("Men", "Woman")[Gender], fun.aggregate = length)

which gives

   Year Occupation Men Woman
1: 2010          A   2     1
2: 2010          B   1     1
3: 2011          A   1     1
4: 2011          C   1     1
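
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same dcast idea. It rebuilds the sample data from the question and, as an assumption of this sketch rather than part of the answer above, stores the labels in a helper column named Sex so that the formula only uses plain column names. value.var, fun.aggregate and fill are spelled out instead of relying on defaults.

```r
library(data.table)

# Sample data from the question, rebuilt as a data.table
df <- data.table(
  Gender     = c(1L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L),
  Occupation = c("A", "A", "A", "B", "B", "A", "A", "C", "C"),
  Year       = c(2010L, 2010L, 2010L, 2010L, 2010L, 2011L, 2011L, 2011L, 2011L)
)

# Map the codes 1/2 to labels; "Sex" is a hypothetical helper column name
df[, Sex := c("Men", "Woman")[Gender]]

# Count rows per Year/Occupation/Sex; fill = 0L covers combinations with no rows
res <- dcast(df, Year + Occupation ~ Sex,
             value.var = "Gender", fun.aggregate = length, fill = 0L)
print(res)
```

With fill = 0L, a year/occupation pair that has only men (or only women) shows 0 in the other column rather than NA.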

【Comments】:

    【Solution 2】:

    We can use the values in 'Gender' as an index to relabel them, then use pivot_wider from tidyr to reshape the data into 'wide' format

    library(dplyr)
    library(tidyr)
    data %>%
     mutate(Gender = c("Male", "Female")[Gender]) %>%
     pivot_wider(names_from = Gender, values_from = Gender, values_fn = length)
    

    -output

    # A tibble: 4 x 4
    #  Occupation  Year  Male Female
    #  <chr>      <int> <int>  <int>
    #1 A           2010     2      1
    #2 B           2010     1      1
    #3 A           2011     1      1
    #4 C           2011     1      1
    

    Or using table with unnest_wider

    library(tidyr)
    data %>%
       group_by(Year, Occupation) %>%
       summarise(out = list(table(Gender)), .groups = 'drop') %>%     
       unnest_wider(out)
    

    Or we can use count with pivot_wider

    data %>%
      count(Gender, Occupation, Year) %>%
      pivot_wider(names_from = Gender, values_from = n)
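
Note that with the count and pivot_wider version as written, the new columns are named after the raw codes, i.e. `1` and `2`. A minimal sketch of one way to get labelled columns, assuming the same c(...)[Gender] relabelling used earlier in this answer, and adding values_fill so that missing combinations come out as 0 instead of NA:

```r
library(dplyr)
library(tidyr)

# Sample data from the question
data <- data.frame(
  Gender     = c(1L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L),
  Occupation = c("A", "A", "A", "B", "B", "A", "A", "C", "C"),
  Year       = c(2010L, 2010L, 2010L, 2010L, 2010L, 2011L, 2011L, 2011L, 2011L)
)

res <- data %>%
  # Relabel the codes before counting so the wide columns get readable names
  mutate(Gender = c("Men", "Woman")[Gender]) %>%
  # One row per Year/Occupation/Gender with the row count in column n
  count(Year, Occupation, Gender) %>%
  # Spread the counts into one column per label; absent combinations become 0
  pivot_wider(names_from = Gender, values_from = n, values_fill = 0L)

print(res)
```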
    

    data

    data <- structure(list(Gender = c(1L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L), 
        Occupation = c("A", "A", "A", "B", "B", "A", "A", "C", "C"
        ), Year = c(2010L, 2010L, 2010L, 2010L, 2010L, 2011L, 2011L, 
        2011L, 2011L)), class = "data.frame", row.names = c(NA, -9L
    ))
    

    【Comments】:

      【Solution 3】:

      You can also do the counting within your groups:

      library(dplyr)
      
      df %>% 
        group_by(Occupation, Year) %>% 
        summarize(Men = sum(Gender == 1),
                  Woman = sum(Gender == 2), .groups = "drop")
      

      Output

        Occupation  Year   Men Woman
        <chr>      <dbl> <int> <int>
      1 A           2010     2     1
      2 A           2011     1     1
      3 B           2010     1     1
      4 C           2011     1     1
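
The reason sum(Gender == 1) works as a counter is that the comparison returns one logical value per row, and sum() treats TRUE as 1 and FALSE as 0. A minimal sketch on the codes of a single group (2010, occupation A, taken from the sample data in the question):

```r
# Gender codes for the 2010 / occupation A group in the sample data
gender <- c(1L, 1L, 2L)

# The comparison yields one logical per row
is_man <- gender == 1
print(is_man)

# sum() coerces TRUE to 1 and FALSE to 0, so this is the number of matching rows
men   <- sum(gender == 1)
women <- sum(gender == 2)
print(c(Men = men, Woman = women))
```

If Gender could contain NA, sum() would return NA for that group; passing na.rm = TRUE avoids that.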
      

      【Comments】:
