【Question title】: Making a wider data frame using factor columns
【Posted】: 2020-12-03 05:15:29
【Question description】:

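OK, so this one is a bit long. I have several huge data frames that I am trying to make wider and eventually merge. I want to merge and group by year and county.

I have several columns containing factors that I am trying to spread. Essentially, I want to take the factor levels x, y, z and turn them into columns x, y, and z. There is an example below. In addition, I have several numeric columns that I want to sum by group.

I have tried to provide an example and some reproducible code. Hopefully that is enough, but if there is anything I can do to make things easier or clearer, please let me know. Thank you very much for your help!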

    YR<-as.factor( c(2019,2018,2019,2019,2018,2018,2019,2019,2018))
    STATE<-as.factor( c("CA","MA","KY","KY","CA","MA","KY","KY","CA"))
    COUNTY<-as.factor( c("C1","M1","K1","K2","C1","M2","K1","K2","C1"))
    CANCER<-as.factor(c("Cervical","Lung","Prostate","Breast","Cervical","Breast","Prostate","Prostate","Lung"))
    rand_fact<-as.factor(c("rf1","rf2","rf3","fr4","fr5","rf2","rf3","fr4","fr5"))
    rand_num<-as.numeric(c(4,3,5,7,3,5,3,24,9))
    rand_chr<-as.character(c("a","d","r","e","g","y","r","e","k"))
    TEST_DR<-data.frame(YR,STATE,COUNTY,CANCER,rand_fact,rand_num,rand_chr)
    rm(YR,STATE,COUNTY,CANCER,rand_chr,rand_num,rand_fact)
    > print(dplyr::arrange(TEST_DR, YR, COUNTY))  # rows shown sorted by YR, then COUNTY
        YR STATE COUNTY   CANCER rand_fact rand_num rand_chr
    1 2018    CA     C1 Cervical       fr5        3        g
    2 2018    CA     C1     Lung       fr5        9        k
    3 2018    MA     M1     Lung       rf2        3        d
    4 2018    MA     M2   Breast       rf2        5        y
    5 2019    CA     C1 Cervical       rf1        4        a
    6 2019    KY     K1 Prostate       rf3        5        r
    7 2019    KY     K1 Prostate       rf3        3        r
    8 2019    KY     K2   Breast       fr4        7        e
    9 2019    KY     K2 Prostate       fr4       24        e
    

Ideally, the output would look like the following, with rows grouped by YR and then COUNTY:

    TEST_DR<-arrange(.data = TEST_DR,YR,COUNTY)
    YR<-as.factor( c(2018,2018,2018,2019,2019,2019))
    STATE<-as.factor( c("CA","MA","MA","CA","KY","KY"))
    COUNTY<-as.factor( c("C1","M1","M2","C1","K1","K2"))
    Cervical<-as.numeric(c(1,0,0,1,0,0))
    Lung <-as.numeric(c(1,1,0,0,0,0))
    Prostate<-as.numeric(c(0,0,0,0,2,1))
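    Breast<-as.numeric(c(0,0,1,0,0,1))
    # rand_num summed within each YR/COUNTY group, matching the printed output
    rand_num<-as.numeric(c(12,3,5,4,8,31))
    
    TEST_DR2 <-data.frame(YR,STATE,COUNTY,Cervical,Lung,Prostate,Breast,rand_num)
    rm(YR,STATE,COUNTY,Cervical,Lung,Prostate,Breast,rand_num)
    > print(TEST_DR2)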

        YR STATE COUNTY Cervical Lung Prostate Breast rand_num
    1 2018    CA     C1        1    1        0      0       12
    2 2018    MA     M1        0    1        0      0        3
    3 2018    MA     M2        0    0        0      1        5
    4 2019    CA     C1        1    0        0      0        4
    5 2019    KY     K1        0    0        2      0        8
    6 2019    KY     K2        0    0        1      1       31

【Question comments】:

    Tags: r dataframe dplyr tidyr


    【Solution 1】:

    Here is one approach using count() and {tidyr}'s spread():

    YR <- as.factor( c(2019,2018,2019,2019,2018,2018,2019,2019,2018))
    STATE <- as.factor( c("CA","MA","KY","KY","CA","MA","KY","KY","CA"))
    COUNTY <- as.factor( c("C1","M1","K1","K2","C1","M2","K1","K2","C1"))
    CANCER <- as.factor(c("Cervical","Lung","Prostate","Breast","Cervical","Breast","Prostate","Prostate","Lung"))
    rand_fact <- as.factor(c("rf1","rf2","rf3","fr4","fr5","rf2","rf3","fr4","fr5"))
    rand_num <- as.numeric(c(4,3,5,7,3,5,3,24,9))
    rand_chr <- as.character(c("a","d","r","e","g","y","r","e","k"))
    TEST_DR <- data.frame(YR, STATE, COUNTY, CANCER, rand_fact, rand_num, rand_chr)
    rm(YR,STATE,COUNTY,CANCER,rand_chr,rand_num,rand_fact)
    
    library(dplyr, warn.conflicts = FALSE)
    library(tidyr)
    
    TEST_DR %>% 
      group_by(YR, STATE, COUNTY) %>%
      count(CANCER, rand_num = sum(rand_num)) %>%
      spread(CANCER, n, fill = 0)
    #> # A tibble: 6 x 8
    #> # Groups:   YR, STATE, COUNTY [6]
    #>   YR    STATE COUNTY rand_num Breast Cervical  Lung Prostate
    #>   <fct> <fct> <fct>     <dbl>  <dbl>    <dbl> <dbl>    <dbl>
    #> 1 2018  CA    C1           12      0        1     1        0
    #> 2 2018  MA    M1            3      0        0     1        0
    #> 3 2018  MA    M2            5      1        0     0        0
    #> 4 2019  CA    C1            4      0        1     0        0
    #> 5 2019  KY    K1            8      0        0     0        2
    #> 6 2019  KY    K2           31      1        0     0        1
    

    Created on 2020-12-02 by the reprex package (v0.3.0)

    For the newer {tidyverse} syntactic sugar...

    TEST_DR %>% 
      group_by(YR, STATE, COUNTY) %>%
      count(CANCER, rand_num = sum(rand_num)) %>%
      pivot_wider(names_from = CANCER, values_from = n, values_fill = 0)
    

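    【Comments】:

    • You can also use tidyr::replace_na if needed: tidyr.tidyverse.org/reference/replace_na.html
    • Or just add fill = 0 to spread. Also, I believe the package developers encourage using pivot_wider instead of spread.
    • As far as I know, there are no plans to deprecate gather/spread. But you are right, and +1 for the fill.
    • This looks perfect. I will try it on my larger DF and let you know how it goes. Thanks! Can count handle multiple columns, or should I use multiple pipes (?) to count/sum/average different columns?
    • Regarding your first question: yes, count can take multiple columns. I don't think your example has enough rows, i.e. enough variation of rand_fact within YR, STATE, COUNTY, to demonstrate it, but you should be able to add them separated by commas.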
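The last question in the comments (summing or averaging several columns while also counting) can be sketched in three steps. This is only one possible arrangement, not the answerer's code: it assumes dplyr >= 1.0.0 and tidyr >= 1.1.0, and the names `num_summary`, `cancer_counts`, `rand_sum` and `rand_mean` are made up for illustration.

```r
library(dplyr, warn.conflicts = FALSE)
library(tidyr)

# Same toy data as in the question (rand_fact and rand_chr omitted for brevity)
TEST_DR <- data.frame(
  YR     = factor(c(2019, 2018, 2019, 2019, 2018, 2018, 2019, 2019, 2018)),
  STATE  = factor(c("CA", "MA", "KY", "KY", "CA", "MA", "KY", "KY", "CA")),
  COUNTY = factor(c("C1", "M1", "K1", "K2", "C1", "M2", "K1", "K2", "C1")),
  CANCER = factor(c("Cervical", "Lung", "Prostate", "Breast", "Cervical",
                    "Breast", "Prostate", "Prostate", "Lung")),
  rand_num = c(4, 3, 5, 7, 3, 5, 3, 24, 9)
)

# Step 1: one row per YR/STATE/COUNTY holding several numeric summaries
# (rand_sum and rand_mean are hypothetical column names)
num_summary <- TEST_DR %>%
  group_by(YR, STATE, COUNTY) %>%
  summarise(rand_sum = sum(rand_num),
            rand_mean = mean(rand_num),
            .groups = "drop")

# Step 2: count each CANCER level per group, then spread the levels into columns
cancer_counts <- TEST_DR %>%
  count(YR, STATE, COUNTY, CANCER) %>%
  pivot_wider(names_from = CANCER, values_from = n, values_fill = 0)

# Step 3: join the two pieces on the shared key columns
wide <- left_join(num_summary, cancer_counts, by = c("YR", "STATE", "COUNTY"))
print(wide)
```

Keeping the numeric summaries separate from the counts means that more aggregates can be added in step 1 without touching the pivot in step 2.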
    【Solution 2】:

    Apart from having to aggregate the rand_num column, you can use dcast almost directly here. Here is how I would approach it:

    library(data.table)
    # Create a vector of keys that we can use for grouping and for 
    # identifying the columns for the left-hand-side of the dcast formula
    keys <- c("YR", "STATE", "COUNTY")
    # * dcast from data.table expects a data.table, so use either setDT or
    #   as.data.table to convert your data.frame to a data.table
    # * .N creates a count by the grouping variables. CANCER has been
    #   added since we want to count the number of instances. It will
    #   create a new column named "N" in the data
    # * rand_num is summed here so that each combination of keys and
    #   CANCER yields exactly one row (dcast then needs no aggregation)
    as.data.table(TEST_DR)[, list(rand_num = sum(rand_num), .N), c(keys, "CANCER")][
      # Sum the rand_num variable by the grouping variable
      , rand_num := sum(rand_num), keys][
        # Go from long to wide using dcast. 
        # * ... on the left-hand-side of the formula says to use all
        #   of the unspecified variables
        # * ~ CANCER says that the values from the CANCER column should
        #   become the new column names
        # * value.var = "N" says to fill in the combination of LHS and
        #   RHS with values from the N column
        # * fill = 0 puts 0 rather than NA where a combination is absent
        , dcast(.SD, ... ~ CANCER, value.var = "N", fill = 0)]
    #      YR STATE COUNTY rand_num Breast Cervical Lung Prostate
    # 1: 2018    CA     C1       12      0        1    1        0
    # 2: 2018    MA     M1        3      0        0    1        0
    # 3: 2018    MA     M2        5      1        0    0        0
    # 4: 2019    CA     C1        4      0        1    0        0
    # 5: 2019    KY     K1        8      0        0    0        2
    # 6: 2019    KY     K2       31      1        0    0        1
    

    "data.table" commands can be chained (similar to using pipes) by simply passing the result of one operation to the next. For example: as.data.table(df)[, do_something][, do_something_else][, do_even_more]
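For readers who find the chained brackets hard to follow, the same idea can be written with intermediate objects. This is a sketch rather than the answer's exact code: the names `dt`, `counts` and `wide` are made up here, `rand_num` is summed in the first grouping step so that each keys/CANCER combination gives exactly one row, and `fill = 0` is passed to `dcast` so that absent combinations show 0.

```r
library(data.table)

# Same toy data as in the question (rand_fact and rand_chr omitted for brevity)
TEST_DR <- data.frame(
  YR     = factor(c(2019, 2018, 2019, 2019, 2018, 2018, 2019, 2019, 2018)),
  STATE  = factor(c("CA", "MA", "KY", "KY", "CA", "MA", "KY", "KY", "CA")),
  COUNTY = factor(c("C1", "M1", "K1", "K2", "C1", "M2", "K1", "K2", "C1")),
  CANCER = factor(c("Cervical", "Lung", "Prostate", "Breast", "Cervical",
                    "Breast", "Prostate", "Prostate", "Lung")),
  rand_num = c(4, 3, 5, 7, 3, 5, 3, 24, 9)
)

keys <- c("YR", "STATE", "COUNTY")

# Step 1: convert the data.frame to a data.table
dt <- as.data.table(TEST_DR)

# Step 2: per keys + CANCER combination, count the rows and sum rand_num
counts <- dt[, list(rand_num = sum(rand_num), N = .N), by = c(keys, "CANCER")]

# Step 3: replace rand_num with its total per YR/STATE/COUNTY
# (:= modifies counts in place, so no reassignment is needed)
counts[, rand_num := sum(rand_num), by = keys]

# Step 4: long to wide, one column per CANCER level, 0 where a level is absent
wide <- dcast(counts, ... ~ CANCER, value.var = "N", fill = 0)
print(wide)
```

Each intermediate object can be printed on its own, which makes it easier to see what every pair of brackets in the chained version contributes.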

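    【Comments】:

    • I wish you could walk me through this in person, because I am not following... keys sets the columns that do not change in the spread... what are all the brackets doing?
    • @JasonDeutsch, I added some comments to the code. Hopefully that makes things a bit clearer. The extra square brackets just pass the result of one operation on to the next step. You could choose to create intermediate "data.tables" instead of doing it all as one long command.
    • Huh!! I had never seen as.data.table used before. Usually I see as.data.frame, and this pipe-like chaining ability is fascinating! Your comments helped clarify a lot, so thank you for the extra work. I am still a little confused by the list function, but unfortunately I will have to come back and ask you about that later. Thanks so much for the help. I learned a lot!
    • If you have time, I posted a new question that I think you might have a solution for. Please and thank you!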