【Question title】: Merger and partial addition of rows without groups in R
【Posted】: 2018-10-09 22:42:08
【Question】:

Here is a reproducible example of my problem, written with dplyr in mind:

library(tidyverse)

df <- tibble(State = c("A", "A", "A", "A", "A", "A", "B", "B", "B"),
             District_code = c(1:9),
                 District = c("North", "West", "North West", "South", "East", "South East", 
                              "XYZ", "ZYX", "AGS"), 
                 Population = c(1000000, 2000000, 3000000, 4000000, 5000000, 6000000, 
                                7000000, 8000000, 9000000))

df
#> # A tibble: 9 x 4
#>   State District_code District   Population
#>   <chr>         <int> <chr>           <dbl>
#> 1 A                 1 North         1000000
#> 2 A                 2 West          2000000
#> 3 A                 3 North West    3000000
#> 4 A                 4 South         4000000
#> 5 A                 5 East          5000000
#> 6 A                 6 South East    6000000
#> 7 B                 7 XYZ           7000000
#> 8 B                 8 ZYX           8000000
#> 9 B                 9 AGS           9000000

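For some states, I need to merge the named districts into fewer geographical categories. In particular, State A should only have "North - West - North West" and "South - East - South East". Some variables, such as Population, must be summed, but others, such as District_code, should become NA. I found this example of operations across rows, but it is not quite the same. Grouping does not seem to apply.

The final result should look like this: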

new_df
#> # A tibble: 5 x 4
#>   State District_code District                  Population
#>   <chr>         <int> <chr>                          <dbl>
#> 1 A                NA North - West - North West    6000000
#> 2 A                NA South - East - South East   15000000
#> 3 B                 7 XYZ                          7000000
#> 4 B                 8 ZYX                          8000000
#> 5 B                 9 AGS                          9000000

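In the actual data frame there are many variables that must be summed (like Population) and several others that must be set to NA (like District_code).

Thanks a lot for your help!

【Comments】:

    Tags: r dataframe dplyr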


    【Solution 1】:

    You can use fct_collapse to specify the new factor levels, then use summarise on the new groups:

    df %>%
      mutate(District = 
               fct_collapse(District, 
                            "North - West - North West" = c("North", "West", "North West"), 
                            "South - East - South East" = c("South", "East", "South East"))) %>% 
      group_by(State, District) %>% 
      summarise(Population = sum(Population), 
                District_code = ifelse(n() > 1, NA_integer_, District_code))  # NA_integer_ keeps the column integer in every group
    
    # A tibble: 5 x 3
    # Groups:   State [?]
    #   State District                  Population
    #   <chr> <fct>                          <dbl>
    # 1 A     South - East - South East   15000000
    # 2 A     North - West - North West    6000000
    # 3 B     AGS                          9000000
    # 4 B     XYZ                          7000000
    # 5 B     ZYX                          8000000
    

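    If you only want to change the districts of one particular state, you can add a case_when or if_else like this, and make the summary function conditional on the column type (here Population is a double, whereas District_code is an integer):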

    df %>%
      mutate(District = 
               case_when(State == "A" ~ 
                           fct_collapse(District, 
                                        "North - West - North West" = c("North", "West", "North West"), 
                                        "South - East - South East" = c("South", "East", "South East")), 
                         TRUE ~ factor(District))) %>% 
      group_by(State, District) %>% 
      summarise_all(funs({if(is.double(.)) {
        sum(.) 
      } else {
        if (length(unique(.)) > 1) {
          NA
        } else {
          unique(.)
        }
      }}))
    

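    【Comments】:

    • Thanks, Kath! I think your second example makes this truly scalable. If I understand correctly, only the columns that are doubles get summed; the rest either become NA when two or more rows are grouped, or keep their value. Exactly what I need! I am getting a strange error, though: "Column DISTRICT_CODE can't promote group 138 to character". Is something in the evaluation not quite working?
    • That's odd. Is the DISTRICT_CODE column an integer, as in your example?
    • DISTRICT_CODE is actually character; I didn't realise that mattered for your solution. Changing it to int did make the whole thing run... but another character variable, REGION_TYPE ["Urban", "Rural"], is now simply set to NA everywhere? I'm not quite sure how this happens, since the code doesn't seem to discriminate between different kinds of variables... :-/ Thanks for any further clues!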
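
One possible explanation for the error in the comments above is that a bare NA is logical, while unique(.) returns the column's own type, so different groups can return different types for the same column. The following is only a sketch of a type-stable variant, assuming dplyr 1.0 or later (for across()). The helper name collapse_col and the shortened sample data are made up for illustration, and a plain case_when on character values stands in for fct_collapse to keep the example small.

```r
library(dplyr)

# Shortened, hypothetical sample data; District_code is character on purpose
df <- tibble(
  State         = c("A", "A", "A", "B"),
  District_code = c("1", "2", "3", "7"),
  District      = c("North", "West", "South", "XYZ"),
  Region_type   = c("Urban", "Urban", "Rural", "Urban"),
  Population    = c(1e6, 2e6, 4e6, 7e6)
)

# Hypothetical helper: sum doubles; for any other type keep the value if it
# is unique within the group, otherwise return an NA of the SAME type
collapse_col <- function(x) {
  if (is.double(x)) return(sum(x))
  if (n_distinct(x) > 1) x[NA_integer_] else x[1]
}

res <- df %>%
  mutate(District = case_when(
    State == "A" & District %in% c("North", "West") ~ "North - West",
    TRUE ~ District
  )) %>%
  group_by(State, District) %>%
  summarise(across(everything(), collapse_col), .groups = "drop")

res
```

Indexing with NA_integer_ returns a length-one missing value of the column's own type, so character columns stay character and integer columns stay integer in every group.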
    【Solution 2】:

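    For some states, I need to merge the named districts into fewer geographical categories. In particular, State A should only have "North - West - North West" and "South - East - South East".

    You need to write down the grouping rules, for example...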

    merge_rules = list(
      list(State = "A", District = c("North", "West", "North West")),
      list(State = "A", District = c("South", "East", "South East"))
    )
    

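    Some variables, such as Population, must be summed, but others, such as District_code, should become NA.

    I would do this by putting the merge rules in a table, doing the calculations after merging, and then rbind-ing the result with the unmerged rows. Here is the data.table way...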

    library(data.table)
    DT  = data.table(df)
    mDT = rbindlist(lapply(merge_rules, as.data.table), idcol = "g")
    
    gDT = DT[mDT, on=.(State, District)][, .(
      District_code = District_code[NA_integer_],
      District = paste(District, collapse = " - "),
      Population = sum(Population)
    ), by=.(g, State)]
    
    rbind(
      DT[!mDT, on=.(State, District)],
      gDT[, !"g"]
    )[order(State, District)]
    
       State District_code                  District Population
    1:     A            NA North - West - North West    6.0e+06
    2:     A            NA South - East - South East    1.5e+07
    3:     B             9                       AGS    9.0e+06
    4:     B             7                       XYZ    7.0e+06
    5:     B             8                       ZYX    8.0e+06
    

    And, I guess, the tidyverse way is similar:

    mtib = bind_rows(lapply(merge_rules, as_tibble), .id = "g")
    
    gtib = right_join(df, mtib, by=c("State", "District")) %>% 
      group_by(g, State) %>% summarise(
        District_code = District_code[NA_integer_],
        District = paste(District, collapse = " - "),
        Population = sum(Population)    
      )
    
    bind_rows(
      anti_join(df, mtib, by=c("State", "District")),
      gtib %>% ungroup %>% select(-g)
    ) %>% arrange(State, District)
    
    # A tibble: 5 x 4
      State District_code District                  Population
      <chr>         <int> <chr>                          <dbl>
    1 A                NA North - West - North West    6000000
    2 A                NA South - East - South East   15000000
    3 B                 9 AGS                          9000000
    4 B                 7 XYZ                          7000000
    5 B                 8 ZYX                          8000000
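
Since the question mentions many columns to sum and several to blank out, the rule-table approach above could be generalised along these lines. This is only a sketch, assuming dplyr 1.0 or later; the Area column, the sum_cols vector, and the shortened sample data are hypothetical, and inner_join is used in place of right_join on the assumption that every rule matches at least one row.

```r
library(dplyr)

# Shortened, hypothetical sample data with one extra column to sum
df <- tibble(
  State         = c("A", "A", "A", "B"),
  District_code = 1:4,
  District      = c("North", "West", "South", "XYZ"),
  Population    = c(1e6, 2e6, 4e6, 7e6),
  Area          = c(10, 20, 40, 70)
)

merge_rules <- list(
  list(State = "A", District = c("North", "West"))
)
sum_cols <- c("Population", "Area")  # columns to add up; all others become NA

# One row per (rule id, State, District)
mtib <- bind_rows(lapply(merge_rules, as_tibble), .id = "g")

gtib <- inner_join(df, mtib, by = c("State", "District")) %>%
  group_by(g, State) %>%
  summarise(
    # every non-summed column except District gets a typed NA
    across(-c(District, all_of(sum_cols)), ~ .x[NA_integer_]),
    District = paste(District, collapse = " - "),
    across(all_of(sum_cols), sum),
    .groups = "drop"
  )

res <- bind_rows(
  anti_join(df, mtib, by = c("State", "District")),  # rows no rule touches
  select(gtib, -g)
) %>% arrange(State, District)

res
```

bind_rows matches columns by name, so the column order produced by summarise does not have to match df.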
    

    【Comments】:

      【Solution 3】:

      Here is one way to get the aggregated population for State A:

      df %>% 
        filter(State == "A") %>%
        mutate(`North - West - North West` = (District == "North"|District == "West"|District == "North West"), 
               `South - East - South East` = (District == "South"|District == "East"|District == "South East")) %>% 
        gather(key = Districts, value = present, 5:6) %>% 
        filter(present != FALSE) %>% 
        group_by(Districts) %>% 
        summarise(Population = sum(Population))
      

      It gives the output:

        Districts          Population
        <chr>                   <dbl>
      1 North - West - No…    6000000
      2 South - East - So…   15000000
      

      Perhaps someone can help put the above back into the original df.
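
One way the aggregate might be folded back into the original data is sketched below. It assumes that every State A district belongs to one of the two merged categories, so all original State A rows can simply be replaced; the names agg_A and res are made up for illustration.

```r
library(dplyr)
library(tidyr)

df <- tibble(
  State         = c("A", "A", "A", "A", "A", "A", "B", "B", "B"),
  District_code = 1:9,
  District      = c("North", "West", "North West", "South", "East",
                    "South East", "XYZ", "ZYX", "AGS"),
  Population    = c(1e6, 2e6, 3e6, 4e6, 5e6, 6e6, 7e6, 8e6, 9e6)
)

# Aggregation from the answer above
agg_A <- df %>%
  filter(State == "A") %>%
  mutate(`North - West - North West` = District %in% c("North", "West", "North West"),
         `South - East - South East` = District %in% c("South", "East", "South East")) %>%
  gather(key = Districts, value = present, 5:6) %>%
  filter(present) %>%
  group_by(Districts) %>%
  summarise(Population = sum(Population))

# Rebuild the original column layout and append the untouched states
res <- agg_A %>%
  transmute(State = "A",
            District_code = NA_integer_,
            District = Districts,
            Population) %>%
  bind_rows(filter(df, State != "A"))

res
```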

      【Comments】:
