【Question title】: Combine attributes from two columns and sum the values from duplicate rows
【Posted】: 2021-04-12 11:00:37
【Description】:

This question is a slight modification of this one.

I have a data frame in long format, as shown below:

df1 <- data.frame(ID=c(1,1,1,1,1,1,2,2),
                  name=c("a","c","a","c","a","c","a","c"),
                  value=c("broad",50,"mangrove",50,"mangrove",50,"coniferous",50))
ID name       value
1    a        broad
1    c           50
1    a     mangrove
1    c           50
1    a     mangrove
1    c           50
2    a   coniferous
2    c           50

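About the data: the value 50 in the second row corresponds to the value broad in the first row. Likewise, the value 50 in the fourth row corresponds to the value mangrove in the third row, and so on. In short, each value with name c belongs to the value with name a in the row directly above it.

I want to combine the values so that each category gets its corresponding value, and so that values belonging to the same category are summed: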

df2 <- data.frame(ID=c(1,1,2),
                  name=c("c_broad","c_mangrove","c_coniferous"),
                  value=c(50,100,50))

The result should look like this:

ID         name    value
1       c_broad       50
1    c_mangrove      100
2  c_coniferous       50

【Question comments】:

    Tags: r dataframe


    【Solution 1】:

    Using reshape2:

    library(reshape2)
    
    df1$grp = cumsum(df1$name == "a")
    df2 = dcast(df1, ID + grp ~ name)
    df2$c = as.numeric(df2$c)
    
    aggregate(c ~ ID + a, df2, sum)
    
      ID          a   c
    1  1      broad  50
    2  2 coniferous  50
    3  1   mangrove 100
    

    If needed, you can rename the columns and use paste to prepend "c_" to the names.
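The renaming step mentioned above could be sketched as follows. This is a minimal base R sketch, not part of the original answer: the data frame `agg` is a hand-typed copy of the `aggregate()` output printed above, and the names `agg` and `res` are introduced here only for illustration.

```r
# Hand-typed copy of the aggregate() output shown above (assumption: same
# rows and column names as printed by Solution 1).
agg <- data.frame(ID = c(1, 2, 1),
                  a  = c("broad", "coniferous", "mangrove"),
                  c  = c(50, 50, 100))

# Prepend "c_" to the category and rename the columns to ID / name / value.
res <- data.frame(ID    = agg$ID,
                  name  = paste0("c_", agg$a),
                  value = agg$c)

# Sort by ID so the rows follow the order requested in the question.
res <- res[order(res$ID), ]
rownames(res) <- NULL
res
```

`paste0()` is used rather than `paste()` only to avoid having to set `sep = ""`; either works.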

    【Comments】:

      【Solution 2】:

      Using the tidyverse:

      library(dplyr)  # attaches %>%; the stringr package must also be installed

      value_a <- df1 %>% dplyr::filter(name=="a") %>% dplyr::pull(value)
      df1 %>%
        dplyr::filter(name=="c") %>% #Modify into a sensible data frame from here
        dplyr::mutate(a = value_a,
               name = stringr::str_c(name, "_" ,a)) %>%
        dplyr::select(-a) %>% # to here
        dplyr::group_by(ID, name) %>%
        dplyr::summarise(value=sum(as.numeric(value)))
      
      # A tibble: 3 x 3
      # Groups:   ID [2]
           ID name         value
        <dbl> <chr>        <dbl>
      1     1 c_broad         50
      2     1 c_mangrove     100
      3     2 c_coniferous    50
      

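      The main problem with your data frame is that a single column holds both names and values, and that is the first thing you should fix. My advice is to always reshape the original data frame into a tidy format first (https://tidyr.tidyverse.org/articles/tidy-data.html), and then take advantage of the tidyverse, data.table, or whichever framework you prefer.

      Note that the temporary variable value_a could be computed directly inside the pipeline; I kept it separate for clarity. The main idea is to put the values and the categories into different columns (the first three calls in the pipeline) and then apply the usual tidyverse operations.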
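To illustrate the "tidy first, aggregate second" idea without any packages, here is a minimal base R sketch. It is not the answer's own code: the column names `kind` and `amount` and the objects `tidy` and `out` are hypothetical, and it assumes the rows strictly alternate a, c, a, c, ... as in the question's data.

```r
df1 <- data.frame(ID = c(1, 1, 1, 1, 1, 1, 2, 2),
                  name = c("a", "c", "a", "c", "a", "c", "a", "c"),
                  value = c("broad", 50, "mangrove", 50, "mangrove", 50,
                            "coniferous", 50))

# Assumption: every "a" row is immediately followed by its "c" row.
is_a <- df1$name == "a"
stopifnot(sum(is_a) == sum(!is_a))

# Tidy layout: one row per pair, category and amount in separate columns.
# as.character() guards against `value` being a factor in R < 4.0.
tidy <- data.frame(ID     = df1$ID[is_a],
                   kind   = as.character(df1$value[is_a]),
                   amount = as.numeric(as.character(df1$value[!is_a])))

# With the data in tidy form, the aggregation is a single call.
out <- aggregate(amount ~ ID + kind, data = tidy, FUN = sum)
out
```

Once the data is in this shape, the same aggregation can be expressed equally well with dplyr or data.table.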

      【Comments】:

        【Solution 3】:

        Probably not the most elegant, but it works:

        library(dplyr)

        df1 <- data.frame(ID=c(1,1,1,1,1,1,2,2),
                          name=c("a","c","a","c","a","c","a","c"),
                          value=c("broad",50,"mangrove",50,"mangrove",50,"coniferous",50)
        )
        
        df1 %>% group_by( 1+floor((1:n()-1)/2) ) %>%
            summarize(
                ID = ID[1],
                name = paste0( name[2], "_", value[1] ),
                value = as.numeric(value[2])
            ) %>% ungroup %>% select( -1 ) %>% group_by(name) %>%
            mutate( value = sum(value) ) %>%
            unique
        
        

        Here is an improved version that is actually human-readable:

        
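        # Indices of the "a" rows; the matching "c" row sits at i + 1
        i <- seq( 1, nrow(df1), 2 )
        # Note: a summarise() call that returns several rows is deprecated in
        # dplyr >= 1.1.0, where reframe() is the suggested replacement
        df1 %>% summarise(
                    ID = ID[i],
                    name = paste0( name[i+1], "_", value[i] ),
                    value = as.numeric(value[i+1])
                ) %>% group_by(name) %>%
            summarize(
                ID=ID[1], value = sum( value )
            ) %>% arrange(ID)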
        
        

        【Comments】:

          【Solution 4】:

          Base R solution:

          # Nullify numeric values belonging to a grouping category: grps => character vector
          grps <- gsub("\\d+", NA, df1$value)
          
          # Interpolate NA values using prior string value: a => character vector
          df1$a <- na.omit(grps)[cumsum(!(is.na(grps)))]
          
          # Split-Apply-Combine aggregation: data.frame => stdout(console)
          data.frame(do.call(rbind, lapply(with(df1, split(df1, a)), function(x){
                y <- transform(subset(x, !grepl("\\D+", value)), value = as.numeric(value))
                setNames(
                  aggregate(value ~ ID + a, y, FUN = function(z){sum(z, na.rm = TRUE)}),
                  c("ID", "a", "c")
                  )
                }
              )
            ), 
          row.names = NULL
          )
          

          【Comments】:

            【Solution 5】:

            Another option:

            df1 <- data.frame(ID=c(1,1,1,1,1,1,2,2),
                              name=c("a","c","a","c","a","c","a","c"),
                              value=c("broad",50,"mangrove",50,"mangrove",50,"coniferous",50))
            
            library(tidyverse)
            df1 %>% 
              pivot_wider(id_cols = ID, names_from = name, values_from = value) %>% 
              unnest(c("a", "c")) %>% 
              group_by(ID, name = a) %>% 
              summarise(value = sum(as.numeric(c), na.rm = TRUE), .groups = "drop") 
            
            #> # A tibble: 3 x 3
            #>      ID name       value
            #>   <dbl> <chr>      <dbl>
            #> 1     1 broad         50
            #> 2     1 mangrove     100
            #> 3     2 coniferous    50
            

            Created on 2021-04-12 by the reprex package (v2.0.0)

            【Comments】:
