【Question title】: Add new column to dataframe with duplicated values pasted together [duplicate]
【Posted】: 2019-11-18 20:07:33
【Question】:

I have a df that looks like this:

ID  Country
55  Poland
55  Romania
55  France
98  Spain
98  Portugal
98  UK
65  Germany
67  Luxembourg
84  Greece
22  Estonia
22  Lithuania
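
For anyone who wants to run the answers, the table above can be rebuilt roughly as follows. Storing Country as character (rather than factor) and ID as integer are assumptions; some of the printed outputs further down show Country as a factor.

```r
# Rebuild the example data from the question.
# Assumption: Country is stored as character, ID as integer.
df <- data.frame(
  ID = c(55L, 55L, 55L, 98L, 98L, 98L, 65L, 67L, 84L, 22L, 22L),
  Country = c("Poland", "Romania", "France", "Spain", "Portugal", "UK",
              "Germany", "Luxembourg", "Greece", "Estonia", "Lithuania"),
  stringsAsFactors = FALSE
)

# IDs that occur more than once are the "groups" the question refers to.
repeated_ids <- unique(df$ID[duplicated(df$ID)])
```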

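Some of the IDs are repeated because they belong to the same group. What I want to do is paste together all the Country values that share the same ID, giving each group a combined value such as Countries Poland + Romania + France.

So far I have tried ifelse(df[duplicated(df$ID) | duplicated(df$ID, fromLast = TRUE),], paste('Countries', df$Country), NA), but this does not return the expected output.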

【Comments】:

    Tags: r dataframe


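    【Solution 1】:

    Using purrr and dplyr (nest() and unnest() come from tidyr):

        library(dplyr)
        library(tidyr)
        library(purrr)
        # tidyr >= 1.0.0 prefers nest(data = -ID) and unnest(cols = data)
        df %>%
        nest(-ID) %>% 
        mutate(new_name = map_chr(data, ~ paste0(.x$Country, collapse = " + "))) %>% 
        unnest()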
    

    Result:

      ID new_name                  Country     
      55 Poland + Romania + France Poland    
      55 Poland + Romania + France Romania   
      55 Poland + Romania + France France    
      98 Spain + Portugal + UK     Spain     
      98 Spain + Portugal + UK     Portugal  
      98 Spain + Portugal + UK     UK        
      65 Germany                   Germany   
      67 Luxembourg                Luxembourg
      84 Greece                    Greece    
      22 Estonia + Lithuania       Estonia   
      22 Estonia + Lithuania       Lithuania 
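
The same nest, map, unnest idea can be sketched with the argument names that tidyr 1.0.0 and later expect. The four-row data frame is a shortened stand-in for the question's data, and the names `res` and `data` are choices made for this illustration only.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Shortened stand-in for the question's data (assumption: character Country).
df <- data.frame(
  ID = c(55L, 55L, 98L, 65L),
  Country = c("Poland", "Romania", "Spain", "Germany"),
  stringsAsFactors = FALSE
)

res <- df %>%
  nest(data = -ID) %>%   # one row per ID; remaining columns go into list-column `data`
  mutate(new_name = map_chr(data, ~ paste(.x$Country, collapse = " + "))) %>%
  unnest(cols = data)    # expand back to one row per original row
```

Every row of a group carries the same pasted string here; blanking the repeats is a separate step, as the other answers show.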
    

    【Comments】:

      【Solution 2】:

      Using data.table:

      library(data.table)
      
      setDT(df)[, New_Name := c(paste0(Country, collapse = " + ")[1L],  rep(NA, .N -1)), by = ID]
      
      #df
      #ID    Country                  New_Name
      #1: 55     Poland Poland + Romania + France
      #2: 55    Romania                      <NA>
      #3: 55     France                      <NA>
      #4: 98      Spain     Spain + Portugal + UK
      #5: 98   Portugal                      <NA>
      #6: 98         UK                      <NA>
      #7: 65    Germany                   Germany
      #8: 67 Luxembourg                Luxembourg
      #9: 84     Greece                    Greece
      #10: 22    Estonia       Estonia + Lithuania
      #11: 22  Lithuania                      <NA>
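
The one-liner above does two things at once: it pastes the countries per ID and it pads every row after the first with NA. A minimal sketch that splits those two steps apart, on a shortened stand-in for the data (the names `dt` and `pasted` are made up for this example):

```r
library(data.table)

# Shortened stand-in for the question's data.
dt <- data.table(
  ID = c(55L, 55L, 98L, 65L),
  Country = c("Poland", "Romania", "Spain", "Germany")
)

# Step 1: compute one pasted string per ID; := recycles it to every row of the group.
dt[, pasted := paste(Country, collapse = " + "), by = ID]

# Step 2: keep the string only on the first row of each ID.
dt[rowid(ID) > 1L, pasted := NA_character_]
```

Using NA_character_ keeps the column type explicit, which is a stylistic choice rather than a requirement.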
      

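      【Comments】:

      • Another option: setDT(df)[rowid(ID)==1L, nn := df[, paste(Country, collapse=" + "), ID]$V1]
      【Solution 3】:

      Use aggregate, then use match so that only the first row of each ID is filled (the data frame is called dat here):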

      flat <- function(x) paste("Countries:", paste(x,collapse=", "))
      tmp <- aggregate(Country ~ ID, data=dat, FUN=flat)
      dat$Country <- NA
      dat$Country[match(tmp$ID, dat$ID)] <- tmp$Country
      
      #   ID                            Country
      #1  55 Countries: Poland, Romania, France
      #2  55                               <NA>
      #3  55                               <NA>
      #4  98     Countries: Spain, Portugal, UK
      #5  98                               <NA>
      #6  98                               <NA>
      #7  65                 Countries: Germany
      #8  67              Countries: Luxembourg
      #9  84                  Countries: Greece
      #10 22      Countries: Estonia, Lithuania
      #11 22                               <NA>
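
The trick in this answer is that match() returns the position of the first occurrence of each ID, so the assignment touches only one row per group. A small sketch of that step in isolation, assuming a shortened data frame and writing to a new column (`Combined`, a name invented here) so the original Country values are kept:

```r
# Shortened stand-in for the question's data.
dat <- data.frame(
  ID = c(55, 55, 98, 65),
  Country = c("Poland", "Romania", "Spain", "Germany"),
  stringsAsFactors = FALSE
)

# One row per ID; aggregate() returns the groups sorted by ID.
tmp <- aggregate(Country ~ ID, data = dat,
                 FUN = function(x) paste(x, collapse = ", "))

# Position of the first row of each ID in the original data.
first_row <- match(tmp$ID, dat$ID)

# Fill only those first rows; all other rows stay NA.
dat$Combined <- NA_character_
dat$Combined[first_row] <- tmp$Country
```

Because first_row is aligned with tmp, the sort order of aggregate() does not matter for the final result.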
      

      【Comments】:

        【Solution 4】:

        With dplyr, one way is:

        library(dplyr)
        df %>%
          group_by(ID) %>%
          mutate(new_name = paste0(Country,collapse = " + "), 
                 new_name = replace(new_name, duplicated(new_name), NA))
        
        #     ID Country    new_name                 
        #   <int> <fct>      <chr>                    
        # 1    55 Poland     Poland + Romania + France
        # 2    55 Romania    NA                       
        # 3    55 France     NA                       
        # 4    98 Spain      Spain + Portugal + UK    
        # 5    98 Portugal   NA                       
        # 6    98 UK         NA                       
        # 7    65 Germany    Germany                  
        # 8    67 Luxembourg Luxembourg               
        # 9    84 Greece     Greece                   
        #10    22 Estonia    Estonia + Lithuania      
        #11    22 Lithuania  NA                  
        

        However, to get your exact expected output, we may need:

        df %>%
           group_by(ID) %>%
           mutate(new_name = if (n() > 1) 
                 paste0("Countries ", paste0(Country,collapse = " + ")) else as.character(Country),
                 new_name = replace(new_name, duplicated(new_name), NA))
        
        
        
        #     ID Country    new_name                           
        #    <int> <fct>      <chr>                              
        # 1    55 Poland     Countries Poland + Romania + France
        # 2    55 Romania    NA                                 
        # 3    55 France     NA                                 
        # 4    98 Spain      Countries Spain + Portugal + UK    
        # 5    98 Portugal   NA                                 
        # 6    98 UK         NA                                 
        # 7    65 Germany    Germany                            
        # 8    67 Luxembourg Luxembourg                         
        # 9    84 Greece     Greece                             
        #10    22 Estonia    Countries Estonia + Lithuania      
        #11    22 Lithuania  NA                              
        

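        【Comments】:

        • To get the exact result from the original question, add ...mutate(new_name = paste("Countries",paste0(Country,collapse = " + ")), ...
        • @RonakShah Thanks!! But how can I add Countries only once, at the start of each group of countries, rather than every time a new country is listed? I.e. Countries Poland + Romania + France.
        • @Biostatician Oops... sorry. I didn't notice that the Countries part was being repeated. Updated the answer.
        【Solution 5】:

        Using base R: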

        replace(v1 <- with(df, ave(as.character(Country), ID, FUN = toString)), duplicated(v1), NA)
        
        #[1] "Poland, Romania, France" NA      NA    "Spain, Portugal, UK"     NA        NA    "Germany"      "Luxembourg"              "Greece"                  "Estonia, Lithuania"     
        #[11] NA 
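
The base R line above joins with ", " (that is what toString does) and leaves out the Countries prefix. A possible variant that uses " + " and adds the prefix only for IDs with more than one row is sketched below; the data frame is a shortened stand-in and the helper names (`pasted`, `group_size`, `label`) are invented for this example.

```r
# Shortened stand-in for the question's data.
df <- data.frame(
  ID = c(55, 55, 98, 65),
  Country = c("Poland", "Romania", "Spain", "Germany"),
  stringsAsFactors = FALSE
)

# Pasted string per ID, repeated on every row of the group.
pasted <- ave(df$Country, df$ID, FUN = function(x) paste(x, collapse = " + "))

# Number of rows in each row's group.
group_size <- ave(seq_along(df$ID), df$ID, FUN = length)

# Prefix only the groups that have more than one country.
label <- ifelse(group_size > 1, paste("Countries", pasted), pasted)

# Blank every row after the first of each ID.
label[duplicated(df$ID)] <- NA
df$new_name <- label
```

Blanking on duplicated(df$ID) rather than on the pasted strings avoids wiping out a later ID that happens to produce the same string as an earlier one.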
        

        【Comments】:
