【Question title】: Combine two rows into one based on column value in R
【Posted】: 2017-07-16 20:16:48
【Question description】:

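Please ignore this part and skip down to "Start here".

I am trying to combine the following two rows:

into a single row like this:

The code to create the dataset is below: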

dataset <- data.frame(Environment=c("PRODUCTION","PRODUCTION"),
                      Green=c("Yes","No"),
                      Red=c("No","Yes"),
                      Completed=c("Yes","Yes"))

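If the Environment column has the same value (PRODUCTION in this case), combine the two rows and return "Yes" wherever either row has "Yes". I didn't include code because nothing I tried worked. For example, this code only handles the duplicate by dropping it: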

dataset[!duplicated(dataset$Environment),]
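
To make the problem with that line concrete, here is a minimal sketch of what it returns on the two-row example. It assumes the columns are plain character vectors (`stringsAsFactors = FALSE`), and `first_only` is just an illustrative name.

```r
# Rebuild the two-row example from the question (strings kept as characters).
dataset <- data.frame(Environment = c("PRODUCTION", "PRODUCTION"),
                      Green       = c("Yes", "No"),
                      Red         = c("No", "Yes"),
                      Completed   = c("Yes", "Yes"),
                      stringsAsFactors = FALSE)

# duplicated() flags the second PRODUCTION row, so the subset keeps row 1 as-is.
first_only <- dataset[!duplicated(dataset$Environment), ]
print(first_only)

# Red is still "No": the "Yes" from the dropped row was never merged in.
first_only$Red
```

So `duplicated()` removes the extra row but does not combine any values, which is why it cannot give the desired result by itself.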

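Any help would be appreciated.

Start here - question update

I realized that my question did not reflect the problem I am trying to solve. Let me try again. Here is the dataset:

I would like it to look like this:

There may be many other columns. What I want is this: if the same ID has the same Environment more than once, combine those rows and return Yes in each column where any of them has Yes; otherwise keep the default value. I hope that is worded better.

Here is the new dataset: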

dataset <- data.frame(ID=c(15,15,15,16,16,16,16),Environment=c("PRODUCTION","PRODUCTION", "TRAINING",
                                                               "PRODUCTION","PRODUCTION", "TRAINING", "STAGING"),
                      Green=c("Yes","No", "Yes","Yes","No", "Yes", "Yes"),
                      Red=c("No","Yes", "No","No","Yes", "No", "No"),
                      Completed=c("Yes","Yes", "No","Yes","Yes", "No", "No"))
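
The rule described above ("Yes" if any row in the group has "Yes", otherwise the existing value) can be sketched in base R as follows. This is only an illustration under stated assumptions: `any_yes`, `value_cols` and `combined` are hypothetical names, the columns are assumed to be character vectors, and "the default value" is taken to mean the first value seen in the group.

```r
dataset <- data.frame(
  ID          = c(15, 15, 15, 16, 16, 16, 16),
  Environment = c("PRODUCTION", "PRODUCTION", "TRAINING",
                  "PRODUCTION", "PRODUCTION", "TRAINING", "STAGING"),
  Green       = c("Yes", "No", "Yes", "Yes", "No", "Yes", "Yes"),
  Red         = c("No", "Yes", "No", "No", "Yes", "No", "No"),
  Completed   = c("Yes", "Yes", "No", "Yes", "Yes", "No", "No"),
  stringsAsFactors = FALSE)

# Hypothetical helper: "Yes" wins if present, else keep the group's first value.
any_yes <- function(x) if (any(x == "Yes")) "Yes" else x[1]

# Every column that is not part of the grouping key gets collapsed.
value_cols <- setdiff(names(dataset), c("ID", "Environment"))

combined <- aggregate(dataset[value_cols],
                      by  = dataset[c("ID", "Environment")],
                      FUN = any_yes)

# aggregate() reorders the groups, so sort for readability.
combined <- combined[order(combined$ID, combined$Environment), ]
print(combined)
```

Spelling the rule out in a helper avoids relying on how the labels happen to sort, at the cost of a few more lines.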

Based on @P.Routh's code, I think we are one step closer. I have modified the dataset to show that the static signature breaks the code:

dataset <- data.frame(ID=c(15,15,15,16,16,16,16),
                      Environment=c("PRODUCTION","PRODUCTION", "TRAINING",
                      "PRODUCTION","PRODUCTION", "TRAINING", "STAGING"),
                      Green=c("Yes","No", "Yes","Yes","No", "No", "Yes"),
                      Red=c("No","Yes", "No","No","Yes", "No", "No"),
                      White=c("No","No", "No","No","No", "No", "No"),
                      Black=c("No","No", "No","No","No", "No", "No"),
                      Completed=c("Yes","Yes", "No","Yes","Yes", "No", "No"))

From this, I want to get to this:

@P.Routh's code, modified as below, gives the wrong output:

df <- dataset%>%group_by(ID,Environment)%>%
  mutate(total = n())%>%  #this counter acts as the condition you need
  unite(signature,Green,Red,White,Black,Completed,sep = ":")%>% #combines the columns into one column
  mutate(dummy = "Yes:Yes:Yes:Yes:Yes")%>% #just a dummy column to facilitate specifying the condition
  mutate(new_val = ifelse(total>1,dummy,signature))%>% #this is the condition
  select(-signature:-dummy)%>%
  separate(new_val, c("Green","Red","White","Black","Completed"),":") #restores original output
unique(df)

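【Comments on the question】:

  • Do we need to include a condition to check whether Environment has more than one value? @LeeS
  • @P.Routh is right. I realized my question was lacking. The solutions work for a single Environment value, so I have been revising the question; please see above.
  • Please see whether my solution works.
  • @P.Routh.. I saw it. I had to take a walk to get away from the screen. I am testing it now.
  • Thanks @P.Routh and everyone else.

Tags: r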


【Solution 1】:

Try this, using dplyr and zoo.

First approach

dataset[dataset=='No']=NA  
dataset%>%group_by(Environment)%>%mutate_each(funs(na.locf))%>%filter(row_number()==n())

  Environment  Green    Red Completed
       <fctr> <fctr> <fctr>    <fctr>
1  PRODUCTION    Yes    Yes       Yes

Second approach, from @eipi10

dataset %>% group_by(Environment) %>% summarise_all(funs(max(as.character(.)))) 

#For the detail 
    #'Yes'>'No'
    #[1] TRUE

    #max('Yes','No')
    #[1] "Yes"
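
The `as.character()` call in the second approach matters because the columns in this thread are factors (the output above shows `<fctr>`; before R 4.0, `data.frame()` converted strings to factors by default). A small sketch of the difference, with `answers` and `factor_result` as illustrative names:

```r
answers <- c("Yes", "No", "No")

# Character comparison is alphabetical, so "Yes" sorts after "No".
"Yes" > "No"
max(answers)

# An unordered factor has no defined maximum, so max() raises an error;
# tryCatch() turns that error into a marker value instead of stopping.
factor_result <- tryCatch(max(factor(answers)),
                          error = function(e) "error")
factor_result

# Converting back to character restores the string comparison.
max(as.character(factor(answers)))
```

This only works because "Yes" happens to sort after "No"; with other labels the maximum may not be the value you want.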

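【Comments】:

  • I think you can do this: dataset %>% group_by(Environment) %>% summarise_all(funs(max(as.character(.))))
  • @eipi10 Thanks~ SO always surprises me with something new!
  • @eipi10 and @Wen, this seems to work on the test dataset. I tried it the way you did, @Wen, but without mutate_each. Thanks all, I will let you know in a few minutes.
【Solution 2】:

In base R, you can use aggregate like this: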

aggregate(dataset[-1], dataset["Environment"], function(x) max(as.character(x)))

which returns

  Environment Green Red Completed
1  PRODUCTION   Yes Yes       Yes

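The question appears to have changed after I answered. However, a small change to my original code produces the desired output (with the rows in a different order):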

aggregate(dataset[-(1:2)], dataset[c("Environment", "ID")], 
          function(x) max(as.character(x)))

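Note that this assumes the labels sort lexicographically with the success value after the failure value (here "Yes" sorts after "No"). If it were the other way round, you could take the min instead. Secondly, in situations like this it is easier to work with numeric codes than with text. A second solution would be to convert the text to numeric values, perform the operation above, and convert back.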
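
The numeric-code idea mentioned above could look roughly like the following. It is a sketch rather than the answerer's own code: it uses logical TRUE/FALSE as the code, assumes character columns holding only "Yes" and "No", and `flags`, `value_cols` and `collapsed` are illustrative names.

```r
dataset <- data.frame(
  ID          = c(15, 15, 15, 16, 16, 16, 16),
  Environment = c("PRODUCTION", "PRODUCTION", "TRAINING",
                  "PRODUCTION", "PRODUCTION", "TRAINING", "STAGING"),
  Green       = c("Yes", "No", "Yes", "Yes", "No", "Yes", "Yes"),
  Red         = c("No", "Yes", "No", "No", "Yes", "No", "No"),
  Completed   = c("Yes", "Yes", "No", "Yes", "Yes", "No", "No"),
  stringsAsFactors = FALSE)

value_cols <- setdiff(names(dataset), c("ID", "Environment"))

# Encode: "Yes" becomes TRUE, anything else FALSE.
flags <- as.data.frame(lapply(dataset[value_cols], function(x) x == "Yes"))

# Collapse: a group is TRUE if any of its rows is TRUE.
collapsed <- aggregate(flags,
                       by  = dataset[c("ID", "Environment")],
                       FUN = any)

# Decode back to the original labels.
collapsed[value_cols] <- lapply(collapsed[value_cols],
                                function(x) ifelse(x, "Yes", "No"))
print(collapsed)
```

With the values encoded this way, the result no longer depends on how the text labels sort.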

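【Comments】:

【Solution 3】:

A solution using dplyr. The key is to set the factor levels for all columns except Environment so that "Yes" comes first. After that, summarise each column with min. mutate_at together with summarise_at (or summarise_all, as in the code below) can complete this task efficiently.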

    # Load package
    library(dplyr)
    
    # Process the data
    dataset2 <- dataset %>%
      # Set factor level to all columns except Environment
      mutate_at(vars(-Environment), factor, levels = c("Yes", "No"), ordered = TRUE) %>%
      group_by(Environment) %>%
      summarise_all(funs(min(.)))
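
To see why the level order matters in this approach, here is a small base R sketch. The vectors `ordered_answers` and `all_no` are made up for illustration.

```r
answers <- c("No", "Yes", "No")

# Declaring "Yes" as the first level makes it the smallest value.
ordered_answers <- factor(answers, levels = c("Yes", "No"), ordered = TRUE)

# min() is defined for ordered factors and returns the lowest level present.
min(ordered_answers)
as.character(min(ordered_answers))

# When no "Yes" is present, the minimum falls back to "No".
all_no <- factor(c("No", "No"), levels = c("Yes", "No"), ordered = TRUE)
as.character(min(all_no))
```

Unlike the `max(as.character(.))` approach, this does not depend on alphabetical order, because the ranking is stated explicitly in `levels`.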
    

【Comments】:

【Solution 4】:

I hope I'm not too late. My solution uses dplyr and tidyr.

      library(dplyr)
      library(tidyr)
      
      df <- dataset%>%group_by(ID,Environment)%>%
      mutate(total = n())%>%  #this counter acts as the condition you need
      unite(signature,Green,Red,Completed,sep = ":")%>% #combines the columns into one column
      mutate(dummy = "Yes:Yes:Yes")%>% #just a dummy column to facilitate specifying the condition
      mutate(new_val = ifelse(total>1,dummy,signature))%>% #this is the condition
      select(-signature:-dummy)%>%
      separate(new_val, c("Green","Red","Completed"),":") #restores original output
      unique(df)
      

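【Comments】:

  • @P.Routh.. you are not late at all. I had to take a walk.. let me test it.. It works on the sample data frame I created.
  • Nice code. However, I think creating a static signature breaks your code.
  • @LeeS I agree. The code could be better. I was just trying to be creative. Sorry it doesn't work on your original data.
  • @P.Routh.. I appreciate your effort. It gave me something to think about. I looked at this and tried different approaches.. I could not figure out how to merge two rows with different values based on a specific column.
【Solution 5】:

Thanks to @P.Routh, @Wen and @eipi10. I took all of your ideas and came up with working code that I can actually use on my large dataset. Here is the dataset posted above and the code that works: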

      #load library
      library(dplyr)
      
      #create dataframe
      dataset <- data.frame(ID=c(15,15,15,16,16,16,16),
                            Environment=c("PRODUCTION","PRODUCTION", "TRAINING",
                            "PRODUCTION","PRODUCTION", "TRAINING", "STAGING"),
                            Green=c("Yes","No", "Yes","Yes","No", "No", "Yes"),
                            Red=c("No","Yes", "No","No","Yes", "No", "No"),
                            White=c("No","No", "No","No","No", "No", "No"),
                            Black=c("No","No", "No","No","No", "No", "No"),
                            Completed=c("Yes","Yes", "No","Yes","Yes", "No", "No"))
      
      
      df <- dataset%>%group_by(ID,Environment)%>% mutate(total = n())#add column total for counter of duplicates
      
      ddc<-df[df$total==1,]#subsets those without duplicates
      ddd<-df[df$total>1,]#subsets those with duplicates (two or more rows per group)
      
      ddd<- ddd %>% group_by(ID,Environment) %>% summarise_all(funs(max(as.character(.)))) 
      
      merge(ddc, ddd, all=TRUE)
      

Thanks everyone.

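Update

I thought about this some more and realized that I do not need all the intermediate steps to collapse the rows. As long as you group by a unique identifier, e.g. group_by(ID, Environment), the integrity of your data is preserved. I went a step further and modified the dataset to test it. See the new solution below: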

        dataset <- data.frame(ID=c(15,15,15,15,16,16,16,16),
                              Environment=c("PRODUCTION","PRODUCTION","PRODUCTION", "TRAINING",
                                            "PRODUCTION","PRODUCTION", "TRAINING", "STAGING"),
                              Green=c("Yes","No", "Yes", "Yes","Yes","No", "No", "Yes"),
                              Red=c("No","Yes", "No", "No","No","Yes", "No", "No"),
                              White=c("No","No", "Yes","Yes","No","No", "No", "No"),
                              Black=c("No","No", "No","No","No","No", "No", "No"),
                              Completed=c("Yes","Yes", "No","No","Yes","Yes", "No", "No"))
        
        dataset%>% group_by(ID,Environment) %>% summarise_all(funs(max(as.character(.))))
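
Since this thread was written, `funs()` has been deprecated and the `summarise_all()` family superseded in dplyr. Assuming dplyr 1.0.0 or later is installed, the same collapse could be written with `across()` as in the sketch below. `collapsed` is an illustrative name, and the columns are assumed to be character or factor.

```r
library(dplyr)

dataset <- data.frame(
  ID          = c(15, 15, 15, 15, 16, 16, 16, 16),
  Environment = c("PRODUCTION", "PRODUCTION", "PRODUCTION", "TRAINING",
                  "PRODUCTION", "PRODUCTION", "TRAINING", "STAGING"),
  Green       = c("Yes", "No", "Yes", "Yes", "Yes", "No", "No", "Yes"),
  Red         = c("No", "Yes", "No", "No", "No", "Yes", "No", "No"),
  White       = c("No", "No", "Yes", "Yes", "No", "No", "No", "No"),
  Black       = c("No", "No", "No", "No", "No", "No", "No", "No"),
  Completed   = c("Yes", "Yes", "No", "No", "Yes", "Yes", "No", "No"),
  stringsAsFactors = FALSE)

# One row per ID/Environment pair; each remaining column keeps its
# alphabetically largest value, which is "Yes" whenever one is present.
collapsed <- dataset %>%
  group_by(ID, Environment) %>%
  summarise(across(everything(), ~ max(as.character(.x))), .groups = "drop")

print(collapsed)
```

Inside a grouped `summarise()`, `everything()` skips the grouping columns, so ID and Environment are left untouched.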
        

【Comments】:
