【Question title】: Concatenating values from rows based on criteria
【Posted】: 2018-05-16 22:11:33
【Question】:

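I have a data frame df (see the code below) with nearly 100,000 rows listing the contacts for my programs. The list has columns showing the program (Program) and organization (O_ID) each contact is associated with, plus a column for the contact's role in the program. Whenever a contact is in more than one program, or has more than one role within a program, another row is created for that contact and the value of the role field changes.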

First   Last    C_ID    OrgName O_ID Program    Role
John    Smith   10045   Acme    901 Buildings   Primary
John    Smith   10045   Acme    901 Buildings   Communications
John    Smith   10045   Acme    901 Homes       Primary
Teddy   Bush    10046   Acme    901 Buildings   Primary
Teddy   Bush    10046   Acme    901 Buildings   Signatory
Jess    Clinton 10050   Consult 904 Homes       Signatory
Jess    Clinton 10050   Consult 904 Homes       Primary
Jess    Clinton 10050   Consult 904 Homes       Communications

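For presentation purposes, I am trying to minimize the number of rows. Specifically, when a contact is in the same organization and the same program, I want the contact to appear on a single row (rather than the several there are now), with the contact's roles combined into one string.

I tried this code, which partially works: ddply(df,.(df$C_ID, df$Program, df$O_ID), paste, sep=",")

The result is as follows: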

df$C_ID df$Program df$O_ID                        V1                                 V2
1       10045      Buildings         901         c("John", "John")                c("Smith", "Smith")
2       10045          Homes         901                      John                              Smith
3       10046      Buildings         901       c("Teddy", "Teddy")                  c("Bush", "Bush")
4       10050          Homes         904 c("Jess", "Jess", "Jess") c("Clinton", "Clinton", "Clinton")
                      V3                                 V4               V5                           V6
1        c(10045, 10045)                  c("Acme", "Acme")      c(901, 901)  c("Buildings", "Buildings")
2                  10045                               Acme              901                        Homes
3        c(10046, 10046)                  c("Acme", "Acme")      c(901, 901)  c("Buildings", "Buildings")
4 c(10050, 10050, 10050) c("Consult", "Consult", "Consult") c(904, 904, 904) c("Homes", "Homes", "Homes")
                                           V7
1              c("Primary", "Communications")
2                                     Primary
3                   c("Primary", "Signatory")
4 c("Signatory", "Primary", "Communications")

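The problems are:

1) The columns are rearranged (note that my actual dataset has many more columns) and the column names are gone.

2) The only column whose values should change is Role. However, the result combines the values of most columns even when the combined values are identical. For example, result column V1 (the first-name column) returns c("John", "John"); it should just read "John". The only column that should hold differing values is V7, e.g. c("Primary", "Communications").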

df<-structure(list(First = c("John", "John", "John", "Teddy", "Teddy", 
"Jess", "Jess", "Jess"), Last = c("Smith", "Smith", "Smith", 
"Bush", "Bush", "Clinton", "Clinton", "Clinton"), C_ID = c(10045L, 
10045L, 10045L, 10046L, 10046L, 10050L, 10050L, 10050L), OrgName = c("Acme", 
"Acme", "Acme", "Acme", "Acme", "Consult", "Consult", "Consult"
), O_ID = c(901L, 901L, 901L, 901L, 901L, 904L, 904L, 904L), 
    Program = c("Buildings", "Buildings", "Homes", "Buildings", 
    "Buildings", "Homes", "Homes", "Homes"), Role = c("Primary", 
    "Communications", "Primary", "Primary", "Signatory", "Signatory", 
    "Primary", "Communications")), .Names = c("First", "Last", 
"C_ID", "OrgName", "O_ID", "Program", "Role"), class = "data.frame", row.names = c(NA, 
-8L))

【Comments】:

    Tags: r merge dplyr paste


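    【Solution 1】:

    What paste needs is collapse = ", ", not sep. collapse builds a single string out of all the inputs. To use it here, I group by all of the identifying columns (name, organization, program, and so on) and then collapse the roles inside summarise.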

    library(tidyverse)
    
    df %>%
      group_by(First, Last, C_ID, OrgName, O_ID, Program) %>%
      summarise(roles_mult = paste(Role, collapse = ", "))
    #> # A tibble: 4 x 7
    #> # Groups:   First, Last, C_ID, OrgName, O_ID [?]
    #>   First Last     C_ID OrgName  O_ID Program   roles_mult                  
    #>   <chr> <chr>   <int> <chr>   <int> <chr>     <chr>                       
    #> 1 Jess  Clinton 10050 Consult   904 Homes     Signatory, Primary, Communi…
    #> 2 John  Smith   10045 Acme      901 Buildings Primary, Communications     
    #> 3 John  Smith   10045 Acme      901 Homes     Primary                     
    #> 4 Teddy Bush    10046 Acme      901 Buildings Primary, Signatory
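
To make the difference between sep and collapse concrete, here is a minimal base-R sketch. The vector roles is made up for illustration and stands in for the Role values of one group.

```r
# Hypothetical stand-in for the Role values of a single contact/program group.
roles <- c("Primary", "Communications")

# sep is placed between *different arguments* of paste(), element by element.
# With only one argument there is nothing to join, so the vector keeps length 2.
with_sep <- paste(roles, sep = ", ")

# collapse joins the elements of the resulting vector into one string.
with_collapse <- paste(roles, collapse = ", ")

print(with_sep)
print(with_collapse)
```

This is why the original paste(..., sep = ",") call left several values per group, while collapse yields the single string the question asks for.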
    

    【Comments】:

    • To be more concise, I think you could replace the existing group_by with group_by_at(vars(-Role))
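
A minimal sketch of that suggestion, assuming every column other than Role identifies the group. The data frame df_small is a cut-down, made-up subset of the question's data, and naming the output column Role (instead of roles_mult) is a choice made here to keep the original column name. Note that group_by_at is marked as superseded in recent dplyr releases, where group_by(across(-Role)) is the documented replacement.

```r
library(dplyr)

# Hypothetical reduced version of the question's data, for illustration only.
df_small <- data.frame(
  First   = c("John", "John", "John"),
  Last    = c("Smith", "Smith", "Smith"),
  Program = c("Buildings", "Buildings", "Homes"),
  Role    = c("Primary", "Communications", "Primary"),
  stringsAsFactors = FALSE
)

result <- df_small %>%
  group_by_at(vars(-Role)) %>%                      # group by every column except Role
  summarise(Role = paste(Role, collapse = ", ")) %>% # one string of roles per group
  ungroup()

print(result)
```

The appeal of this form is that it does not need to be edited when the real dataset has many more identifying columns.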
    【Solution 2】:

    You can also do this with dplyr.

    > df %>% distinct(First, Last, .keep_all=T)
      First    Last  C_ID OrgName O_ID   Program      Role
    1  John   Smith 10045    Acme  901 Buildings   Primary
    2 Teddy    Bush 10046    Acme  901 Buildings   Primary
    3  Jess Clinton 10050 Consult  904     Homes Signatory
    

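    【Comments】:

    • This doesn't give me the information I'm looking for. For example, John Smith has multiple roles. The Role column should show "Primary, Communications" for the Buildings program. He should also have an additional row for his participation in the Homes program.
    • I see what you mean now. It can still be modified further to suit your needs.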
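
One possible modification along those lines, offered as a sketch rather than as what the answerer had in mind: collapse the roles with mutate inside each C_ID / O_ID / Program group, then let distinct() drop the now-identical rows. It assumes all other columns are constant within a group, which holds for the sample data. Because no columns are dropped or renamed, the original column order and names are preserved.

```r
library(dplyr)

# The question's sample data, rebuilt with data.frame() so the block runs on its own.
df <- data.frame(
  First   = c("John", "John", "John", "Teddy", "Teddy", "Jess", "Jess", "Jess"),
  Last    = c("Smith", "Smith", "Smith", "Bush", "Bush", "Clinton", "Clinton", "Clinton"),
  C_ID    = c(10045L, 10045L, 10045L, 10046L, 10046L, 10050L, 10050L, 10050L),
  OrgName = c("Acme", "Acme", "Acme", "Acme", "Acme", "Consult", "Consult", "Consult"),
  O_ID    = c(901L, 901L, 901L, 901L, 901L, 904L, 904L, 904L),
  Program = c("Buildings", "Buildings", "Homes", "Buildings", "Buildings",
              "Homes", "Homes", "Homes"),
  Role    = c("Primary", "Communications", "Primary", "Primary", "Signatory",
              "Signatory", "Primary", "Communications"),
  stringsAsFactors = FALSE
)

combined <- df %>%
  group_by(C_ID, O_ID, Program) %>%
  mutate(Role = paste(Role, collapse = ", ")) %>%  # same combined string on every row of the group
  ungroup() %>%
  distinct()                                       # keep one copy of each now-identical row

print(combined)
```

On the sample data this gives four rows: John Smith once for Buildings and once for Homes, Teddy Bush for Buildings, and Jess Clinton for Homes.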