Matching data from multiple columns in R

【Question title】: Matching data from multiple columns in r
【Posted】: 2018-09-14 11:50:18
【Question】:

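I have two datasets:

Contacts2: This contains a list of ~100,000 contacts, their respective titles, and a set of columns describing the types of work a contact may be involved in. Here is a sample dataset: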

First<-c("George","Thomas","James","Jimmy","Howard","Herbert")
Last<-c("Washington", "Jefferson", "Madison", "Carter", "Taft", "Hoover")
Title<-c("CEO", "Accountant","Communications Specialist", "President", "Accountant", "CFO")
Finance<-NA
Executive<-NA
Communications<-NA

Contacts2<-as.data.frame(cbind(First,Last,Title,Finance,Executive,Communications))

    First       Last                     Title Finance Executive Communications
1  George Washington                       CEO    <NA>      <NA>           <NA>
2  Thomas  Jefferson                Accountant    <NA>      <NA>           <NA>
3   James    Madison Communications Specialist    <NA>      <NA>           <NA>
4   Jimmy     Carter                 President    <NA>      <NA>           <NA>
5  Howard       Taft                Accountant    <NA>      <NA>           <NA>
6 Herbert     Hoover                       CFO    <NA>      <NA>           <NA>

Note that the last three columns are numeric.
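One caveat about the sample code above: cbind() on character vectors builds a character matrix, so as.data.frame(cbind(...)) yields character columns (factors before R 4.0) rather than numeric ones, which is why the printout shows <NA>. A minimal sketch of building the same sample with data.frame() so the three flag columns really are numeric; the use of NA_real_ and stringsAsFactors = FALSE is an assumption about what was intended, not something the question states.

```r
# Build the sample contacts with data.frame() so each column keeps its own type.
Contacts2 <- data.frame(
  First = c("George", "Thomas", "James", "Jimmy", "Howard", "Herbert"),
  Last  = c("Washington", "Jefferson", "Madison", "Carter", "Taft", "Hoover"),
  Title = c("CEO", "Accountant", "Communications Specialist",
            "President", "Accountant", "CFO"),
  Finance        = NA_real_,  # numeric NA, recycled to all six rows
  Executive      = NA_real_,
  Communications = NA_real_,
  stringsAsFactors = FALSE    # assumption: keep names and titles as plain character
)

# Inspect the column types: three character columns, three numeric columns
str(Contacts2)
```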

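TableOfTitle: This dataset contains a list of ~1,000 unique titles and the same set of columns describing the types of work a contact may be involved in. For each title, I have put a 1 in the columns that describe that person's role.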

Title<-c("CEO","Accountant", "Communications Specialist", "President", "CFO")
Finance<-c(NA,1,NA,1,1)
Executive<-c(1,NA,NA,NA,1)
Communications<-c(NA,NA,1,NA,NA)
TableOfTitle<-as.data.frame(cbind(Title,Finance,Executive,Communications))

                      Title Finance Executive Communications
1                       CEO    <NA>         1           <NA>
2                Accountant       1      <NA>           <NA>
3 Communications Specialist    <NA>      <NA>              1
4                 President       1      <NA>           <NA>
5                       CFO       1         1           <NA>

Note that the last three columns are numeric.

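I am now trying to fill in the flag columns in Contacts2 from TableOfTitle, matching on the contact's Title field. For example, since TableOfTitle shows that anyone with the title CFO should have a 1 in the Finance and Executive fields, Herbert Hoover's record in Contacts2 should also have a 1 in those columns.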
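The lookup described here can be sketched in base R with match(), which returns, for each contact, the row of TableOfTitle holding the same title. This is only one possible approach, not the asker's code. It assumes both tables are built with data.frame() so the flag columns are numeric, and it fills a cell only when it is currently NA so that any value already present is left alone.

```r
Contacts2 <- data.frame(
  First = c("George", "Thomas", "James", "Jimmy", "Howard", "Herbert"),
  Last  = c("Washington", "Jefferson", "Madison", "Carter", "Taft", "Hoover"),
  Title = c("CEO", "Accountant", "Communications Specialist",
            "President", "Accountant", "CFO"),
  Finance = NA_real_, Executive = NA_real_, Communications = NA_real_,
  stringsAsFactors = FALSE
)
TableOfTitle <- data.frame(
  Title = c("CEO", "Accountant", "Communications Specialist", "President", "CFO"),
  Finance        = c(NA, 1, NA, 1, 1),
  Executive      = c(1, NA, NA, NA, 1),
  Communications = c(NA, NA, 1, NA, NA),
  stringsAsFactors = FALSE
)

# Row of TableOfTitle matching each contact's title (NA if the title is not listed)
idx <- match(Contacts2$Title, TableOfTitle$Title)

for (col in c("Finance", "Executive", "Communications")) {
  looked_up <- TableOfTitle[[col]][idx]   # lookup value for every contact
  empty     <- is.na(Contacts2[[col]])    # cells that have no value yet
  Contacts2[[col]][empty] <- looked_up[empty]
}

Contacts2
```

Because match() returns NA for a title that is missing from TableOfTitle, such contacts simply keep NA in all three columns.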

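【Comments】:

Why not merge the two sets?
Use merge(Contacts2, TableOfTitle, by = "Title", all.x = TRUE) to perform a left join.
And you don't need the Finance, Executive, and Communications columns in the first table, since they will be added by the join.
That could be an option, but what I'm trying to do is a bit more complicated. In my actual dataset, Contacts2 already has some values in the last three columns. I will eventually write a rule that does not replace a value if one already exists.
Then rename the columns in the second dataset before merging, and run the ifelse logic afterwards.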
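The suggestion in these comments (rename, merge, then ifelse) could look roughly like the following in base R. It is a sketch on a reduced, three-row sample; the `_new` suffix, the `flag_cols` helper, and Howard's pre-existing 0 are all invented for illustration.

```r
Contacts2 <- data.frame(
  First     = c("George", "Howard", "Herbert"),
  Title     = c("CEO", "Accountant", "CFO"),
  Finance   = c(NA, 0, NA),          # the 0 is an invented pre-existing value
  Executive = c(NA, NA, NA_real_),
  stringsAsFactors = FALSE
)
TableOfTitle <- data.frame(
  Title     = c("CEO", "Accountant", "CFO"),
  Finance   = c(NA, 1, 1),
  Executive = c(1, NA, 1),
  stringsAsFactors = FALSE
)

flag_cols <- c("Finance", "Executive")

# Rename the lookup columns so they do not collide with those in Contacts2
lookup <- TableOfTitle
names(lookup)[match(flag_cols, names(lookup))] <- paste0(flag_cols, "_new")

# Left join: keep every contact, even if the title is missing from the lookup
merged <- merge(Contacts2, lookup, by = "Title", all.x = TRUE)

# Keep an existing value; fall back to the lookup value only where it is NA
for (col in flag_cols) {
  new_col <- paste0(col, "_new")
  merged[[col]] <- ifelse(is.na(merged[[col]]), merged[[new_col]], merged[[col]])
  merged[[new_col]] <- NULL          # drop the helper column
}

merged
```

Note that merge() sorts the result by the key column by default, so the rows come back ordered by Title rather than in the original order of Contacts2.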

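Tags: r match subset lookup

【Solution 1】:

Here is a solution using dplyr. It is basically what some commenters have already recommended, except that it also satisfies the requirement of not overwriting any pre-existing data in the last 3 columns of Contacts2.

Note that ifelse() can be quite slow on large datasets, but for your stated task this shouldn't be noticeable. Algorithmically, this solution is also a bit clunky in other ways, but I was going for maximum readability here.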

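library(dplyr)

# Left join on Title; columns found in both tables get the suffixes .x (Contacts2) and .y (TableOfTitle)
Contacts2 <- left_join(Contacts2, TableOfTitle, by = "Title") %>%
             transmute(First = First,
                       Last = Last,
                       Title = Title,
                       # Keep an existing value; use the lookup value only where it is NA
                       Finance = ifelse(is.na(Finance.x), Finance.y, Finance.x),
                       Executive = ifelse(is.na(Executive.x), Executive.y, Executive.x),
                       Communications = ifelse(is.na(Communications.x), Communications.y, Communications.x))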

Sample output:

  First       Last                     Title Finance Executive Communications
 George Washington                       CEO    <NA>         1           <NA>
 Thomas  Jefferson                Accountant       1      <NA>           <NA>
  James    Madison Communications Specialist    <NA>      <NA>              1
  Jimmy     Carter                 President       1      <NA>           <NA>
 Howard       Taft                Accountant       1      <NA>           <NA>
Herbert     Hoover                       CFO       1         1           <NA>
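Regarding the remark that ifelse() may be slow: dplyr also provides coalesce(), which returns the first non-missing value at each position and expresses the same "keep the existing value, otherwise take the lookup" rule. The sketch below is an alternative, not part of the original answer; it uses a reduced sample with an invented pre-existing 0 for Howard and assumes the flag columns are numeric in both tables. Whether it is faster on a given dataset has not been measured here.

```r
library(dplyr)

Contacts2 <- data.frame(
  First     = c("George", "Howard", "Herbert"),
  Title     = c("CEO", "Accountant", "CFO"),
  Finance   = c(NA, 0, NA),          # the 0 is an invented pre-existing value
  Executive = c(NA, NA, NA_real_),
  stringsAsFactors = FALSE
)
TableOfTitle <- data.frame(
  Title     = c("CEO", "Accountant", "CFO"),
  Finance   = c(NA, 1, 1),
  Executive = c(1, NA, 1),
  stringsAsFactors = FALSE
)

result <- Contacts2 %>%
  left_join(TableOfTitle, by = "Title") %>%   # shared columns become .x / .y
  mutate(
    # coalesce() takes the Contacts2 value unless it is NA, then the lookup value
    Finance   = coalesce(Finance.x, Finance.y),
    Executive = coalesce(Executive.x, Executive.y)
  ) %>%
  select(-ends_with(".x"), -ends_with(".y"))  # drop the suffixed helper columns

result
```

Howard's existing 0 in Finance is kept, while George and Herbert pick up their values from TableOfTitle.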

【Discussion】: