【Question title】: How to merge two columns of one dataset into one column of another dataset
【Posted】: 2019-02-27 06:14:41
【Question】:

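I have two datasets, shown below.

In df1, full.name is one column and first of full name holds the first word of full.name. The country column in df1 is not correct, so I want to match df1 (full.name and first of full name) against the name column of df2. If either of the two df1 columns matches df2's name, the output should show the matched name and the corresponding corrected country. If neither full.name nor first of full name matches a name in df2, the output should keep full.name and first of full name and show NA for both name and corrected country.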

df1:

full.name       first of full name  country
karachi east    karachi             pakistan
phu my          phu                 england
phu my          phu                 india
delhi           delhi               china
west australia  west                england
west australia  west                australia
abu dhabai      abu                 xyz
south africa    south               africa

df2:

name            corrected.country
karachi         pakistan 
phu my          england
delhi           India
west australia  australia
abu             dubai

I want my output to be:

full.name       first of full name  country    name            corrected country
karachi east    karachi             pakistan   karachi         pakistan
phu my          phu                 england    phu my          england
phu my          phu                 india      phu my          england
delhi           delhi               china      delhi           India
west australia  west                england    west australia  australia
west australia  west                australia  west australia  australia
abu dhabai      abu                 xyz        abu             dubai
south africa    south               africa     NA              NA

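In short: I want to match df1's full.name and first of full name against df2's name. Whenever either df1 column matches, the output should include the name column and the corrected country column.

I know I have made this a little complicated, but I really want to solve it. Please help.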

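【Comments】:

  • I think using punctuation when asking a question is a good move. It helps a lot. You could also give a reproducible example of your data, e.g. with dput(head(df1)), and the same for df2.
  • I am sure I am not the only one who finds this hard to read. Could you post sample data.frames (i.e. the first few rows) and sample output?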
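
As a minimal sketch of what the first comment asks for: dput() prints R code that rebuilds an object, so its output can be pasted straight into a question. The small df1 below is a stand-in built from the first rows of the question's table, not the asker's real data.

```r
# Stand-in for the asker's df1 (illustrative subset of the rows in the question)
df1 <- data.frame(
  full.name = c("karachi east", "phu my", "delhi"),
  first.of.full.name = c("karachi", "phu", "delhi"),
  country = c("pakistan", "england", "china"),
  stringsAsFactors = FALSE
)

# Prints a structure(list(...)) call; paste that text into the question
dput(head(df1))

# Readers can then recreate the data by assigning the pasted text:
# df1 <- structure(list(...), class = "data.frame")
```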

Tags: r


【Solution 1】:

This should work as long as there are no duplicate rows in your data frames.

library(dplyr)

# Join on each key column, keep the matched df2 name, then combine the two results
df3 <- mutate(inner_join(df1, df2, by = c("full.name" = "name")), name = full.name) %>%
  dplyr::union(mutate(inner_join(df1, df2, by = c("first.of.full.name" = "name")), name = first.of.full.name)) %>%
  select(1, 2, 3, 5, 4) # just ordering the columns

df3


       full.name first.of.full.name   country           name corrected.country
1         phu my                phu   england         phu my           england
2         phu my                phu     india         phu my           england
3          delhi              delhi     china          delhi             India
4 west australia               west   england west australia         australia
5 west australia               west australia west australia         australia
6   karachi east            karachi  pakistan        karachi          pakistan
7     abu dhabai                abu       xyz            abu             dubai

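When you simply join two data.frames, the two key columns collapse into one, so I had to find a workaround to keep your name column in the result.

Mind the column names when copying my code. They should come out the same in R, where spaces in a header typically become dots (e.g. first.of.full.name).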
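
To illustrate the collapsing key column described above, here is a small sketch. It assumes a dplyr version whose joins accept the keep argument (1.0.0 or later, as far as I know); the two tiny data frames are illustrative subsets, and keep = TRUE is an alternative to the mutate() workaround, not what the answer itself uses.

```r
library(dplyr)

# Illustrative subsets of the question's data
df1 <- data.frame(
  full.name = c("phu my", "delhi"),
  first.of.full.name = c("phu", "delhi"),
  country = c("india", "china"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  name = c("phu my", "delhi"),
  corrected.country = c("england", "India"),
  stringsAsFactors = FALSE
)

# Default behaviour: df2's key column `name` is dropped from the result
collapsed <- inner_join(df1, df2, by = c("full.name" = "name"))
names(collapsed)

# keep = TRUE retains both key columns, so `name` survives without mutate()
kept <- inner_join(df1, df2, by = c("full.name" = "name"), keep = TRUE)
names(kept)
```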

Update:

To include names that are not in df2:

> df1_2
       full.name first.of.full.name   country
1   karachi east            karachi  pakistan
2         phu my                phu   england
3         phu my                phu     india
4          delhi              delhi     china
5 west australia               west   england
6 west australia               west australia
7     abu dhabai                abu       xyz
8      Stuttgart          Stuttgart   germany

bind_rows(df3, df1_2[rowSums(sapply(1:2, function(x) df1_2[,x] %in% df2$name)) == 0,])

       full.name first.of.full.name   country           name corrected.country
1         phu my                phu   england         phu my           england
2         phu my                phu     india         phu my           england
3          delhi              delhi     china          delhi             India
4 west australia               west   england west australia         australia
5 west australia               west australia west australia         australia
6   karachi east            karachi  pakistan        karachi          pakistan
7     abu dhabai                abu       xyz            abu             dubai
8      Stuttgart          Stuttgart   germany           <NA>              <NA>

df1_2 is your df1 with one new row, and df3 is the result above.
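
The row filter inside that bind_rows() call is dense, so here is a sketch of the same test written out step by step. The names hit_full, hit_first and unmatched are mine, and the data frames are cut-down illustrative versions of df1_2 and df2. The same filter also lists the df1 rows that have no match in df2 at all.

```r
# Illustrative subsets of df1_2 and df2
df1_2 <- data.frame(
  full.name = c("phu my", "abu dhabai", "Stuttgart"),
  first.of.full.name = c("phu", "abu", "Stuttgart"),
  country = c("india", "xyz", "germany"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  name = c("phu my", "abu"),
  corrected.country = c("england", "dubai"),
  stringsAsFactors = FALSE
)

# TRUE where the value in that column appears among df2's names
hit_full  <- df1_2$full.name %in% df2$name
hit_first <- df1_2$first.of.full.name %in% df2$name

# Rows where neither key column matched; the same rows the rowSums(...) == 0 test selects
unmatched <- df1_2[!(hit_full | hit_first), ]
unmatched
```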

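【Comments】:

  • Emily Kothes's answer and mine are different. You said you used mine. May I ask why you have now accepted her answer as the correct one?
  • OK, I have accepted yours, but thank you both for your help.
  • Thanks. If you also want to leave something for Emily, you can upvote her answer. That is a reward too.
  • Sorry to bother you again. May I ask another question related to this example: how do I find the values that are in df1 but do not match the name column in df2? May I edit the example? Please help me with this.
  • @Humpelstielzchen could you help? I have tried many things, but I cannot get the values that do not match at all.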
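【Solution 2】:

I first recreate your datasets. You do not need to run this part, since you already have your own data, but I include it here for anyone else who wants to reproduce the solution.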

df1 <- data.frame(stringsAsFactors=FALSE,
            full.name = c("karachi east", "phu my", "phu my", "delhi",
                          "west australia", "west australia", "abu dhabai"),
   first.of.full.name = c("karachi", "phu", "phu", "delhi", "west", "west",
                          "abu"),
              country = c("pakistan", "england", "india", "china", "england",
                          "australia", "xyz"))
df2 <- data.frame(stringsAsFactors=FALSE,
                name = c("karachi", "phu my", "delhi", "west australia", "abu"),
   corrected.country = c("pakistan", "england", "India", "australia", "dubai")
)

Now load the dplyr package. You can use inner_join() to match each "key" variable (i.e. full.name and first.of.full.name) against df2, then use union() to combine the two sets of results.

library(dplyr)

df3 <- union(inner_join(df1, df2, by = c("first.of.full.name" = "name")) , 
      inner_join(df1, df2, by = c("full.name" = "name")))

df3
#>        full.name first.of.full.name   country corrected.country
#> 1   karachi east            karachi  pakistan          pakistan
#> 2          delhi              delhi     china             India
#> 3     abu dhabai                abu       xyz             dubai
#> 4         phu my                phu   england           england
#> 5         phu my                phu     india           england
#> 6 west australia               west   england         australia
#> 7 west australia               west australia         australia

Broken into separate steps, that is:

library(dplyr)

df3 <- inner_join(df1, df2, by = c("first.of.full.name" = "name"))
df3
#>      full.name first.of.full.name  country corrected.country
#> 1 karachi east            karachi pakistan          pakistan
#> 2        delhi              delhi    china             India
#> 3   abu dhabai                abu      xyz             dubai

df4 <- inner_join(df1, df2, by = c("full.name" = "name"))
df4
#>        full.name first.of.full.name   country corrected.country
#> 1         phu my                phu   england           england
#> 2         phu my                phu     india           england
#> 3          delhi              delhi     china             India
#> 4 west australia               west   england         australia
#> 5 west australia               west australia         australia

df5 <- union(df3, df4)
df5
#>        full.name first.of.full.name   country corrected.country
#> 1   karachi east            karachi  pakistan          pakistan
#> 2          delhi              delhi     china             India
#> 3     abu dhabai                abu       xyz             dubai
#> 4         phu my                phu   england           england
#> 5         phu my                phu     india           england
#> 6 west australia               west   england         australia
#> 7 west australia               west australia         australia

Created on 2019-02-27 by the reprex package (v0.2.0).
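
The result above has no name column and drops df1 rows with no match, both of which the question asks for. A base R sketch that covers them in one pass with match() follows. Preferring the full.name match over the first-word match is my assumption, and the data frames are illustrative subsets of the question's tables.

```r
# Illustrative subsets of the question's df1 and df2
df1 <- data.frame(
  full.name = c("karachi east", "phu my", "west australia", "south africa"),
  first.of.full.name = c("karachi", "phu", "west", "south"),
  country = c("pakistan", "india", "england", "africa"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  name = c("karachi", "phu my", "west australia"),
  corrected.country = c("pakistan", "england", "australia"),
  stringsAsFactors = FALSE
)

# Position of each df1 key in df2$name (NA when there is no match)
idx_full  <- match(df1$full.name, df2$name)
idx_first <- match(df1$first.of.full.name, df2$name)

# Assumption: prefer the full-name match, fall back to the first-word match
idx <- ifelse(is.na(idx_full), idx_first, idx_full)

# Indexing with NA yields NA, so unmatched rows get NA in both new columns
result <- data.frame(
  df1,
  name = df2$name[idx],
  corrected.country = df2$corrected.country[idx],
  stringsAsFactors = FALSE
)
result
```

Because match() returns only the first hit, this relies on the names in df2 being unique, which holds for the data shown in the question.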

【Comments】:
