R: Merging 2 dataframes and applying reference data to all rows that match by one level (answers)

【Question title】: R: Merging 2 dataframes and applying reference data to all rows that match by one level
【Posted】: 2017-09-01 15:37:11
【Question】:

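I have two dataframes. One ("grny") is mostly reference data but also has some data in the "yield" column that I'm after. The other ("txie") holds "yield" data, with some NAs where data are missing. I want to merge them so that, for rows sharing a common value in "site", all cells are filled in wherever either dataframe has a value.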

Mostly year-by-year data:

txie<-data.frame (site=c(rep("smithfield",2),rep("belleville",3)),
yield=c((rnorm(4, mean=8)),NA),
year=c(1999:2000,1992:1994),
prim=c(rep("nt",2),rep(NA,3)))

Mostly reference, with some year-by-year yield data:

grny<-data.frame (site=c("smithfield","belleville",rep("nashua",3)),
yield=c(rep(NA,2),rnorm(3,mean=9)),
year=c(rep(NA,2),1990:1992),
prim=c(NA,"nt",sample(c("nt","ct"),3,rep=TRUE)),
lat=(c(rnorm(2,mean=45,sd=10),rep(49.1,3))))

What I want:

         site    yield year prim  lib      lat
1  smithfield 7.009178 1999   nt 1109     43.61828
2  smithfield 8.472677 2000   nt 1109     43.61828
3  belleville 8.857462 1992   nt 122      74.08792
4  belleville 7.368488 1993   nt 122      74.08792
5  belleville       NA 1994   nt 122      74.08792
6  nashua     7.494519 1990   nt 554      49.10000
8  nashua     8.696066 1991   ct 554      49.10000
9  nashua     8.051670 1992   nt 554      49.10000

What I have tried:

rbind.fill(txie,grny) #this appends rows to the correct columns but leaves NA's everywhere because it doesn't know I want data missing in grny filled in when it is available in txie
Reduce(function(x,y) merge(txie,grny, by="site", all.y=TRUE), list(txie,grny)) #this merges by rows but creates new variables from x and y.
merge(x = txie, y = grny, by = "site", all = TRUE) #this does the same as  the above (new variables from each x and y ending in .x or .y)
merge(x = txie, y = grny, by = "site", all.x = TRUE)#this does similar to above but merges based on the x df  (new variables from each x and y ending in .x or .y)
setkey(setDT(grny),site)[txie]# this gives a similar result to the all.x line

For example, merging with an outer join, I end up with:

     site  yield.x year.x prim.x  yield.y year.y prim.y      lat
1 belleville 6.766628   1992   <NA>       NA     NA     nt 34.92136
2 belleville 6.845789   1993   <NA>       NA     NA     nt 34.92136
3 belleville       NA   1994   <NA>       NA     NA     nt 34.92136
4 smithfield 8.841339   1999     nt       NA     NA   <NA> 49.81872
5 smithfield 7.313310   2000     nt       NA     NA   <NA> 49.81872
6     nashua       NA     NA   <NA> 9.173229   1990     ct 49.10000
7     nashua       NA     NA   <NA> 9.196018   1991     nt 49.10000
8     nashua       NA     NA   <NA> 7.336645   1992     ct 49.10000

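Stipulation: I want to keep the NAs that are already in the "yield" column (e.g. belleville in 1994). Does anyone have an answer, or can someone point me to examples of this kind of merge, where the data are already in one or more shared columns rather than each df bringing in new columns apart from the "by" variable?

Thanks!!!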

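【Comments】:

Am I wrong in saying that your by should be not just site, but the combination of site and year?
The example may be confusing, but no, it can stay simple and use only site as the by, because I will never be adding years for the same site.

Tags: r merge rbind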

【Solution 1】:

Using the dplyr package, you can do a full_join and then use the coalesce function to take the non-NA value from each column pair: yield.x vs. yield.y, prim.x vs. prim.y, and so on.

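library(dplyr)

# Outer-join on site, then take the first non-NA value from each .x/.y pair
full_join(txie, grny, by = "site") %>%
  mutate(year  = coalesce(year.x, year.y),
         yield = coalesce(yield.x, yield.y),
         prim  = coalesce(prim.x, prim.y)) %>%
  # Drop the suffixed columns now that they have been combined
  select(-c(year.x, year.y, yield.x, yield.y, prim.x, prim.y))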

        site      lat year     yield prim
1 smithfield 59.71994 1999  7.920844   nt
2 smithfield 59.71994 2000 10.122713   nt
3 belleville 34.93358 1992  8.622351   nt
4 belleville 34.93358 1993  7.360470   nt
5 belleville 34.93358 1994        NA   nt
6     nashua 49.10000 1990  9.083390   ct
7     nashua 49.10000 1991  8.073866   nt
8     nashua 49.10000 1992  8.725625   nt
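
For comparison, here is a minimal base R sketch of the same join-then-coalesce idea. It swaps dplyr's full_join and coalesce for merge(all = TRUE) and ifelse. The helper name merge_coalesce and the fixed toy values are assumptions made for illustration (the question's data use rnorm and sample, so its numbers are not reproducible), and the ifelse step assumes plain numeric or character columns, since ifelse drops factor and date attributes.

```r
# Deterministic toy data with the same shape as the question's data frames
# (values are made up for illustration).
txie <- data.frame(
  site  = c("smithfield", "smithfield", "belleville", "belleville", "belleville"),
  yield = c(7.0, 8.5, 8.9, 7.4, NA),
  year  = c(1999L, 2000L, 1992L, 1993L, 1994L),
  prim  = c("nt", "nt", NA, NA, NA),
  stringsAsFactors = FALSE
)
grny <- data.frame(
  site  = c("smithfield", "belleville", "nashua", "nashua", "nashua"),
  yield = c(NA, NA, 9.2, 9.1, 7.3),
  year  = c(NA, NA, 1990L, 1991L, 1992L),
  prim  = c(NA, "nt", "ct", "nt", "ct"),
  lat   = c(43.6, 74.1, 49.1, 49.1, 49.1),
  stringsAsFactors = FALSE
)

# Hypothetical helper: outer-join on `by`, then collapse every name.x/name.y
# pair into a single column, preferring the value from x when it is not NA.
merge_coalesce <- function(x, y, by) {
  m <- merge(x, y, by = by, all = TRUE)
  shared <- setdiff(intersect(names(x), names(y)), by)
  for (nm in shared) {
    nx <- paste0(nm, ".x")
    ny <- paste0(nm, ".y")
    m[[nm]] <- ifelse(is.na(m[[nx]]), m[[ny]], m[[nx]])
    m[[nx]] <- NULL  # drop the suffixed columns once combined
    m[[ny]] <- NULL
  }
  m
}

result <- merge_coalesce(txie, grny, by = "site")
print(result)
```

A value that is NA in both inputs (here the 1994 belleville yield) stays NA, which matches the stipulation in the question.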

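【Comments】:

Thanks! This works. Just an FYI for other newbies like me (this now seems obvious and is easy to fix): I first had to make sure that all vectors with the same name were of the same type in both dfs.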
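
One way to do the type check mentioned in that comment is sketched below in base R. It assumes the mismatch is a factor column in one data frame against a character column in the other, which was common when data.frame() converted strings to factors by default. The frames a and b and the helper factors_to_character are made up for illustration.

```r
# Toy frames where `prim` is a factor in one and character in the other
a <- data.frame(site = c("s1", "s2"), stringsAsFactors = FALSE)
a$prim <- factor(c("nt", NA))
b <- data.frame(site = c("s1", "s2"), prim = c(NA, "ct"),
                stringsAsFactors = FALSE)

# Hypothetical helper: convert every factor column to character
factors_to_character <- function(df) {
  df[] <- lapply(df, function(col) if (is.factor(col)) as.character(col) else col)
  df
}

a <- factors_to_character(a)
b <- factors_to_character(b)

# Compare the classes of the columns the two frames share
shared <- intersect(names(a), names(b))
classes_a <- vapply(a[shared], function(col) class(col)[1], character(1))
classes_b <- vapply(b[shared], function(col) class(col)[1], character(1))
print(data.frame(column = shared, a = classes_a, b = classes_b))

types_match <- all(classes_a == classes_b)
```

If types_match is FALSE, the printed table shows which shared columns still need converting before the join.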