Merge two data frames where one column is matched based on a condition: answers

【Question title】: Merge two data frames where one column is matched based on a condition
【Posted】: 2014-10-19 20:45:54
【Question】:

Mock data:

set.seed(1)
df1 <- data.frame(country=c("US", "UK"),
                  year=c(2000, 2003))
df2 <- data.frame(country=rep(c("US", "UK"), 10),
                  year=rep(2000:2009, 2),
                  myvar=rnorm(20))

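df1 contains the country-years of interest. I want to get the values of myvar for each of these country-years and for the 3 years before and after it.

In other words, the merge is based on the condition df2$country==df1$country AND df2$year > df1$year - 3 & df2$year < df1$year + 3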
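A minimal base R sketch of that condition, assuming the data are small enough to join on country first and filter afterwards. The suffixes .target and .obs and the names m and res are illustrative and not part of the original post.

```r
set.seed(1)
df1 <- data.frame(country = c("US", "UK"),
                  year = c(2000, 2003))
df2 <- data.frame(country = rep(c("US", "UK"), 10),
                  year = rep(2000:2009, 2),
                  myvar = rnorm(20))

# Join on country only; the suffixes keep the two year columns apart
# (year.target comes from df1, year.obs from df2).
m <- merge(df1, df2, by = "country", suffixes = c(".target", ".obs"))

# Keep the rows that fall strictly inside (year.target - 3, year.target + 3).
res <- m[m$year.obs > m$year.target - 3 & m$year.obs < m$year.target + 3, ]
res
```

The intermediate table m holds every df1/df2 pairing per country, so this approach trades memory for simplicity.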

Edit: my (working, inelegant) solution is to pad df1 so that it contains every country-year I am interested in, and then merge it with df2 in the usual way.

library(plyr)
ddply(df1, c("country", "year"), 
  function(df) data.frame(rep(df$country, 7), (df$year-3):(df$year+3)))

which produces

   country year rep.df.country..7. X.df.year...3...df.year...3.
1       UK 2003                 UK                         2000
2       UK 2003                 UK                         2001
3       UK 2003                 UK                         2002
4       UK 2003                 UK                         2003
5       UK 2003                 UK                         2004
6       UK 2003                 UK                         2005
7       UK 2003                 UK                         2006
8       US 2000                 US                         1997
9       US 2000                 US                         1998
10      US 2000                 US                         1999
11      US 2000                 US                         2000
12      US 2000                 US                         2001
13      US 2000                 US                         2002
14      US 2000                 US                         2003
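The padding step and the follow-up merge can be sketched in base R with named columns. This is only an illustration of the idea described in the edit; the names padded, target_year and res are assumptions, and the window of 3 years on each side mirrors the ddply call rather than the strict inequalities stated earlier.

```r
set.seed(1)
df1 <- data.frame(country = c("US", "UK"),
                  year = c(2000, 2003))
df2 <- data.frame(country = rep(c("US", "UK"), 10),
                  year = rep(2000:2009, 2),
                  myvar = rnorm(20))

# Expand every row of df1 into its 7-year window, one output row per year.
padded <- do.call(rbind, lapply(seq_len(nrow(df1)), function(i) {
  data.frame(country = df1$country[i],
             target_year = df1$year[i],
             year = (df1$year[i] - 3):(df1$year[i] + 3))
}))

# A regular equi-merge on country and year then picks up myvar.
res <- merge(padded, df2, by = c("country", "year"))
res
```

Because the loop runs over rows rather than over countries, a country that appears in df1 with several different years gets one window per row.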

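【Comments on the question】:

No time for a full answer right now, but I will post one later if I can. If you use data.table, take a look at the new foverlaps function from that package. It was used to answer this question

Tags: r merge dataframe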

【Solution 1】:

What is the merge for? This sounds like a subsetting problem, unless I have misunderstood the question (which, admittedly, I may have).

set.seed(1)
df1 <- data.frame(country=c("US", "UK"),
                  year=c(2000, 2003))
df2 <- data.frame(country=rep(c("US", "UK"), 10),
                  year=rep(2000:2009, 2),
                  myvar=rnorm(20))


f <- lapply(df1$country, function(x) {
  tmp <- df2[df2$country == x, ]
  tmp[abs(tmp$year - df1[df1$country == x, 'year']) <= 3, ]
})


do.call(rbind, f)

#    country year       myvar
# 1       US 2000 -0.62645381
# 3       US 2002 -0.83562861
# 11      US 2000  1.51178117
# 13      US 2002 -0.62124058
# 2       UK 2001  0.18364332
# 4       UK 2003  1.59528080
# 6       UK 2005 -0.82046838
# 12      UK 2001  0.38984324
# 14      UK 2003 -2.21469989
# 16      UK 2005 -0.04493361

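Edit (looping over the rows of df1, so that a country may appear more than once with different years):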

set.seed(1)
# country is recycled to length 4: US 2000, UK 2003, US 2009, UK 2009
df1 <- data.frame(country=c("US", "UK"),
                  year=c(2000, 2003, 2009, 2009))
df2 <- data.frame(country=rep(c("US", "UK"), 10),
                  year=rep(2000:2009, 2),
                  myvar=rnorm(20))

f <- lapply(seq_len(nrow(df1)), function(x) {
  y <- df1[x, 'country']
  tmp <- df2[df2$country == y, ]
  tmp[abs(tmp$year - df1[x, 'year']) <= 3, ]
})


do.call(rbind, f)

#    country year       myvar
# 1       US 2000 -0.62645381
# 3       US 2002 -0.83562861
# 11      US 2000  1.51178117
# 13      US 2002 -0.62124058
# 2       UK 2001  0.18364332
# 4       UK 2003  1.59528080
# 6       UK 2005 -0.82046838
# 12      UK 2001  0.38984324
# 14      UK 2003 -2.21469989
# 16      UK 2005 -0.04493361
# 7       US 2006  0.48742905
# 9       US 2008  0.57578135
# 17      US 2006 -0.01619026
# 19      US 2008  0.82122120
# 8       UK 2007  0.73832471
# 10      UK 2009 -0.30538839
# 18      UK 2007  0.94383621
# 20      UK 2009  0.59390132

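【Comments】:

This works for the mock data, but not in the more general setting where df1$country is not unique (because of the part df1[df1$country == x, 'year']). Note that my working solution handles the general case (I think). (My bad for posting oversimplified mock data.)
Is there a different year for each non-unique country? If not, you could use unique(df1[...)
Yes, different years for the non-unique countries are exactly the problem.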

【Solution 2】:

An attempt using foverlaps from data.table

set.seed(1)
df1 <- data.frame(country=c("US", "UK"),
                  year=c(2000, 2003, 2009, 2009))
df2 <- data.frame(country=rep(c("US", "UK"), 10),
                  year=rep(2000:2009, 2),
                  myvar=rnorm(20))
library(data.table)
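setDT(df1); setDT(df2) # convert both to data.table by reference
df1[, c("start", "end") := list(year-2, year+2)] # closed window [year-2, year+2]
setkey(df1, country, start, end)
setkey(df2[, year2:=year], country, year, year2) # each df2 year as the interval [year, year]
foverlaps(df1, df2, type="any")[,4:7:=NULL][] # drop helper columns 4 to 7, keep country, year, myvar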
    country year       myvar
 1:      UK 2001  0.18364332
 2:      UK 2001  0.38984324
 3:      UK 2003  1.59528080
 4:      UK 2003 -2.21469989
 5:      UK 2005 -0.82046838
 6:      UK 2005 -0.04493361
 7:      UK 2007  0.73832471
 8:      UK 2007  0.94383621
 9:      UK 2009 -0.30538839
10:      UK 2009  0.59390132
11:      US 2000 -0.62645381
12:      US 2000  1.51178117
13:      US 2002 -0.83562861
14:      US 2002 -0.62124058
15:      US 2008  0.57578135
16:      US 2008  0.82122120

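【Comments】:

I think it has to be year-2 and year+2 (since foverlaps works on closed intervals)? Now I see the reason for your earlier question about foverlaps. Will fix. Thanks.
Minor note: if the keys of both tables are set appropriately, by.x and by.y are not needed.
by.x and by.y removed. y-3 and year+3 should correctly include the relevant years.
Again, I don't think so. The condition is df1$year - 3 < df2$year < df1$year+3, i.e. the open interval (df1$year-3, df1$year+3). But foverlaps assumes closed intervals, [df1$year-3, df1$year+3], and would therefore also match on the boundaries. I don't think that is what is required. Also, IIUC, the foverlaps command has to be foverlaps(df2, df1, type="within", nomatch=0L), which checks whether each df2 interval lies within a df1 interval, with df1 as the lookup table.
Gosh, this is ugly. I thought data.table was supposed to be elegant :) @Arun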
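The call suggested in the comments, foverlaps(df2, df1, type="within", nomatch=0L), can be sketched as follows. This is a hedged illustration rather than the commenter's exact code: renaming df1's year to target_year (to avoid a name clash in the result) and passing by.x explicitly because df2 is left unkeyed are assumptions made for this example.

```r
library(data.table)

set.seed(1)
df1 <- data.table(country = rep(c("US", "UK"), 2),
                  target_year = c(2000L, 2003L, 2009L, 2009L))
df2 <- data.table(country = rep(c("US", "UK"), 10),
                  year = rep(2000:2009, 2),
                  myvar = rnorm(20))

# For integer years the closed window [target - 2, target + 2] is
# equivalent to the open window (target - 3, target + 3).
df1[, `:=`(start = target_year - 2L, end = target_year + 2L)]
setkey(df1, country, start, end)  # the lookup table must be keyed

# Represent every df2 year as the zero-width interval [year, year].
df2[, year2 := year]

# Keep the df2 intervals that lie within a df1 window of the same country.
res <- foverlaps(df2, df1,
                 by.x = c("country", "year", "year2"),
                 type = "within", nomatch = 0L)
res[, .(country, year, myvar)]
```

With the four-row df1 used in this answer, the result has the same 16 rows as the output shown above, possibly in a different order.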

【Solution 3】:

A possibly simpler solution using data.table

library(data.table) # v1.9.7 (devel version)
# go here for install instructions
# https://github.com/Rdatatable/data.table/wiki/Installation

# convert datasets into data.table
setDT(df1)
setDT(df2)

# create conditional columns in df1
df1[, yearplus3 := year + 3][, yearminus3 := year - 3]

# merge
output <- df1[df2, on = .(country = country,               # condition 1
                          yearminus3 < year,               # condition 2
                          yearplus3  > year), nomatch = 0, # condition 3
              .(country, year, myvar)]  # indicate columns in the output


output 
 >   country year       myvar
 >1:      US 2000 -0.62645381
 >2:      UK 2003  0.18364332
 >3:      US 2000 -0.83562861
 >4:      UK 2003  1.59528080
 >5:      UK 2003 -0.82046838
 >6:      US 2000  1.51178117
 >7:      UK 2003  0.38984324
 >8:      US 2000 -0.62124058

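P.S. Note that as of today (12 May 2016), the non-equi conditions in the on = argument used here are still only available in the development version of data.table.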

【Comments】:

You should probably mention that this is currently under development and subject to change (the non-equi join part).
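For reference, a sketch of the same non-equi join, assuming a data.table release that ships non-equi joins (1.9.8 or later, as far as I know). Here df2 is the outer table so that x.year can return the matched year from df2 instead of the boundary value; the names lo, hi, target_year and res are illustrative.

```r
library(data.table)

set.seed(1)
df1 <- data.table(country = c("US", "UK"),
                  year = c(2000L, 2003L))
df2 <- data.table(country = rep(c("US", "UK"), 10),
                  year = rep(2000:2009, 2),
                  myvar = rnorm(20))

# Open window (year - 3, year + 3) around each country-year of interest.
df1[, `:=`(lo = year - 3L, hi = year + 3L)]

# In on =, the left side of each condition is a df2 column and the right
# side is a df1 column. x.year is df2's year, i.year is df1's year.
res <- df2[df1,
           .(country, target_year = i.year, year = x.year, myvar),
           on = .(country, year > lo, year < hi),
           nomatch = 0L]
res
```

Selecting x.year explicitly matters because, in a non-equi join, a plain year in the result would carry the boundary value from df1 rather than the year observed in df2.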