R - Merging Two Data.Frames with Row-Level Conditional Variables

【Question Title】: R - Merging Two Data.Frames with Row-Level Conditional Variables
【Posted】: 2015-07-29 07:16:29
【Question】:

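Short version: I have a merge operation that is more complicated than usual, and I would like help optimizing it with dplyr or merge. I already have several working solutions, but they run slowly on large datasets, and I am curious whether a faster method exists in R (or, alternatively, in SQL or Python).

I have two data.frames:

an asynchronous event log related to Stores, and
a table that provides more detail about the stores in that log.

The problem: a store ID uniquely identifies a specific location, but ownership of a store location can change from one period to the next (for completeness, no two owners can hold the same store at the same time). So when I merge in the store-level information, I need some kind of condition that picks the store-level record for the correct period.

Reproducible example: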

# asynchronous log. 
#  t for period. 
#  Store for store loc ID
#  var1 just some variable. 
set.seed(1)
df <- data.frame(
  t     = c(1,1,1,2,2,2,3,3,4,4,4),
  Store = c(1,2,3,1,2,3,1,3,1,2,3),
  var1 =  runif(11,0,1)
)

# Store table
# As you can see, lots of store locations opening and closing.
#  StartDate is when this business came into existence
#  Store is the store id from df
#  CloseDate is when this store went out of business
#  storeVar1 is just some important var to merge over
Stores <- data.frame(
  StartDate = c(0,0,0,4,4),
  Store     = c(1,2,3,2,3),
  CloseDate = c(9,2,3,9,9),
  storeVar1 = c("a","b","c","d","e")
)

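Now, I only want to merge information from the Stores data.frame into a log record if that store was operating during the record's period (t). CloseDate and StartDate mark the last and the first period of that business's operation, respectively. (For completeness, though it matters less: a StartDate of 0 means the store already existed before the sample began, and a CloseDate of 9 means the store had not yet gone out of business at that location when the sample ended.)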
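The matching rule described above can be written down directly as an ordinary join on Store followed by a filter on the period. The following is only a minimal base-R sketch of that rule, not one of the solutions from this thread; the names log_small and stores_small are hypothetical stand-ins for df and Stores, and var1 is left out to keep it short. It is also not expected to be fast, since it first pairs every log row with every owner the store ever had.

```r
# Hypothetical small copies of the example data (var1 omitted).
log_small <- data.frame(
  t     = c(1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4),
  Store = c(1, 2, 3, 1, 2, 3, 1, 3, 1, 2, 3)
)
stores_small <- data.frame(
  StartDate = c(0, 0, 0, 4, 4),
  Store     = c(1, 2, 3, 2, 3),
  CloseDate = c(9, 2, 3, 9, 9),
  storeVar1 = c("a", "b", "c", "d", "e"),
  stringsAsFactors = FALSE
)

# Step 1: plain equi-join on Store. Each log row is paired with every
# owner that store ever had, so this intermediate table can get large.
candidates <- merge(log_small, stores_small, by = "Store")

# Step 2: keep only the pairs where the period lies inside the
# owner's [StartDate, CloseDate] interval.
in_range <- candidates$StartDate <= candidates$t &
            candidates$t <= candidates$CloseDate
matched <- candidates[in_range, ]

# Restore the period/store ordering of the original log.
matched <- matched[order(matched$t, matched$Store), ]
```

Unlike a left join, this sketch drops log rows that have no owner in the given period; in the example data every log row has exactly one.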

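One solution relies on split() at the period (t) level followed by a row-bind with dplyr (originally dplyr::rbind_all(), written here as dplyr::bind_rows()), e.g.

# The following seems to do the trick.
complxMerge_v1 <- function(df, Stores, by = "Store") {
  # Split the log by period so that each chunk is joined only against
  # the stores that were open in that period.
  temp <- split(df, df$t)
  for (Period in names(temp)) {
    p <- as.numeric(Period)
    temp[[Period]] <- dplyr::left_join(
      temp[[Period]],
      dplyr::filter(Stores, StartDate <= p & CloseDate >= p),
      by = by
    )
  }
  # bind_rows() replaces rbind_all(), which current dplyr no longer provides
  dplyr::bind_rows(temp)
}
complxMerge_v1(df, Stores, "Store")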

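Functionally, this appears to work (I have not run into any significant errors yet, anyway). However, we are dealing with (increasingly commonly) billions of rows of log data.

If you want to use it for benchmarking, I made a larger reproducible example on sense.io. See here: https://sense.io/economicurtis/r-faster-merging-of-two-data.frames-with-row-level-conditionals

Two questions:

First, is there another way to approach this problem, using similar methods, that runs faster?
Is there a quick and easy solution in SQL or Python (which I am less familiar with, but could rely on if needed)?
Also, can you help me state this problem in a more general, abstract way? Right now I only know how to talk about it in context-specific terms, but I would like to be able to discuss these kinds of problems in more appropriate, general programming or data-manipulation terminology.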

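【Question Discussion】:

When creating a reproducible example with functions that use the random seed, such as runif, please use set.seed.

Tags: python mysql r merge dplyr

【Solution 1】:

In R, you could take a look at the data.table::foverlaps function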

library(data.table)

# Set start and end values in `df` and key by them  and by  `Store`
setDT(df)[, c("StartDate", "CloseDate") := list(t, t)]      
setkey(df, Store, StartDate, CloseDate)

# Run `foverlaps` function
foverlaps(setDT(Stores), df)
#     Store t       var1 StartDate CloseDate i.StartDate i.CloseDate storeVar1
#  1:     1 1 0.26550866         1         1           0           9         a
#  2:     1 2 0.90820779         2         2           0           9         a
#  3:     1 3 0.94467527         3         3           0           9         a
#  4:     1 4 0.62911404         4         4           0           9         a
#  5:     2 1 0.37212390         1         1           0           2         b
#  6:     2 2 0.20168193         2         2           0           2         b
#  7:     3 1 0.57285336         1         1           0           3         c
#  8:     3 2 0.89838968         2         2           0           3         c
#  9:     3 3 0.66079779         3         3           0           3         c
# 10:     2 4 0.06178627         4         4           4           9         d
# 11:     3 4 0.20597457         4         4           4           9         e
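For comparison, later data.table releases (1.9.8 onward, to the best of my knowledge, so newer than the 1.9.4 mentioned in the discussion below) also accept inequality conditions in `on`, which lets the interval condition be stated without adding helper columns to the log. This is a hedged sketch of that alternative and not part of the original answer; log_dt and stores_dt are hypothetical names for small copies of the example tables.

```r
library(data.table)

# Hypothetical small copies of the example data (var1 omitted).
log_dt <- data.table(
  t     = c(1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4),
  Store = c(1, 2, 3, 1, 2, 3, 1, 3, 1, 2, 3)
)
stores_dt <- data.table(
  StartDate = c(0, 0, 0, 4, 4),
  Store     = c(1, 2, 3, 2, 3),
  CloseDate = c(9, 2, 3, 9, 9),
  storeVar1 = c("a", "b", "c", "d", "e")
)

# Non-equi update join: for each row of stores_dt, find the log rows
# with the same Store whose t lies in [StartDate, CloseDate], and copy
# storeVar1 onto them by reference. The row order of log_dt is unchanged.
log_dt[stores_dt,
       on = .(Store, t >= StartDate, t <= CloseDate),
       storeVar1 := i.storeVar1]
```

Because the assignment is by reference, log rows without a matching owner simply keep NA in storeVar1. If two owners overlapped for the same store and period (which the question rules out), the later row of stores_dt would overwrite the earlier one.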

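【Discussion】:

Hmm. On the line setDT(df)[, c("StartDate", "CloseDate") := .(t, t)] I get an error message, and sorry, it is not clear to me what the problem is: Error in `[.data.table`(setDT(df), , `:=`(c("StartDate", "CloseDate"), : RHS of assignment is not NULL, not an an atomic vector (see ?is.atomic) and not a list column.
What version of data.table do you have? Does setDT(df)[, c("StartDate", "CloseDate") := list(t, t)] work?
That is the line causing the error. The data.table version I am using is 1.9.4 (data.table_1.9.4). I also reproduced the same error at that sense.io link.
No, it's data.table syntax
Hmm, I wonder why setDT(df)[, c("StartDate", "CloseDate") := .(t, t)] produced the same error in both sessions I tested? As for your alternatives, setDT(df)[, c("StartDate", "CloseDate") := list(t, t)] or setDT(df)[, ":="(StartDate = t, CloseDate = t)], they are only 126 times faster than the approach I suggested. Thanks!

【Solution 2】:

You can transform your Stores data.frame by adding a t column that holds every value of t for which a given Stores row is valid, and then use the unnest function from Hadley's tidyr package to convert it to "long" form.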

require("tidyr")
require("dplyr")

complxMerge_v2 <- function(df, Stores, by = NULL)    {
  Stores %>% mutate(., t = lapply(1:nrow(.), 
                                  function(ii) (.)[ii, "StartDate"]:(.)[ii, "CloseDate"]))%>%
    unnest(t) %>% left_join(df, ., by = by)
}

complxMerge_v2(df, Stores)
# Joining by: c("t", "Store")
#    t Store       var1 StartDate CloseDate storeVar1
# 1  1     1 0.26550866         0         9         a
# 2  1     2 0.37212390         0         2         b
# 3  1     3 0.57285336         0         3         c
# 4  2     1 0.90820779         0         9         a
# 5  2     2 0.20168193         0         2         b
# 6  2     3 0.89838968         0         3         c
# 7  3     1 0.94467527         0         9         a
# 8  3     3 0.66079779         0         3         c
# 9  4     1 0.62911404         0         9         a
# 10 4     2 0.06178627         4         9         d
# 11 4     3 0.20597457         4         9         e

require("microbenchmark")
# I've downloaded your large data samples
df <- read.csv("./df.csv")
Stores <- read.csv("./Stores.csv")

microbenchmark(complxMerge_v1(df, Stores), complxMerge_v2(df, Stores), times = 10L)

# Unit: milliseconds
#                       expr      min       lq      mean    median        uq       max neval
# complxMerge_v1(df, Stores) 9501.217 9623.754 9712.8689 9681.3808 9816.8984 9886.5962    10
# complxMerge_v2(df, Stores)  532.744  539.743  567.7207  561.9635  588.0637  636.5775    10

Here are the step-by-step results, to make the process clear.

Stores_with_t <- 
  Stores %>% mutate(., t = lapply(1:nrow(.), 
                                  function(ii) (.)[ii, "StartDate"]:(.)[ii, "CloseDate"]))
#   StartDate Store CloseDate storeVar1                            t
# 1         0     1         9         a 0, 1, 2, 3, 4, 5, 6, 7, 8, 9
# 2         0     2         2         b                      0, 1, 2
# 3         0     3         3         c                   0, 1, 2, 3
# 4         4     2         9         d             4, 5, 6, 7, 8, 9
# 5         4     3         9         e             4, 5, 6, 7, 8, 9

# After that `unnest(t)`

Stores_with_t_unnest <- 
  Stores_with_t %>% unnest(t)
#    StartDate Store CloseDate storeVar1 t
# 1          0     1         9         a 0
# 2          0     1         9         a 1
# 3          0     1         9         a 2
# 4          0     1         9         a 3
# 5          0     1         9         a 4
# 6          0     1         9         a 5
# 7          0     1         9         a 6
# 8          0     1         9         a 7
# 9          0     1         9         a 8
# 10         0     1         9         a 9
# 11         0     2         2         b 0
# 12         0     2         2         b 1
# 13         0     2         2         b 2
# 14         0     3         3         c 0
# 15         0     3         3         c 1
# 16         0     3         3         c 2
# 17         0     3         3         c 3
# 18         4     2         9         d 4
# 19         4     2         9         d 5
# 20         4     2         9         d 6
# 21         4     2         9         d 7
# 22         4     2         9         d 8
# 23         4     2         9         d 9
# 24         4     3         9         e 4
# 25         4     3         9         e 5
# 26         4     3         9         e 6
# 27         4     3         9         e 7
# 28         4     3         9         e 8
# 29         4     3         9         e 9

# And then simple `left_join`

left_join(df, Stores_with_t_unnest)
# Joining by: c("t", "Store")
#    t Store       var1 StartDate CloseDate storeVar1
# 1  1     1 0.26550866         0         9         a
# 2  1     2 0.37212390         0         2         b
# 3  1     3 0.57285336         0         3         c
# 4  2     1 0.90820779         0         9         a
# 5  2     2 0.20168193         0         2         b
# 6  2     3 0.89838968         0         3         c
# 7  3     1 0.94467527         0         9         a
# 8  3     3 0.66079779         0         3         c
# 9  4     1 0.62911404         0         9         a
# 10 4     2 0.06178627         4         9         d
# 11 4     3 0.20597457         4         9         e
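The same expand-then-join idea can also be sketched in base R, which may help show what the mutate/unnest step does. This is an illustration under stated assumptions rather than the answerer's code: it assumes StartDate and CloseDate are whole-number periods (the expansion makes no sense for continuous timestamps), and log_small, stores_small and stores_long are hypothetical names.

```r
# Hypothetical small copies of the example data (var1 omitted).
log_small <- data.frame(
  t     = c(1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4),
  Store = c(1, 2, 3, 1, 2, 3, 1, 3, 1, 2, 3)
)
stores_small <- data.frame(
  StartDate = c(0, 0, 0, 4, 4),
  Store     = c(1, 2, 3, 2, 3),
  CloseDate = c(9, 2, 3, 9, 9),
  storeVar1 = c("a", "b", "c", "d", "e"),
  stringsAsFactors = FALSE
)

# Number of periods covered by each ownership interval (inclusive).
n_periods <- stores_small$CloseDate - stores_small$StartDate + 1

# Repeat each Stores row once per covered period, then attach the period.
stores_long <- stores_small[rep(seq_len(nrow(stores_small)), n_periods), ]
stores_long$t <- unlist(Map(seq, stores_small$StartDate, stores_small$CloseDate))

# With one row per (Store, t), the conditional merge becomes a plain
# left join on both columns.
merged <- merge(log_small, stores_long, by = c("t", "Store"), all.x = TRUE)
```

The trade-off is memory: the expanded table has one row per store per open period, so long intervals or fine-grained periods can make it much larger than Stores itself.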
