Alternative to a FOR loop when comparing two data frames (takes too much time): answers

【Question title】: alternate to FOR loop in comparing two data frames (takes too much time)
【Posted】: 2020-09-21 21:13:31
【Question】:

Time (data frame):



CentralTime                        Batch Id
2020-04-01 03:46:01 UTC
2020-04-01 10:46:01 UTC
2020-04-01 10:54:18 UTC
2020-04-01 10:54:25 UTC
2020-04-01 10:54:31 UTC
2020-04-01 10:55:06 UTC
2020-04-01 10:55:12 UTC
2020-04-01 10:55:26 UTC
2020-04-01 10:55:32 UTC
2020-04-01 10:55:39 UTC
2020-04-01 10:55:45 UTC
2020-04-01 10:56:20 UTC
2020-04-01 10:56:26 UTC
2020-04-01 10:56:33 UTC
2020-04-01 10:56:39 UTC
2020-04-01 10:56:53 UTC
2020-04-01 10:56:59 UTC
2020-04-01 10:57:06 UTC
2020-04-01 10:57:14 UTC
2020-04-01 10:57:20 UTC
2020-04-01 11:37:20 UTC
2020-04-01 11:38:27 UTC
2020-04-01 11:38:33 UTC
2020-04-01 11:38:47 UTC
2020-04-01 11:38:53 UTC
2020-04-01 11:39:15 UTC
2020-04-01 11:39:27 UTC
2020-04-01 11:39:41 UTC
2020-04-01 11:39:47 UTC
2020-04-01 11:39:54 UTC
2020-04-01 11:40:00 UTC
2020-04-01 11:40:28 UTC
2020-04-01 17:30:28 UTC
2020-04-01 17:36:18 UTC
2020-04-02 00:26:18 UTC
2020-04-02 00:28:46 UTC
2020-04-02 00:29:20 UTC
2020-04-02 00:29:28 UTC
2020-04-02 00:29:34 UTC
2020-04-02 00:29:41 UTC
2020-04-02 00:29:47 UTC
2020-04-02 00:30:01 UTC
2020-04-02 00:30:07 UTC
2020-04-02 00:30:21 UTC
2020-04-02 00:30:27 UTC
2020-04-02 00:30:35 UTC
2020-04-02 00:30:42 UTC
2020-04-02 00:30:48 UTC
2020-04-02 00:30:55 UTC
2020-04-02 00:31:01 UTC
2020-04-02 00:31:15 UTC

BatchId (data frame):

Batch Id         dateTime               nextDate
ABC053272A  2020-04-01 00:00:48 UTC 2020-04-02 00:29:47 UTC
ABC053314A  2020-04-02 00:29:47 UTC 2020-04-03 00:12:58 UTC
ABC053330A  2020-04-03 00:12:58 UTC 2020-04-04 01:16:54 UTC
ABC053355A  2020-04-04 01:16:54 UTC 2020-04-07 00:33:57 UTC
ABC053405A  2020-04-07 00:33:57 UTC 2020-04-08 00:46:47 UTC
ABC053421A  2020-04-08 00:46:47 UTC 2020-04-09 00:36:56 UTC
ABC053447A  2020-04-09 00:36:56 UTC 2020-04-10 01:26:55 UTC
ABC053462A  2020-04-10 01:26:55 UTC 2020-04-13 08:13:50 UTC
ABC053470   2020-04-13 08:13:50 UTC 2020-04-14 10:07:56 UTC
ABC053496A  2020-04-14 10:07:56 UTC 2020-04-15 11:08:59 UTC
ABC053520A  2020-04-15 11:08:59 UTC 2020-04-16 17:51:28 UTC
ABC053553A  2020-04-16 17:51:28 UTC 2020-04-20 04:24:53 UTC
ABC053611A  2020-04-20 04:24:53 UTC 2020-04-22 00:09:56 UTC
ABC053652A  2020-04-22 00:09:56 UTC 2020-04-22 12:05:49 UTC
ABC053652B  2020-04-22 12:05:49 UTC 2020-04-23 14:12:53 UTC
ABC053686   2020-04-23 14:12:53 UTC 2020-04-24 12:14:55 UTC
ABC053694A  2020-04-24 12:14:55 UTC 2020-04-28 00:08:59 UTC
ABC053710A  2020-04-28 00:08:59 UTC 2020-04-29 00:34:56 UTC
ABC053769A  2020-04-29 00:34:56 UTC 2020-04-30 00:59:58 UTC
ABC053793A  2020-04-30 00:59:58 UTC 2020-05-01 00:41:54 UTC
ABC053827A  2020-05-01 00:41:54 UTC 2020-05-05 00:53:55 UTC
ABC053876A  2020-05-05 00:53:55 UTC 2020-05-06 04:10:55 UTC
ABC053892A  2020-05-06 04:10:55 UTC 2020-05-07 06:22:56 UTC
ABC053918A  2020-05-07 06:22:56 UTC 2020-05-08 06:02:55 UTC
ABC053942A  2020-05-08 06:02:55 UTC 2020-05-11 06:43:42 UTC
ABC053967A  2020-05-11 06:43:42 UTC 2020-05-12 07:01:57 UTC
ABC053991A  2020-05-12 07:01:57 UTC 2020-05-13 05:08:47 UTC
ABC054007A  2020-05-13 05:08:47 UTC 2020-05-14 03:36:55 UTC
ABC054023A  2020-05-14 03:36:55 UTC 2020-05-15 02:32:58 UTC
ABC054064A  2020-05-15 02:32:58 UTC 2020-05-18 04:32:57 UTC

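I am trying to fill in the Batch Id column from the BatchId data frame, based on whether CentralTime (Time data frame) falls between dateTime and nextDate (both in the BatchId data frame).

I am using a "for" loop to get these values, but it takes too much time, so I am looking for an alternative. I have posted only a subset of my data. The code is below.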

if(nrow(BatchId)!=0){
  for(i in 1:nrow(Time)){
    for(j in 1:nrow(BatchId)){
      if (Time[i,"CentralTime"] < BatchId[j,"nextDate"] & 
            Time[i,"CentralTime"]> BatchId[j,"dateTime"]) {
        Time[i,"batchId"]<-BatchId[j,"Batch Id"]
      }
    }
  }
}

【Comments】:

Your sample data produces zero matches.
@r2evans - Time (df) has been edited.

Tags: r

【Solution 1】:

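A double for loop is rarely the necessary (or even desirable) way to solve a problem in R, and this is no exception. What this really calls for is a "non-equi" join. Base R does not support that, and while dplyr does when combined with dbplyr::sql_on, I suggest the data.table approach:

I'll create my own Time so that we can see some matches: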

library(data.table)
Time <- data.frame(CentralTime = BatchId$dateTime[3] + c(0, 1000, 3000, 9000))
Time
#            CentralTime
# 1: 2020-04-03 00:12:58
# 2: 2020-04-03 00:29:38
# 3: 2020-04-03 01:02:58
# 4: 2020-04-03 02:42:58

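I'm assuming that neither frame is of class data.table, so I'll be a bit careful. (If you are already using data.table, you probably know what can be removed from this code. If not: (1) data.table operates in place, unlike R's default copy-on-write semantics; (2) doing so requires an extra attribute (a memory address) that must be set before data.table operators will work on the frame; and setDT and setDF convert to and from that format, respectively. I recommend against leaving it as a data.table-class frame if you are not otherwise doing data.table things with it, because some base R frame behaviors do change.)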
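To make the by-reference point concrete, here is a minimal sketch on a made-up frame (`df` is hypothetical, not the poster's data). It only illustrates that `setDT` and `setDF` change the class of the existing object instead of returning a copy.

```r
library(data.table)

# Hypothetical toy frame, for illustration only
df <- data.frame(id = 1:3, value = c(10, 20, 30))

setDT(df)                   # converts in place; no assignment needed
is.data.table(df)

df[, doubled := value * 2]  # `:=` adds a column by reference

setDF(df)                   # back to a plain data.frame, also in place
class(df)
names(df)
```

Because nothing is reassigned, any function that receives `df` after `setDT` sees a data.table, which is why the answer converts back with `setDF` once the join is done.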

library(data.table)
setDT(BatchId)
setDT(Time)
out <- BatchId[Time, on = .(dateTime <= CentralTime, nextDate >= CentralTime)]
out <- out[, .(CentralTime = dateTime, BatchId)]
setDF(out)
out
#           CentralTime    BatchId
# 1 2020-04-03 00:12:58 ABC053314A
# 2 2020-04-03 00:12:58 ABC053330A
# 3 2020-04-03 00:29:38 ABC053330A
# 4 2020-04-03 01:02:58 ABC053330A
# 5 2020-04-03 02:42:58 ABC053330A

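Some notes on how data.table does merges:

DT1[DT2, on = ...] is a left join that keeps every row of DT2. Setting non-equi joins aside for a moment, it is similar to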

### base R
merge(DT2, DT1, all.x = TRUE, ...)

### dplyr
right_join(DT1, DT2, ...)
left_join(DT2, DT1, ...)

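The time field in the "left" frame (Time, which is DT2 in the earlier example) is renamed to the first of the non-equi fields used in the other frame. So if you look at out immediately after the join, it has the columns BatchId, dateTime (even though its values are not necessarily equal to anything in BatchId$dateTime ... confusing), and nextDate.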
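One way to sidestep that naming confusion, sketched here on made-up data, is to name the output columns explicitly inside the join with data.table's `x.` and `i.` prefixes (`x.` refers to the frame in front of the bracket, `i.` to the frame inside it). The frames `batches` and `times` are hypothetical stand-ins for `BatchId` and `Time`.

```r
library(data.table)

batches <- data.table(
  BatchId  = c("A", "B"),
  dateTime = as.POSIXct(c("2020-04-01 00:00:00", "2020-04-02 00:00:00"), tz = "UTC"),
  nextDate = as.POSIXct(c("2020-04-02 00:00:00", "2020-04-03 00:00:00"), tz = "UTC")
)
times <- data.table(
  CentralTime = as.POSIXct(c("2020-04-01 12:00:00", "2020-04-02 06:30:00"), tz = "UTC")
)

# Without j: the join columns keep the names dateTime/nextDate
# but carry the CentralTime values
raw <- batches[times, on = .(dateTime <= CentralTime, nextDate > CentralTime)]
names(raw)

# With explicit names: the original interval bounds stay visible
res <- batches[times,
               .(CentralTime = i.CentralTime, BatchId,
                 dateTime = x.dateTime, nextDate = x.nextDate),
               on = .(dateTime <= CentralTime, nextDate > CentralTime)]
res
```

This costs a few extra characters, but the result no longer needs the `CentralTime = dateTime` rename step, and the real interval bounds can be checked directly.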

Also, and this is not unique to data.table, the join produces two rows for one of the times because the ids ABC053314A and ABC053330A overlap at their shared endpoint:

subset(BatchId, BatchId %in% c("ABC053314A", "ABC053330A"))
#       BatchId            dateTime            nextDate
# 1: ABC053314A 2020-04-02 00:29:47 2020-04-03 00:12:58
# 2: ABC053330A 2020-04-03 00:12:58 2020-04-04 01:16:54

a <- subset(BatchId, BatchId %in% c("ABC053314A", "ABC053330A"))
a$nextDate[1] == a$dateTime[2]
# [1] TRUE

(They may not always be exactly equal, since under the hood they are floating-point numbers.)
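If exact equality ever proves unreliable, for example with fractional seconds, one option is to compare within a small tolerance. A minimal sketch with made-up timestamps; the one-millisecond tolerance is an arbitrary assumption, not something from the question.

```r
a <- as.POSIXct("2020-04-03 00:12:58", tz = "UTC")
b <- a + 0.0005   # hypothetical timestamp half a millisecond later

exact_equal  <- a == b                                      # compares the underlying doubles
close_enough <- abs(as.numeric(a) - as.numeric(b)) < 1e-3   # equal within 1 ms

exact_equal
close_enough
```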

If you make one of the inequalities strict, that duplication goes away:

setDT(BatchId)
setDT(Time)
out <- BatchId[Time, on = .(dateTime <= CentralTime, nextDate > CentralTime)]
out <- out[, .(CentralTime = dateTime, BatchId)]
setDF(out)
out
#           CentralTime    BatchId
# 1 2020-04-03 00:12:58 ABC053330A
# 2 2020-04-03 00:29:38 ABC053330A
# 3 2020-04-03 01:02:58 ABC053330A
# 4 2020-04-03 02:42:58 ABC053330A

### cleanup
setDF(BatchId)
setDF(Time)
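For readers who prefer dplyr: to the best of my knowledge, dplyr 1.1.0 and later accept inequality conditions directly through `join_by()`, so the same lookup can be sketched without data.table. This is an alternative technique rather than part of the answer above, it assumes a recent dplyr is installed, and the toy frames `batches` and `times` are hypothetical.

```r
library(dplyr)

batches <- data.frame(
  BatchId  = c("A", "B"),
  dateTime = as.POSIXct(c("2020-04-01 00:00:00", "2020-04-02 00:00:00"), tz = "UTC"),
  nextDate = as.POSIXct(c("2020-04-02 00:00:00", "2020-04-03 00:00:00"), tz = "UTC"),
  stringsAsFactors = FALSE
)
times <- data.frame(
  CentralTime = as.POSIXct(c("2020-04-01 12:00:00", "2020-04-02 00:00:00",
                             "2020-04-05 00:00:00"), tz = "UTC")
)

# Left join keeps every time; times outside all intervals get NA batch columns.
# In join_by(), the left side of each condition refers to `times`,
# the right side to `batches`.
res <- left_join(times, batches,
                 by = join_by(CentralTime >= dateTime, CentralTime < nextDate))
res[, c("CentralTime", "BatchId")]
```

The strict `<` on `nextDate` mirrors the strict-inequality version of the data.table join, so a time that sits exactly on a shared endpoint is assigned to one batch only.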

Data:

BatchId <- read.table(header = TRUE, sep = "|", strip.white = TRUE, text = "
BatchId    |      dateTime           |     nextDate
ABC053272A | 2020-04-01 00:00:48 UTC | 2020-04-02 00:29:47 UTC
ABC053314A | 2020-04-02 00:29:47 UTC | 2020-04-03 00:12:58 UTC
ABC053330A | 2020-04-03 00:12:58 UTC | 2020-04-04 01:16:54 UTC
ABC053355A | 2020-04-04 01:16:54 UTC | 2020-04-07 00:33:57 UTC
ABC053405A | 2020-04-07 00:33:57 UTC | 2020-04-08 00:46:47 UTC
ABC053421A | 2020-04-08 00:46:47 UTC | 2020-04-09 00:36:56 UTC
ABC053447A | 2020-04-09 00:36:56 UTC | 2020-04-10 01:26:55 UTC
ABC053462A | 2020-04-10 01:26:55 UTC | 2020-04-13 08:13:50 UTC
ABC053470  | 2020-04-13 08:13:50 UTC | 2020-04-14 10:07:56 UTC
ABC053496A | 2020-04-14 10:07:56 UTC | 2020-04-15 11:08:59 UTC
ABC053520A | 2020-04-15 11:08:59 UTC | 2020-04-16 17:51:28 UTC
ABC053553A | 2020-04-16 17:51:28 UTC | 2020-04-20 04:24:53 UTC
ABC053611A | 2020-04-20 04:24:53 UTC | 2020-04-22 00:09:56 UTC
ABC053652A | 2020-04-22 00:09:56 UTC | 2020-04-22 12:05:49 UTC
ABC053652B | 2020-04-22 12:05:49 UTC | 2020-04-23 14:12:53 UTC
ABC053686  | 2020-04-23 14:12:53 UTC | 2020-04-24 12:14:55 UTC
ABC053694A | 2020-04-24 12:14:55 UTC | 2020-04-28 00:08:59 UTC
ABC053710A | 2020-04-28 00:08:59 UTC | 2020-04-29 00:34:56 UTC
ABC053769A | 2020-04-29 00:34:56 UTC | 2020-04-30 00:59:58 UTC
ABC053793A | 2020-04-30 00:59:58 UTC | 2020-05-01 00:41:54 UTC
ABC053827A | 2020-05-01 00:41:54 UTC | 2020-05-05 00:53:55 UTC
ABC053876A | 2020-05-05 00:53:55 UTC | 2020-05-06 04:10:55 UTC
ABC053892A | 2020-05-06 04:10:55 UTC | 2020-05-07 06:22:56 UTC
ABC053918A | 2020-05-07 06:22:56 UTC | 2020-05-08 06:02:55 UTC
ABC053942A | 2020-05-08 06:02:55 UTC | 2020-05-11 06:43:42 UTC
ABC053967A | 2020-05-11 06:43:42 UTC | 2020-05-12 07:01:57 UTC
ABC053991A | 2020-05-12 07:01:57 UTC | 2020-05-13 05:08:47 UTC
ABC054007A | 2020-05-13 05:08:47 UTC | 2020-05-14 03:36:55 UTC
ABC054023A | 2020-05-14 03:36:55 UTC | 2020-05-15 02:32:58 UTC
ABC054064A | 2020-05-15 02:32:58 UTC | 2020-05-18 04:32:57 UTC")
BatchId[c("dateTime", "nextDate")] <-
  lapply(BatchId[c("dateTime", "nextDate")], as.POSIXct, tz = "UTC")

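【Comments】:

@r2evans-- All of the rows in the BatchId df are unique.
I never said you had duplicate rows in BatchId. I identified a case where two batches share an endpoint (dateTime[n+1] == nextDate[n]), which produced two output rows for the same time. That is a different statement (and it has been addressed). Does this give you your expected output?
@r2evans-- I am not getting the expected output. The CentralTime column in Time (df) is compared against nextDate and dateTime (columns in the BatchId df), and they will never hold identical values. With this code, out contains no matching values.
So your out is empty? When I use your data it is empty for me too, and that is because your data has no matches.
I have changed the data in Time (df); please check it again.

【Solution 2】:

Setting aside the errors in the question, you can remove one of the loops entirely. The open question is what you expect to happen when more than one row matches.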

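if (nrow(BatchId) != 0) {
  Time[["batchId"]] <- NA_character_   # default when no interval matches
  for (i in seq_len(nrow(Time))) {
    # compare one time against every interval at once; [[ ]] extracts plain
    # vectors, so this also works if the frames are tibbles or data.tables
    idx <- which(Time[["CentralTime"]][i] < BatchId[["nextDate"]] &
                 Time[["CentralTime"]][i] > BatchId[["dateTime"]])
    if (length(idx) > 1)
      stop('more than one match what should I do?')
    if (length(idx) == 1)
      Time[["batchId"]][i] <- as.character(BatchId[["Batch Id"]][idx])
  }
}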

That said, @r2evans's answer is the better choice, for both speed and memory use.
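When the batch intervals are sorted and contiguous (each `nextDate` equals the following `dateTime`, as in the posted data), both loops can also be dropped in base R with `findInterval()`. This is a different technique from either answer, sketched on hypothetical toy frames, and it assumes half-open intervals `dateTime <= CentralTime < nextDate`.

```r
batches <- data.frame(
  BatchId  = c("A", "B"),
  dateTime = as.POSIXct(c("2020-04-01 00:00:00", "2020-04-02 00:00:00"), tz = "UTC"),
  nextDate = as.POSIXct(c("2020-04-02 00:00:00", "2020-04-03 00:00:00"), tz = "UTC"),
  stringsAsFactors = FALSE
)
times <- data.frame(
  CentralTime = as.POSIXct(c("2020-03-31 23:00:00", "2020-04-01 12:00:00",
                             "2020-04-02 00:00:00", "2020-04-05 00:00:00"), tz = "UTC")
)

# Position of the last dateTime that is <= each CentralTime
# (0 means the time is earlier than the first batch)
pos <- findInterval(as.numeric(times$CentralTime), as.numeric(batches$dateTime))
pos[pos == 0] <- NA

# Reject times at or past the end of the interval they landed in
ok <- !is.na(pos) & times$CentralTime < batches$nextDate[pos]

times$batchId <- ifelse(ok, batches$BatchId[pos], NA_character_)
times
```

`findInterval()` uses a binary search over the sorted start times, so it avoids comparing every time against every batch. It does not handle gaps between batches or overlapping intervals beyond the end-of-interval check shown, which is where the join-based approach is more general.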

【Comments】:

@Oliver--- I am getting an error: "only defined for equally-sized data frames"
Using your sample data, I cannot reproduce that error. :-)