【Question title】: Alternative to a FOR loop when comparing two data frames (takes too much time)
【Posted】: 2020-09-21 21:13:31
【Question】:

Time (data frame):



CentralTime                        Batch Id
2020-04-01 03:46:01 UTC
2020-04-01 10:46:01 UTC
2020-04-01 10:54:18 UTC
2020-04-01 10:54:25 UTC
2020-04-01 10:54:31 UTC
2020-04-01 10:55:06 UTC
2020-04-01 10:55:12 UTC
2020-04-01 10:55:26 UTC
2020-04-01 10:55:32 UTC
2020-04-01 10:55:39 UTC
2020-04-01 10:55:45 UTC
2020-04-01 10:56:20 UTC
2020-04-01 10:56:26 UTC
2020-04-01 10:56:33 UTC
2020-04-01 10:56:39 UTC
2020-04-01 10:56:53 UTC
2020-04-01 10:56:59 UTC
2020-04-01 10:57:06 UTC
2020-04-01 10:57:14 UTC
2020-04-01 10:57:20 UTC
2020-04-01 11:37:20 UTC
2020-04-01 11:38:27 UTC
2020-04-01 11:38:33 UTC
2020-04-01 11:38:47 UTC
2020-04-01 11:38:53 UTC
2020-04-01 11:39:15 UTC
2020-04-01 11:39:27 UTC
2020-04-01 11:39:41 UTC
2020-04-01 11:39:47 UTC
2020-04-01 11:39:54 UTC
2020-04-01 11:40:00 UTC
2020-04-01 11:40:28 UTC
2020-04-01 17:30:28 UTC
2020-04-01 17:36:18 UTC
2020-04-02 00:26:18 UTC
2020-04-02 00:28:46 UTC
2020-04-02 00:29:20 UTC
2020-04-02 00:29:28 UTC
2020-04-02 00:29:34 UTC
2020-04-02 00:29:41 UTC
2020-04-02 00:29:47 UTC
2020-04-02 00:30:01 UTC
2020-04-02 00:30:07 UTC
2020-04-02 00:30:21 UTC
2020-04-02 00:30:27 UTC
2020-04-02 00:30:35 UTC
2020-04-02 00:30:42 UTC
2020-04-02 00:30:48 UTC
2020-04-02 00:30:55 UTC
2020-04-02 00:31:01 UTC
2020-04-02 00:31:15 UTC

BatchId (data frame):

Batch Id         dateTime               nextDate
ABC053272A  2020-04-01 00:00:48 UTC 2020-04-02 00:29:47 UTC
ABC053314A  2020-04-02 00:29:47 UTC 2020-04-03 00:12:58 UTC
ABC053330A  2020-04-03 00:12:58 UTC 2020-04-04 01:16:54 UTC
ABC053355A  2020-04-04 01:16:54 UTC 2020-04-07 00:33:57 UTC
ABC053405A  2020-04-07 00:33:57 UTC 2020-04-08 00:46:47 UTC
ABC053421A  2020-04-08 00:46:47 UTC 2020-04-09 00:36:56 UTC
ABC053447A  2020-04-09 00:36:56 UTC 2020-04-10 01:26:55 UTC
ABC053462A  2020-04-10 01:26:55 UTC 2020-04-13 08:13:50 UTC
ABC053470   2020-04-13 08:13:50 UTC 2020-04-14 10:07:56 UTC
ABC053496A  2020-04-14 10:07:56 UTC 2020-04-15 11:08:59 UTC
ABC053520A  2020-04-15 11:08:59 UTC 2020-04-16 17:51:28 UTC
ABC053553A  2020-04-16 17:51:28 UTC 2020-04-20 04:24:53 UTC
ABC053611A  2020-04-20 04:24:53 UTC 2020-04-22 00:09:56 UTC
ABC053652A  2020-04-22 00:09:56 UTC 2020-04-22 12:05:49 UTC
ABC053652B  2020-04-22 12:05:49 UTC 2020-04-23 14:12:53 UTC
ABC053686   2020-04-23 14:12:53 UTC 2020-04-24 12:14:55 UTC
ABC053694A  2020-04-24 12:14:55 UTC 2020-04-28 00:08:59 UTC
ABC053710A  2020-04-28 00:08:59 UTC 2020-04-29 00:34:56 UTC
ABC053769A  2020-04-29 00:34:56 UTC 2020-04-30 00:59:58 UTC
ABC053793A  2020-04-30 00:59:58 UTC 2020-05-01 00:41:54 UTC
ABC053827A  2020-05-01 00:41:54 UTC 2020-05-05 00:53:55 UTC
ABC053876A  2020-05-05 00:53:55 UTC 2020-05-06 04:10:55 UTC
ABC053892A  2020-05-06 04:10:55 UTC 2020-05-07 06:22:56 UTC
ABC053918A  2020-05-07 06:22:56 UTC 2020-05-08 06:02:55 UTC
ABC053942A  2020-05-08 06:02:55 UTC 2020-05-11 06:43:42 UTC
ABC053967A  2020-05-11 06:43:42 UTC 2020-05-12 07:01:57 UTC
ABC053991A  2020-05-12 07:01:57 UTC 2020-05-13 05:08:47 UTC
ABC054007A  2020-05-13 05:08:47 UTC 2020-05-14 03:36:55 UTC
ABC054023A  2020-05-14 03:36:55 UTC 2020-05-15 02:32:58 UTC
ABC054064A  2020-05-15 02:32:58 UTC 2020-05-18 04:32:57 UTC

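For each row of Time, I am trying to get the value of the Batch Id column (BatchId data frame), based on whether CentralTime (Time data frame) lies between dateTime and nextDate (both in the BatchId data frame).

I am using a 'for' loop to get these values, but it takes too much time, so I am looking for an alternative. I have posted only a subset of my data. The code is below.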

if(nrow(BatchId)!=0){
  for(i in 1:nrow(Time)){
    for(j in 1:nrow(BatchId)){
      if (Time[i,"CentralTime"] < BatchId[j,"nextDate"] & 
            Time[i,"CentralTime"]> BatchId[j,"dateTime"]) {
        Time[i,"batchId"]<-BatchId[j,"Batch Id"]
      }
    }
  }
}

【Comments】:

  • Your sample data produces zero matches.
  • @r2evans: Time (df) has been edited.

Tags: r


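【Answer 1】:

for loops are rarely a necessary (or even desirable) way to solve problems in R, and this is no exception. In fact, this calls for a "non-equi" join. Base R does not support this, and while dplyr does when joining through dbplyr::sql_on, I suggest the data.table approach:

I'll create my own Time so that we can see some matches: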

library(data.table)
Time <- data.frame(CentralTime = BatchId$dateTime[3] + c(0, 1000, 3000, 9000))
Time
#            CentralTime
# 1: 2020-04-03 00:12:58
# 2: 2020-04-03 00:29:38
# 3: 2020-04-03 01:02:58
# 4: 2020-04-03 02:42:58

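I'm assuming that neither frame is of class data.table, so I'll be a little careful. (If you are already using data.table, you probably know what can be removed from this code. If not, then (1) data.table operates in place, unlike R's default copy-on-write semantics; (2) doing so requires an additional attribute (a memory address) that must be set before data.table operators can work on the frame; and setDT and setDF convert to and from that format, respectively. If you are not otherwise using data.table, I recommend against leaving the frame as a data.table-class frame, because some base R frame behaviors do change.)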

library(data.table)
setDT(BatchId)
setDT(Time)
out <- BatchId[Time, on = .(dateTime <= CentralTime, nextDate >= CentralTime)]
out <- out[, .(CentralTime = dateTime, BatchId)]
setDF(out)
out
#           CentralTime    BatchId
# 1 2020-04-03 00:12:58 ABC053314A
# 2 2020-04-03 00:12:58 ABC053330A
# 3 2020-04-03 00:29:38 ABC053330A
# 4 2020-04-03 01:02:58 ABC053330A
# 5 2020-04-03 02:42:58 ABC053330A
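
As a small illustration of the in-place conversion described earlier, here is a minimal sketch on a toy frame. The name `df` and its columns are hypothetical and exist only for this example; the point is that setDT and setDF change the class of the existing object without an assignment.

```r
library(data.table)

# Toy frame; `df`, `id`, `value` and `double` are hypothetical names for this sketch.
df <- data.frame(id = 1:3, value = c(10, 20, 30))
class(df)                  # "data.frame"

setDT(df)                  # converts in place: note there is no `df <- ...`
class(df)                  # "data.table" "data.frame"

df[, double := value * 2]  # `:=` adds a column by reference, again without assignment
names(df)

setDF(df)                  # converts back in place
class(df)                  # "data.frame"
```

The column added with `:=` survives the conversion back, which is why the answer wraps its join in setDT/setDF rather than keeping the frames as data.table objects.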

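A few notes on how data.table merges:

  • DT1[DT2, on = ...] is a left join with DT2 as the left table: every row of DT2 is kept. Setting non-equi joins aside for the moment, it is similar to

    ### base R
    merge(DT2, DT1, all.x = TRUE, ...)
    
    ### dplyr
    right_join(DT1, DT2, ...)
    left_join(DT2, DT1, ...)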
    
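  • The time field of the "left" frame (Time, which is DT2 in the example above) takes the name of the first of the non-equi fields used in the other frame, so if you look at the join result immediately after the join (before the column selection), it has the columns BatchId, dateTime (even though its values are not necessarily equal to anything in BatchId$dateTime, which is confusing), and nextDate.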
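
One way to sidestep the renaming described in the notes above is to select columns explicitly inside the join with data.table's `x.` and `i.` prefixes, which refer to the columns of the outer and inner table respectively. The sketch below uses small hypothetical frames (`batches`, `times`) and hypothetical output names (`batchStart`, `batchEnd`); it is not part of the original answer.

```r
library(data.table)

# Hypothetical toy data for illustration only.
batches <- data.table(
  BatchId  = c("A", "B"),
  dateTime = as.POSIXct(c("2020-04-01 00:00:00", "2020-04-02 00:00:00"), tz = "UTC"),
  nextDate = as.POSIXct(c("2020-04-02 00:00:00", "2020-04-03 00:00:00"), tz = "UTC")
)
times <- data.table(
  CentralTime = as.POSIXct(c("2020-04-01 12:00:00", "2020-04-02 06:00:00"), tz = "UTC")
)

# i.<col> is the value from `times`; x.<col> is the original value from `batches`.
res <- batches[times,
  on = .(dateTime <= CentralTime, nextDate > CentralTime),
  .(CentralTime = i.CentralTime, BatchId,
    batchStart = x.dateTime, batchEnd = x.nextDate)]
res
```

With this form the result keeps both the looked-up time and the true interval endpoints, so there is no need to rename dateTime afterwards.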

Also, and this is not unique to data.table, this join produces two rows for one of the times, because the ids ABC053314A and ABC053330A overlap:

subset(BatchId, BatchId %in% c("ABC053314A", "ABC053330A"))
#       BatchId            dateTime            nextDate
# 1: ABC053314A 2020-04-02 00:29:47 2020-04-03 00:12:58
# 2: ABC053330A 2020-04-03 00:12:58 2020-04-04 01:16:54

a <- subset(BatchId, BatchId %in% c("ABC053314A", "ABC053330A"))
a$nextDate[1] == a$dateTime[2]
# [1] TRUE

(They may not always be exactly equal, since they are really floating-point numbers.)
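
To make the floating-point remark concrete, here is a minimal base R sketch. The offset of 1e-4 seconds and the 0.001-second tolerance are arbitrary choices for illustration, not values from the question.

```r
t1 <- as.POSIXct("2020-04-03 00:12:58", tz = "UTC")
t2 <- t1 + 1e-4   # a tenth of a millisecond later (arbitrary offset)

unclass(t1)       # the underlying number: seconds since 1970-01-01 UTC

same_print <- format(t1) == format(t2)  # default formatting hides the fraction
same_value <- t1 == t2                  # exact comparison sees the difference
close_enough <- abs(as.numeric(t2) - as.numeric(t1)) < 0.001  # tolerance check

c(same_print = same_print, same_value = same_value, close_enough = close_enough)
```

Two timestamps that print identically can therefore still fail an `==` test, which matters when interval endpoints are expected to line up exactly.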

If you make one of the inequalities strict, the boundary time no longer matches two batches:

setDT(BatchId)
setDT(Time)
out <- BatchId[Time, on = .(dateTime <= CentralTime, nextDate > CentralTime)]
out <- out[, .(CentralTime = dateTime, BatchId)]
setDF(out)
out
#           CentralTime    BatchId
# 1 2020-04-03 00:12:58 ABC053330A
# 2 2020-04-03 00:29:38 ABC053330A
# 3 2020-04-03 01:02:58 ABC053330A
# 4 2020-04-03 02:42:58 ABC053330A

### cleanup
setDF(BatchId)
setDF(Time)
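
If data.table is not an option, a different technique can cover the same half-open case (dateTime <= CentralTime < nextDate) in base R: `findInterval`. This is a sketch under a loudly stated assumption, namely that the batches are sorted and contiguous (each nextDate equals the following dateTime), which holds for the posted data but may not hold in general. The frame names `batches` and `times` are hypothetical; the values are three rows taken from the BatchId data.

```r
batches <- data.frame(
  BatchId  = c("ABC053314A", "ABC053330A", "ABC053355A"),
  dateTime = as.POSIXct(c("2020-04-02 00:29:47", "2020-04-03 00:12:58",
                          "2020-04-04 01:16:54"), tz = "UTC"),
  nextDate = as.POSIXct(c("2020-04-03 00:12:58", "2020-04-04 01:16:54",
                          "2020-04-07 00:33:57"), tz = "UTC")
)
times <- data.frame(
  CentralTime = as.POSIXct("2020-04-03 00:12:58", tz = "UTC") + c(0, 1000, 3000, 9000)
)

# Check the assumption: sorted starts, and each interval ends where the next begins.
stopifnot(!is.unsorted(as.numeric(batches$dateTime)),
          all(batches$nextDate[-nrow(batches)] == batches$dateTime[-1]))

# Index of the last dateTime that is <= each CentralTime (0 if before the first batch).
idx <- findInterval(as.numeric(times$CentralTime), as.numeric(batches$dateTime))

# Times before the first batch or at/after the last nextDate have no batch.
idx[idx == 0 | times$CentralTime >= max(batches$nextDate)] <- NA
times$BatchId <- batches$BatchId[idx]
times
```

Because `findInterval` does a binary search, this avoids both loops, but it silently gives wrong answers if the intervals have gaps or overlaps, so the assumption check is worth keeping.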

Data:

BatchId <- read.table(header = TRUE, sep = "|", strip.white = TRUE, text = "
BatchId    |      dateTime           |     nextDate
ABC053272A | 2020-04-01 00:00:48 UTC | 2020-04-02 00:29:47 UTC
ABC053314A | 2020-04-02 00:29:47 UTC | 2020-04-03 00:12:58 UTC
ABC053330A | 2020-04-03 00:12:58 UTC | 2020-04-04 01:16:54 UTC
ABC053355A | 2020-04-04 01:16:54 UTC | 2020-04-07 00:33:57 UTC
ABC053405A | 2020-04-07 00:33:57 UTC | 2020-04-08 00:46:47 UTC
ABC053421A | 2020-04-08 00:46:47 UTC | 2020-04-09 00:36:56 UTC
ABC053447A | 2020-04-09 00:36:56 UTC | 2020-04-10 01:26:55 UTC
ABC053462A | 2020-04-10 01:26:55 UTC | 2020-04-13 08:13:50 UTC
ABC053470  | 2020-04-13 08:13:50 UTC | 2020-04-14 10:07:56 UTC
ABC053496A | 2020-04-14 10:07:56 UTC | 2020-04-15 11:08:59 UTC
ABC053520A | 2020-04-15 11:08:59 UTC | 2020-04-16 17:51:28 UTC
ABC053553A | 2020-04-16 17:51:28 UTC | 2020-04-20 04:24:53 UTC
ABC053611A | 2020-04-20 04:24:53 UTC | 2020-04-22 00:09:56 UTC
ABC053652A | 2020-04-22 00:09:56 UTC | 2020-04-22 12:05:49 UTC
ABC053652B | 2020-04-22 12:05:49 UTC | 2020-04-23 14:12:53 UTC
ABC053686  | 2020-04-23 14:12:53 UTC | 2020-04-24 12:14:55 UTC
ABC053694A | 2020-04-24 12:14:55 UTC | 2020-04-28 00:08:59 UTC
ABC053710A | 2020-04-28 00:08:59 UTC | 2020-04-29 00:34:56 UTC
ABC053769A | 2020-04-29 00:34:56 UTC | 2020-04-30 00:59:58 UTC
ABC053793A | 2020-04-30 00:59:58 UTC | 2020-05-01 00:41:54 UTC
ABC053827A | 2020-05-01 00:41:54 UTC | 2020-05-05 00:53:55 UTC
ABC053876A | 2020-05-05 00:53:55 UTC | 2020-05-06 04:10:55 UTC
ABC053892A | 2020-05-06 04:10:55 UTC | 2020-05-07 06:22:56 UTC
ABC053918A | 2020-05-07 06:22:56 UTC | 2020-05-08 06:02:55 UTC
ABC053942A | 2020-05-08 06:02:55 UTC | 2020-05-11 06:43:42 UTC
ABC053967A | 2020-05-11 06:43:42 UTC | 2020-05-12 07:01:57 UTC
ABC053991A | 2020-05-12 07:01:57 UTC | 2020-05-13 05:08:47 UTC
ABC054007A | 2020-05-13 05:08:47 UTC | 2020-05-14 03:36:55 UTC
ABC054023A | 2020-05-14 03:36:55 UTC | 2020-05-15 02:32:58 UTC
ABC054064A | 2020-05-15 02:32:58 UTC | 2020-05-18 04:32:57 UTC")
BatchId[c("dateTime", "nextDate")] <-
  lapply(BatchId[c("dateTime", "nextDate")], as.POSIXct, tz = "UTC")

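【Comments】:

  • @r2evans: All rows in the BatchId df are unique.
  • I never said you had duplicate rows in BatchId. I identified a case where two batches share an endpoint (dateTime[n+1] == nextDate[n]), which produced output showing two BatchId values for the same time. That is a different statement (and it has been addressed). Does this give you the expected output?
  • @r2evans: I am not getting the expected output. The CentralTime column in Time (df) will never have the same values as the nextDate and dateTime columns in the BatchId df. With this code, out contains no matching values.
  • So your out is empty? When I use your data it is empty for me too, and that is because your data has no matches.
  • I changed the data in Time (df); please check it.
【Answer 2】:

Ignoring the errors in the question, you can remove one loop entirely. However, what do you expect to happen when more than one row matches the condition?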

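if(nrow(BatchId)!=0){
  for(i in seq_len(nrow(Time))){
    # rows of BatchId whose (dateTime, nextDate) interval contains this time
    idx <- which(Time[i,"CentralTime"] < BatchId[,"nextDate"] & 
                 Time[i,"CentralTime"] > BatchId[,"dateTime"])
    if(length(idx) > 1)
      stop('more than one match what should I do?')
    # no match: record NA instead of failing on a zero-length replacement
    Time[i, 'batchId'] <- if(length(idx) == 1) BatchId[idx, "Batch Id"] else NA
  }
}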

That said, @r2evans's answer is the better option, in terms of both speed and memory use.
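
Before deciding what to do about multiple matches, it can help to count them. The sketch below is one possible way to do that in base R with `outer`, using the same strict comparisons as the loop; it is an illustration, not part of either answer. The frames `batches` and `times` are hypothetical names holding three rows of the posted BatchId data, and note that `outer` builds a matrix of nrow(times) by nrow(batches), so it only suits moderately sized inputs.

```r
batches <- data.frame(
  BatchId  = c("ABC053314A", "ABC053330A", "ABC053355A"),
  dateTime = as.POSIXct(c("2020-04-02 00:29:47", "2020-04-03 00:12:58",
                          "2020-04-04 01:16:54"), tz = "UTC"),
  nextDate = as.POSIXct(c("2020-04-03 00:12:58", "2020-04-04 01:16:54",
                          "2020-04-07 00:33:57"), tz = "UTC")
)
times <- data.frame(
  CentralTime = as.POSIXct("2020-04-03 00:12:58", tz = "UTC") + c(0, 1000, 3000, 9000)
)

ct <- as.numeric(times$CentralTime)

# hits[i, j] is TRUE when time i lies strictly inside batch j's interval.
hits <- outer(ct, as.numeric(batches$dateTime), ">") &
        outer(ct, as.numeric(batches$nextDate), "<")

n_match <- rowSums(hits)   # number of matching batches per time
n_match

# Assign a batch only where the match is unique; leave the rest as NA.
pairs <- which(hits, arr.ind = TRUE)
pairs <- pairs[n_match[pairs[, "row"]] == 1, , drop = FALSE]
times$batchId <- NA_character_
times$batchId[pairs[, "row"]] <- batches$BatchId[pairs[, "col"]]
times
```

With strict comparisons on both sides, the first time (which sits exactly on a shared endpoint) matches no batch at all, which is the mirror image of the double match seen with two non-strict comparisons.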

【Comments】:

  • @Oliver: I am getting an error: "only defined for equally-sized data frames"
  • With your sample data, I cannot reproduce that error. :-)