How to vectorize a for loop in R with conditionals

【Question title】: How to vectorize a for loop in R with conditionals
【Posted】: 2016-02-15 03:01:15
【Question description】:

I have been struggling with this task for quite a while, so I thought I would ask for your help.

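I am trying to add a new column based on information in df1 as well as information in df2. In df2, an ID column should be created: wherever the location matches between the two data frames and the timestamp in df2 falls within the interval given in df1, it should hold the ID, and otherwise 0. The problem is that the data frames have unequal lengths. I know how to write a nested for loop, but it is ugly and takes forever to run. I tried using sapply, following a solution to a similar question, but it fails because the data frames differ in length.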

I found the thread "Speed up the loop operation in R", but because the data frames my conditions depend on have different lengths, I could not get that solution to work.

Here is my data:

df1 <- structure(list(ID = c(NA, NA, 10035010L), location = c("barge", 
"barge", "barge"), start = structure(c(NA, NA, 
1427301960), class = c("POSIXct", "POSIXt"), tzone = ""), end = structure(c(NA, 
NA, 1437418440), class = c("POSIXct", "POSIXt"), tzone = "")), .Names = c("ID", 
"location", "start", "end"), row.names = c(NA, 3L), class = "data.frame")

df2<-structure(list(time = structure(c(1419062220, 1419063120, 1427325120, 
1427325240, 1427325360, 1427325540, 1427325660, 1427326680, 1427568960, 
1427569320, 1427569500), class = c("POSIXct", "POSIXt"), tzone = ""), 
    location = c("barge", "barge", "barge", 
    "barge", "barge", "barge", "barge", 
    "barge", "barge", "barge", "barge"
    )), row.names = c(222195L, 222196L, 186883L, 186884L, 186885L, 
186886L, 186887L, 186888L, 186930L, 186931L, 186932L), class = "data.frame", .Names = c("time", 
"location"))

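Update: I decided to go with the dplyr package, since I am comfortable with it, and used it on my larger dataset. However, when I include the station ID a problem appears, because the output is inconsistent across locations.

Consider the same, slightly modified dataset that includes the station, to see the difference in the results: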

df3<-structure(list(time = structure(c(1419061860, 1419062220, 1419063120, 
1427325120, 1427325240, 1427325360, 1427325540, 1427325660, 1427326680, 
1427568960, 1427569320), class = c("POSIXct", "POSIXt"), tzone = ""), 
    station = c(104667L, 104667L, 104667L, 124083L, 124083L, 
    124083L, 124083L, 124083L, 124083L, 124083L, 124083L), location = c("barge", 
    "barge", "barge", "barge", "barge", 
    "barge", "barge", "barge", "barge", 
    "barge", "barge")), row.names = 879:889, class = "data.frame", .Names = c("time", "station", "location"))

and

df4<-structure(list(station = c(124083L, 113071L), location = c("barge", 
"barge"), ID = c(10035010L, NA), start = structure(c(1427301960, 
NA), class = c("POSIXct", "POSIXt"), tzone = ""), end = structure(c(1437418440, 
NA), class = c("POSIXct", "POSIXt"), tzone = "")), row.names = 3:4, class = "data.frame", .Names = c("station", 
"location", "ID", "start", "end"))

When I run the dplyr solution,

df3 %>% left_join(., df4) %>%
  mutate(ID = ifelse(time >= start & time < end, ID, 0))

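it does not return the same kind of output: in the first case the returned dataset is a multiple of the original data, while in the second case it has the same length as the original. I just don't understand why the two differ. It makes it impossible to use the filter() function. Any suggestions on how to fix this would be greatly appreciated. Thanks.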

【Question comments】:

Tags: r loops vectorization

【Solution 1】:

You can use dplyr to join the two data frames and then mutate, as follows:

library(dplyr)
df2 %>% left_join(., df1) %>%
  mutate(ID = ifelse(time > start & time < end, 1, 0))

The output is as follows (you can filter out the rows with NA if you like):

                  time location ID               start                 end
1  2014-12-20 02:57:00    barge NA                <NA>                <NA>
2  2014-12-20 02:57:00    barge NA                <NA>                <NA>
3  2014-12-20 02:57:00    barge  0 2015-03-25 12:46:00 2015-07-20 14:54:00
4  2014-12-20 03:12:00    barge NA                <NA>                <NA>
5  2014-12-20 03:12:00    barge NA                <NA>                <NA>
6  2014-12-20 03:12:00    barge  0 2015-03-25 12:46:00 2015-07-20 14:54:00
7  2015-03-25 19:12:00    barge NA                <NA>                <NA>
8  2015-03-25 19:12:00    barge NA                <NA>                <NA>
9  2015-03-25 19:12:00    barge  1 2015-03-25 12:46:00 2015-07-20 14:54:00
10 2015-03-25 19:14:00    barge NA                <NA>                <NA>
11 2015-03-25 19:14:00    barge NA                <NA>                <NA>
12 2015-03-25 19:14:00    barge  1 2015-03-25 12:46:00 2015-07-20 14:54:00
13 2015-03-25 19:16:00    barge NA                <NA>                <NA>
14 2015-03-25 19:16:00    barge NA                <NA>                <NA>
15 2015-03-25 19:16:00    barge  1 2015-03-25 12:46:00 2015-07-20 14:54:00
16 2015-03-25 19:19:00    barge NA                <NA>                <NA>
17 2015-03-25 19:19:00    barge NA                <NA>                <NA>
18 2015-03-25 19:19:00    barge  1 2015-03-25 12:46:00 2015-07-20 14:54:00
19 2015-03-25 19:21:00    barge NA                <NA>                <NA>
20 2015-03-25 19:21:00    barge NA                <NA>                <NA>
21 2015-03-25 19:21:00    barge  1 2015-03-25 12:46:00 2015-07-20 14:54:00
22 2015-03-25 19:38:00    barge NA                <NA>                <NA>
23 2015-03-25 19:38:00    barge NA                <NA>                <NA>
24 2015-03-25 19:38:00    barge  1 2015-03-25 12:46:00 2015-07-20 14:54:00
25 2015-03-28 14:56:00    barge NA                <NA>                <NA>
26 2015-03-28 14:56:00    barge NA                <NA>                <NA>
27 2015-03-28 14:56:00    barge  1 2015-03-25 12:46:00 2015-07-20 14:54:00
28 2015-03-28 15:02:00    barge NA                <NA>                <NA>
29 2015-03-28 15:02:00    barge NA                <NA>                <NA>
30 2015-03-28 15:02:00    barge  1 2015-03-25 12:46:00 2015-07-20 14:54:00
31 2015-03-28 15:05:00    barge NA                <NA>                <NA>
32 2015-03-28 15:05:00    barge NA                <NA>                <NA>
33 2015-03-28 15:05:00    barge  1 2015-03-25 12:46:00 2015-07-20 14:54:00

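【Comments】:

This solution is more intuitive to me, but it takes 2 steps to get the desired result. Thanks.
I found that adding na.omit() filters out the NAs. Great!
It seems that when I add the station ID as a join key, it does not produce the same kind of output as the one you showed. I don't know how to remove the duplicated time rows. Any suggestions? I updated my OP. Thanks.
That is because the join uses all common columns and creates the "missing" columns. You can use the by argument of left_join to specify which columns to join on. In this case you can specify only location, so that it does not join by station number.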
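
To make the last comment concrete, here is a minimal sketch of how the choice of join columns changes the row count. The data frames obs and intervals are hypothetical toy data, not the asker's df3 and df4, and the final mutate is just one possible way to get 0 instead of NA.

```r
library(dplyr)

# Hypothetical toy data (not the asker's data)
obs <- data.frame(
  time     = c(5, 15, 25),
  station  = c(1L, 2L, 2L),
  location = c("barge", "barge", "barge")
)
intervals <- data.frame(
  station  = c(2L, 3L),
  location = c("barge", "barge"),
  ID       = c(100L, NA),
  start    = c(10, NA),
  end      = c(20, NA)
)

# Joining on every shared column (what left_join does when by is omitted):
# each obs row matches at most one interval row, so the row count stays at 3.
by_all <- left_join(obs, intervals, by = c("station", "location"))

# Joining on location only: every obs row pairs with every "barge" interval,
# so the result has 3 * 2 = 6 rows (recent dplyr versions may warn about a
# many-to-many relationship here).
by_loc <- left_join(obs, intervals, by = "location")

# Guarding against NA intervals yields 0 instead of NA for unmatched rows.
by_all <- mutate(by_all,
                 ID = ifelse(!is.na(start) & time >= start & time < end, ID, 0L))

nrow(by_all)
nrow(by_loc)
```

Which variant is right depends on whether an interval is meant to apply to a single station or to the whole location.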

【Solution 2】:

Just the other day I used some old-fashioned SQL code to solve a similar problem. Try this:

library(sqldf)

sqldf('
SELECT 
  df2.*
  ,CASE WHEN df1.location is NOT NULL THEN 1 ELSE 0 END AS id
FROM df2
LEFT JOIN df1 ON df2.time > df1.start AND df2.time < df1.end
  ')

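If you are doing this on a large dataset, I would avoid the dplyr code above, because the join becomes Cartesian before the filter removes the unnecessary rows. I hope someone adds conditional joins to dplyr soon.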
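
For readers on a newer dplyr: version 1.1.0 added join_by(), which accepts inequality conditions, so the conditional join wished for above can be sketched directly. This is a minimal sketch under stated assumptions, not part of the original answer: it requires dplyr 1.1.0 or later, obs and intervals are hypothetical toy data, and the half-open interval (time >= start, time < end) follows the question's code.

```r
library(dplyr)  # join_by() requires dplyr >= 1.1.0

# Hypothetical toy data (not the asker's data)
obs <- data.frame(time = c(5, 15, 25), location = "barge")
intervals <- data.frame(location = "barge", ID = 100L, start = 10, end = 20)

# Match on location and keep only interval rows that contain the timestamp;
# obs rows without a containing interval get NA, which coalesce() turns into 0.
res <- obs %>%
  left_join(intervals, by = join_by(location, time >= start, time < end)) %>%
  mutate(ID = coalesce(ID, 0L))

res$ID
```

Because the conditions are part of the join itself, the intermediate Cartesian result described above is never materialized as a data frame.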

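【Comments】:

This looks like an elegant solution to my problem. I am completely unfamiliar with the sqldf package and with SQL, but I will have a look at the vignettes. Thanks!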

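【Solution 3】:

You can use outer to apply a function to two vectors of arbitrary length. It should only do the necessary computations (i.e. the unique combinations). In your case, you would use outer three times for the logical tests and combine the results into a single logical matrix.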

gets_id <- outer(df2$location, df1$location, '==') & 
  outer(df2$time, df1$start, '>=') & 
  outer(df2$time, df1$end, '<=')

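This produces the following output. A TRUE value means that location matches between the data frames and that time falls between start and end. The NA values in the result come from the NA values in start and end.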

      [,1] [,2]  [,3]
 [1,]   NA   NA FALSE
 [2,]   NA   NA FALSE
 [3,]   NA   NA  TRUE
 [4,]   NA   NA  TRUE
 [5,]   NA   NA  TRUE
 [6,]   NA   NA  TRUE
 [7,]   NA   NA  TRUE
 [8,]   NA   NA  TRUE
 [9,]   NA   NA  TRUE
[10,]   NA   NA  TRUE
[11,]   NA   NA  TRUE

Once you have the result, you can manipulate it however you like. The following works for your use case.

assignments <- which(gets_id, arr.ind=TRUE)
df2$id[assignments[,'row']] <- df1$ID[assignments[,'col']]

This results in:

                      time location       id
222195 2014-12-20 02:57:00    barge       NA
222196 2014-12-20 03:12:00    barge       NA
186883 2015-03-25 19:12:00    barge 10035010
186884 2015-03-25 19:14:00    barge 10035010
186885 2015-03-25 19:16:00    barge 10035010
186886 2015-03-25 19:19:00    barge 10035010
186887 2015-03-25 19:21:00    barge 10035010
186888 2015-03-25 19:38:00    barge 10035010
186930 2015-03-28 14:56:00    barge 10035010
186931 2015-03-28 15:02:00    barge 10035010
186932 2015-03-28 15:05:00    barge 10035010
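
The question asked for 0 where nothing matches, whereas the result above leaves NA in those rows. The assignment above also creates df2$id on the fly, which only works when the last row of df2 happens to match. A minimal base R sketch of a more defensive variant follows, assuming hypothetical toy data (obs and intervals with numeric times) rather than the asker's data frames:

```r
# Hypothetical toy data (not the asker's data)
obs <- data.frame(
  time     = c(5, 15, 25, 35),
  location = c("barge", "barge", "dock", "barge")
)
intervals <- data.frame(
  ID       = c(NA, 100L),
  location = c("barge", "barge"),
  start    = c(NA, 10),
  end      = c(NA, 40)
)

# One row per observation, one column per interval; NA where start/end are NA.
hit <- outer(obs$location, intervals$location, "==") &
  outer(obs$time, intervals$start, ">=") &
  outer(obs$time, intervals$end, "<=")

# Pre-fill with 0 so unmatched rows are 0 and the column has full length.
obs$id <- 0L

# which() silently drops NA entries, so only definite matches are assigned.
idx <- which(hit, arr.ind = TRUE)
obs$id[idx[, "row"]] <- intervals$ID[idx[, "col"]]

obs$id
```

If one observation falls inside several intervals, the match from the last interval column overwrites the earlier ones, so overlapping intervals would need an explicit tie-breaking rule.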

【Comments】:

Thanks, this is the best solution: it works right away without needing any packages.