Select the rows that do not exist in another df: answers

【Question title】: Select the rows that do not exist in another df
【Posted】: 2019-07-08 15:48:41
【Question】:

I have a df with trip ID, stop ID, timestamp, and speed.

   trip_id stop_id speed timestamp
 1       1       1     5         1
 2       1       1     0         2
 3       1       1     0         3
 4       1       1     5         4
 5       1       2     2       101
 6       1       2     2       102
 7       1       2     2       103
 8       1       2     2       104
 9       1       3     4       201
10       1       3     0       202

For each group of rows sharing the same trip_id and stop_id, I have saved the first and last rows where the speed is zero.

df_departure_z <- sqldf("SELECT trip_id, stop_id, MAX(timestamp) FROM df WHERE speed = 0 GROUP BY trip_id,stop_id")
df_arrival_z <- sqldf("SELECT trip_id, stop_id, MIN(timestamp) FROM df WHERE speed = 0 GROUP BY trip_id,stop_id")

The results look like this:

df_departure_z:

trip_id stop_id MAX(timestamp)
1       1       1              3
2       1       3            203

df_arrival_z:

trip_id stop_id MIN(timestamp)
1       1       1              2
2       1       3            202

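My problem: there is one stop (stop 2) where the speed is never zero, so I want a way to save one timestamp for stops where the speed is never zero. I tried this: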

df_arr_dep <- sqldf("SELECT trip_id, stop_id, MIN(timestamp) FROM df GROUP BY trip_id, stop_id EXCEPT SELECT trip_id, stop_id FROM df_arrival_z ")

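But it gives me an error, because I am trying to save three columns based on the values in two columns of another df. Basically, I want to search my df again and find the trip_id and stop_id combinations that are not in df_departure_z or df_arrival_z. If I use SELECT * instead, I get all the rows that were not saved, which is also wrong.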

【Comments】:

Try anti_join from dplyr.
Is df_departure_z correct? I only see one 0-speed entry for trip_id 2 and stop_id 3, with a timestamp of 202.
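
The anti_join suggestion above could be sketched roughly as follows. This is only an illustration, assuming a dplyr version that supports the `.groups` argument of `summarise()`; the names `zero_groups` and `never_zero` are made up for this example and do not appear in the question.

```r
library(dplyr)

# Sample data from the question, rebuilt so the sketch is self-contained
df <- data.frame(
  trip_id   = rep(1L, 10),
  stop_id   = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L),
  speed     = c(5L, 0L, 0L, 5L, 2L, 2L, 2L, 2L, 4L, 0L),
  timestamp = c(1L, 2L, 3L, 4L, 101L, 102L, 103L, 104L, 201L, 202L)
)

# (trip_id, stop_id) pairs that have at least one zero-speed row
zero_groups <- df %>%
  filter(speed == 0) %>%
  distinct(trip_id, stop_id)

# Keep only rows whose pair is absent from zero_groups,
# then take the earliest timestamp per remaining pair
never_zero <- df %>%
  anti_join(zero_groups, by = c("trip_id", "stop_id")) %>%
  group_by(trip_id, stop_id) %>%
  summarise(timestamp = min(timestamp), .groups = "drop")

never_zero
```

On the sample data, only the stop 2 group survives the anti-join, since stops 1 and 3 each contain a zero-speed row.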

Tags: r sqldf not-exists

【Answer 1】:

Can you use libraries other than sqldf? I think the following does what you are looking for using dplyr:

library(dplyr)

dat %>%
  group_by(trip_id, stop_id) %>%
  filter(speed == 0 | sum(speed == 0) == 0) %>%
  summarize(min_time = min(timestamp),
            max_time = if_else(sum(speed == 0) == 0,
                               NA_real_,
                               max(timestamp)))

# A tibble: 3 x 4
# Groups:   trip_id [?]
  trip_id stop_id min_time max_time
    <int>   <int>    <dbl>    <dbl>
1       1       1        2        3
2       1       2      101       NA
3       1       3      202      202

Data

dat <- structure(list(trip_id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L),
                      stop_id = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L), 
                      speed = c(5L, 0L, 0L, 5L, 2L, 2L, 2L, 2L, 4L, 0L),
                      timestamp = c(1L, 2L, 3L, 4L, 101L, 102L, 103L, 104L, 201L, 202L)),
                 .Names = c("trip_id", "stop_id", "speed", "timestamp"), 
                 row.names = c(NA, -10L),
                 class = "data.frame")

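【Comments】:

【Answer 2】:

If I understand this correctly, for each trip and stop you want the row with the maximum timestamp among the rows with speed zero or, if there is no such row, the row with the maximum timestamp among the non-0 speed rows in that group. Further down we make an alternate assumption: when a group has no 0-speed rows, just use NA. After that we discuss the EXCEPT query in the question.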

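In the first case above, group by trip, stop, and speed == 0. This gives 2 rows per trip and stop if there are both 0 and non-0 speeds, and 1 row per trip and stop if there are only non-0 speeds. Within each trip and stop we then take the row for which speed == 0 is maximum. Since TRUE > FALSE, if there are two rows it takes the one with speed 0; otherwise it takes the single non-zero speed row. This relies on SQLite's behavior that, when a query contains a single MAX() aggregate, the other selected columns are taken from the row that holds the maximum.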

sqldf("SELECT trip_id, stop_id, timestamp, MAX(speed0) speed0
  FROM 
    (SELECT trip_id, stop_id, MAX(timestamp) timestamp, speed == 0 speed0
    FROM df 
    GROUP BY 1, 2, 4)
  GROUP BY 1, 2")

giving:

  trip_id stop_id timestamp speed0
1       1       1         3      1
2       1       2       104      0
3       1       3       202      1

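The 1 in row 1 of speed0 means that a speed == 0 row was found for that group, so the maximum timestamp of that group's speed == 0 rows was used. Similarly, the 0 in row 2 of speed0 means that no speed == 0 row was found for that group, so the maximum timestamp of that group's non-0 rows was used.

(If you don't want the 4th column, just add [-4] to the end.)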
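
To see the intermediate step described above, the inner query can be run on its own. The sketch below is only an illustration: it rebuilds the sample data inline, adds an ORDER BY (not in the answer's query) so the row order is deterministic, and uses the made-up name `per_group` for the result.

```r
library(sqldf)

# Sample data from the question, rebuilt so the sketch is self-contained
df <- data.frame(
  trip_id   = rep(1L, 10),
  stop_id   = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L),
  speed     = c(5L, 0L, 0L, 5L, 2L, 2L, 2L, 2L, 4L, 0L),
  timestamp = c(1L, 2L, 3L, 4L, 101L, 102L, 103L, 104L, 201L, 202L)
)

# One row per (trip_id, stop_id, speed == 0) combination,
# each carrying the maximum timestamp within that combination
per_group <- sqldf("SELECT trip_id, stop_id, MAX(timestamp) timestamp, speed == 0 speed0
                    FROM df
                    GROUP BY 1, 2, 4
                    ORDER BY 1, 2, 4")

per_group
```

On the sample data this yields five rows: two each for stops 1 and 3 (one with speed0 = 0 and one with speed0 = 1) and a single speed0 = 0 row for stop 2. The outer query then keeps, per trip and stop, the row with the larger speed0.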

Alternate interpretation

If what you want is NA for the groups that have no speed == 0 rows, just replace the first line of the SQL above, like this:

sqldf("SELECT trip_id, stop_id, NULLIF(MAX(speed0) * timestamp, 0) timestamp
  FROM 
    (SELECT trip_id, stop_id, MAX(timestamp) timestamp, speed == 0 speed0
    FROM df 
    GROUP BY 1, 2, 4)
  GROUP BY 1, 2")

giving:

  trip_id stop_id timestamp
1       1       1         3
2       1       2        NA
3       1       3       202

Another approach, which gives the same result, is to use a left join:

sqldf("WITH a(trip_id, stop_id) AS (
         SELECT distinct trip_id, stop_id
         FROM df),
      b(trip_id, stop_id, timestamp) AS (
         SELECT trip_id, stop_id, MAX(timestamp) timestamp
         FROM df
         WHERE speed == 0
         GROUP BY 1, 2)
      SELECT *
      FROM a LEFT JOIN b
      USING (trip_id, stop_id)")

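EXCEPT vs. NOT EXISTS

Regarding the last line of code in the question, the one involving EXCEPT, the same thing can be done with a correlated subquery involving NOT EXISTS, as shown here: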

sqldf("SELECT a.trip_id, a.stop_id, MIN(a.timestamp) timestamp
  FROM df a
  WHERE NOT EXISTS  (
    SELECT *
    FROM df b
    WHERE speed == 0 AND a.trip_id = b.trip_id AND a.stop_id = b.stop_id)
  GROUP by 1, 2")

giving:

  trip_id stop_id timestamp
1       1       2       101
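
For comparison, the query in the question fails because both sides of an EXCEPT must return the same number of columns (three on the left, two on the right). If you prefer to keep EXCEPT, one possible repair is to apply it to the key columns only and then join back to df. This is a sketch rather than part of the original answer; the CTE name `no_zero` and the variable `res` are made up for illustration, and the zero-speed pairs are derived directly from df rather than from df_arrival_z.

```r
library(sqldf)

# Sample data from the question, rebuilt so the sketch is self-contained
df <- data.frame(
  trip_id   = rep(1L, 10),
  stop_id   = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L),
  speed     = c(5L, 0L, 0L, 5L, 2L, 2L, 2L, 2L, 4L, 0L),
  timestamp = c(1L, 2L, 3L, 4L, 101L, 102L, 103L, 104L, 201L, 202L)
)

# EXCEPT compares two-column results on both sides;
# the join then brings back the timestamps for the remaining pairs
res <- sqldf("WITH no_zero AS (
                SELECT trip_id, stop_id FROM df
                EXCEPT
                SELECT trip_id, stop_id FROM df WHERE speed = 0)
              SELECT trip_id, stop_id, MIN(timestamp) timestamp
              FROM df JOIN no_zero USING (trip_id, stop_id)
              GROUP BY 1, 2")

res
```

On the sample data this should return the same single row as the NOT EXISTS query.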

Note

We assume the input is as follows, shown in reproducible form:

Lines <- "
   trip_id stop_id speed timestamp
 1       1       1     5         1
 2       1       1     0         2
 3       1       1     0         3
 4       1       1     5         4
 5       1       2     2       101
 6       1       2     2       102
 7       1       2     2       103
 8       1       2     2       104
 9       1       3     4       201
10       1       3     0       202"
df <- read.table(text = Lines)

【Comments】: