BigQuery: getting speed from time/location data (rows above/below the current row), with answers

【Question title】: BigQuery to get speed based on time/location data (rows above/below current row)
【Posted】: 2019-07-31 01:25:05
【Question】:

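I have a table in BigQuery containing tracking data for NASCAR drivers (dummy data for a project I am working on). The x and y coordinates are sampled 10 times per second. capture_frame identifies the current frame, and consecutive capture_frames should be 100 milliseconds apart, since a sample is taken every 100 milliseconds.

I want to compute each driver's speed for every lap. I know how to do this in pandas, but I think it should be possible in BigQuery. To compute the speed, I look at the rows 2 before and 2 after the current capture_frame, take the distance between those two positions, and divide by the difference in epoch time, which should be 400 milliseconds.

Below are some sample capture frames for one driver on the first lap of one race. Each lap has a few hundred capture frames, and 20 drivers are mixed together in the table, but it is easier to follow if we look at a single driver/race/lap.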

+------+---------------+-----+--------+----+------+-----+------------+-------------+-------------+
| Race | Capture_frame | Lap | Driver | …  | X    | Y   | Epoch_time | Delta_dist  | Curr_speed  |
+------+---------------+-----+--------+----+------+-----+------------+-------------+-------------+
| I500 | 1             | 1   | Logano | …  | 2.1  | 1   | 1552089720 | NULL        | NULL        |
+------+---------------+-----+--------+----+------+-----+------------+-------------+-------------+
| I500 | 2             | 1   | Logano | …  | 2.2  | 1.1 | 1552089820 | NULL        | NULL        |
+------+---------------+-----+--------+----+------+-----+------------+-------------+-------------+
| I500 | 3             | 1   | Logano | …  | 2.22 | 1.2 | 1552089920 | 2.265921446 | 0.005664804 |
+------+---------------+-----+--------+----+------+-----+------------+-------------+-------------+
| I500 | 4             | 1   | Logano | …  | 3.22 | 1.5 | 1552090020 | 3.124163888 | 0.00781041  |
+------+---------------+-----+--------+----+------+-----+------------+-------------+-------------+
| I500 | 5             | 1   | Logano | …  | 4.22 | 1.8 | 1552090120 | NULL        | NULL        |
+------+---------------+-----+--------+----+------+-----+------------+-------------+-------------+
| I500 | 6             | 1   | Logano | …  | 5.22 | 1.9 | 1552090220 | NULL        | NULL        |
+------+---------------+-----+--------+----+------+-----+------------+-------------+-------------+

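The delta_dist for frame 3 is computed as sqrt((4.22-2.1)^2 + (1.8-1)^2)/1, that is, the distance between the positions at frame 5 and frame 1, and curr_speed is that number divided by 400. The first and last 2 distances and speeds will be null because there are no x or y coordinates 2 rows before or after them. That is fine, because you do not really have a speed when you are within 0.1 seconds of starting or stopping.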
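As a quick check of the arithmetic above, here is a minimal sketch in plain Python that recomputes Delta_dist and Curr_speed from the six sample rows. The `frames` dictionary, the `delta_dist` helper, and the `WINDOW_MS` constant are names made up for this illustration; only the coordinates and the 400 ms divisor come from the question.

```python
import math

# (X, Y) positions for capture frames 1..6, copied from the sample table.
frames = {
    1: (2.1, 1.0),
    2: (2.2, 1.1),
    3: (2.22, 1.2),
    4: (3.22, 1.5),
    5: (4.22, 1.8),
    6: (5.22, 1.9),
}

WINDOW_MS = 400.0  # 2 frames back + 2 frames ahead, at 100 ms per frame


def delta_dist(frame, offset=2):
    """Distance between the positions `offset` frames before and after `frame`.

    Returns None when either neighbour is missing (start or end of the lap).
    """
    before = frames.get(frame - offset)
    after = frames.get(frame + offset)
    if before is None or after is None:
        return None
    return math.hypot(after[0] - before[0], after[1] - before[1])


for frame in sorted(frames):
    dist = delta_dist(frame)
    speed = None if dist is None else dist / WINDOW_MS
    print(frame, dist, speed)
```

Frames 3 and 4 are the only ones with both neighbours present, which matches the two non-null rows in the table.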

In pandas I would do something like this (it is not great code, because I just process each driver and race separately):

import numpy as np
from google.cloud import bigquery

client = bigquery.Client()

# driver_list, race_list: drivers and races to process (defined elsewhere)
# laps_per_race: dictionary with the number of laps per race
for driver in driver_list:
    for race in race_list:
        # Driver and Race hold strings, so the values are quoted in the SQL
        driver_race_query = "SELECT * FROM nascar_xyz WHERE Driver = '{driver}' AND Race = '{race}'".format(driver=driver, race=race)
        df_entire_race = client.query(driver_race_query).to_dataframe()
        num_laps = laps_per_race[race]
        # laps are numbered from 1, as in the sample table
        for lap in range(1, num_laps + 1):
            # get the subset of the dataframe for just this lap, ordered by time
            df = df_entire_race.loc[df_entire_race['Lap'] == lap].sort_values('Epoch_time')
            df['prev_x'] = df['X'].shift(2)
            df['next_x'] = df['X'].shift(-2)
            df['prev_y'] = df['Y'].shift(2)
            df['next_y'] = df['Y'].shift(-2)
            # this is just the distance function sqrt((x2-x1)^2 + (y2-y1)^2)
            df['delta_dist'] = np.sqrt((df['next_x'] - df['prev_x'])**2 + (df['next_y'] - df['prev_y'])**2)

            # 400.0 is the actual time difference in milliseconds
            df['Curr_speed'] = df['delta_dist'] / 400.0

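I think my SQL query will need either a GROUP BY or a partition, since I want to look at each race by driver_id and then by lap (if that level of abstraction makes sense). For the speed, and for looking ahead across capture_frames, maybe I can do something with a window (https://cloud.google.com/bigquery/docs/reference/standard-sql/analytic-function-concepts) or with something called LAG, which seems to be the equivalent of .shift() in pandas.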
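To make the analogy concrete, here is a small pandas sketch (not the asker's code) in which `groupby(...).shift()` stands in for LAG/LEAD with a PARTITION BY. It runs on an in-memory copy of the six sample rows rather than on BigQuery, and it assumes that ordering by Capture_frame is equivalent to ordering by Epoch_time.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the nascar_xyz table, using the sample rows from the question.
df = pd.DataFrame({
    "Race": ["I500"] * 6,
    "Driver": ["Logano"] * 6,
    "Lap": [1] * 6,
    "Capture_frame": [1, 2, 3, 4, 5, 6],
    "X": [2.1, 2.2, 2.22, 3.22, 4.22, 5.22],
    "Y": [1.0, 1.1, 1.2, 1.5, 1.8, 1.9],
})

# Sorting first plays the role of ORDER BY; groupby plays the role of PARTITION BY.
df = df.sort_values(["Race", "Driver", "Lap", "Capture_frame"]).reset_index(drop=True)
groups = df.groupby(["Race", "Driver", "Lap"])

# shift(2) looks 2 rows back (like LAG), shift(-2) looks 2 rows ahead (like LEAD),
# and neither crosses a group boundary.
dx = groups["X"].shift(-2) - groups["X"].shift(2)
dy = groups["Y"].shift(-2) - groups["Y"].shift(2)

df["delta_dist"] = np.sqrt(dx**2 + dy**2)
df["Curr_speed"] = df["delta_dist"] / 400.0  # 400 ms between the two neighbours
print(df[["Capture_frame", "delta_dist", "Curr_speed"]])
```

Grouping once over the whole table removes the nested driver/race/lap loops, and rows near the start or end of each lap come out as NaN instead of borrowing values from a neighbouring lap.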

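【Comments】:

Tangential comment: I am surprised this question got 3 votes in less than 5 minutes.
It is still unclear what output you expect. Please provide an example so that we can help without too much guessing.

Tags: python sql pandas google-bigquery geospatial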

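【Solution 1】:

You are on the right track. I will take a dataset of buses driving around Staten Island and use geographic distance, computed from their latitude and longitude. LAG(..., 3) compares each row with the row 3 positions earlier for the same bus: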

WITH data AS (
  SELECT bus, ST_GeogPoint(longitude, latitude) point
    , PARSE_TIMESTAMP('%Y%m%d %H%M%S',FORMAT('%i %06d', day, time)) ts
  FROM `fh-bigquery.mta_nyc_si.201410_bustime`
  WHERE day=20141014
  AND bus IN (7043, 7086, 7076, 2421, 7052, 7071)
)


SELECT * 
FROM (
  SELECT bus, ts, distance/time speed
  FROM (
    SELECT bus, ts
      , ST_DISTANCE(point, LAG(point, 3) OVER(PARTITION BY bus ORDER BY ts)) distance
      , TIMESTAMP_DIFF(ts, LAG(ts, 3) OVER(PARTITION BY bus ORDER BY ts), SECOND) time
    FROM data
  )
  WHERE time IS NOT null 
)
WHERE speed < 500

【Comments】: