[Question title]: Is there a way to resample time series data in BigQuery?
[Posted]: 2017-07-07 02:39:12
[Question description]:

A pandas DataFrame has the resample method shown below. What I want is the equivalent done by a query in BigQuery.

Example in pandas
Suppose I have data like this, and that the same data is stored in BigQuery.

In [2]: df.head()
Out[2]: 
                        Open     High      Low    Close  Volume
Gmt time                                                       
2016-01-03 22:00:00  1.08730  1.08730  1.08702  1.08714    8.62
2016-01-03 22:01:00  1.08718  1.08718  1.08713  1.08713    3.75
2016-01-03 22:02:00  1.08714  1.08721  1.08714  1.08720    4.60
2016-01-03 22:03:00  1.08717  1.08721  1.08714  1.08721    7.57
2016-01-03 22:04:00  1.08718  1.08718  1.08711  1.08711    5.52

I then resample the data to a 5-minute frequency using the DataFrame.

In [3]: ohlcv = {
      :         'Open':'first',
      :         'High':'max',
      :         'Low':'min',
      :         'Close':'last',
      :         'Volume':'sum'
      :         }
      : df = df.resample('5min').agg(ohlcv)  # 5-minute frequency ('5T' is the older alias)
      : df = df[['Open', 'High', 'Low', 'Close', 'Volume']]  # reorder columns
      : df.head()
      : 
      : 
Out[3]: 
                        Open     High      Low    Close  Volume
Gmt time                                                       
2016-01-03 22:00:00  1.08730  1.08730  1.08702  1.08711   30.06
2016-01-03 22:05:00  1.08711  1.08727  1.08709  1.08709  190.63
2016-01-03 22:10:00  1.08708  1.08709  1.08662  1.08666  168.79
2016-01-03 22:15:00  1.08666  1.08674  1.08666  1.08667  223.83
2016-01-03 22:20:00  1.08667  1.08713  1.08666  1.08667  170.17

This can be done after fetching the 1-minute data from BigQuery.
But is there a way to do the resampling in the BigQuery query itself?

EDIT

Details of what pandas DataFrame resample does:

                        Open     High      Low    Close  Volume
Gmt time                                                       
# 1 minute frequency data stored in bigquery
2016-01-03 22:00:00  1.08730  1.08730  1.08702  1.08714    8.62
2016-01-03 22:01:00  1.08718  1.08718  1.08713  1.08713    3.75
2016-01-03 22:02:00  1.08714  1.08721  1.08714  1.08720    4.60
2016-01-03 22:03:00  1.08717  1.08721  1.08714  1.08721    7.57
2016-01-03 22:04:00  1.08718  1.08718  1.08711  1.08711    5.52

2016-01-03 22:05:00  1.08711  1.08714  1.08711  1.08711   27.47
2016-01-03 22:06:00  1.08717  1.08720  1.08711  1.08711   21.58
2016-01-03 22:07:00  1.08713  1.08718  1.08712  1.08715   28.12
2016-01-03 22:08:00  1.08714  1.08723  1.08712  1.08718   49.74
2016-01-03 22:09:00  1.08722  1.08727  1.08709  1.08709   63.72

# expected query result
# above will be resampled into below..
2016-01-03 22:00:00  1.08730  1.08730  1.08702  1.08711   30.06
2016-01-03 22:05:00  1.08711  1.08727  1.08709  1.08709  190.63
# method to resample 'first'  'max'    'min'    'last'    'sum'

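From the 1-minute data, the first 5 rows (22:00 to 22:04) are resampled into 1 row (22:00),
and the next 5 rows (22:05 to 22:09) into 1 row (22:05).
The resampling methods for the five columns are first, max, min, last, and sum, respectively.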

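first takes the first value in the group (a group here means 5 rows),
max takes the maximum value,
min takes the minimum value,
last takes the last value,
sum takes the sum of the column over the group.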
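
To make the expected result concrete, here is a minimal pure-Python sketch of the grouping described above, run on the ten rows from the question. It is not how pandas implements resample; `resample_ohlcv` is a hypothetical helper, and it assumes buckets are aligned to clock boundaries (22:00, 22:05, ...).

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"

# 1-minute rows copied from the question: (time, Open, High, Low, Close, Volume)
ROWS = [
    ("2016-01-03 22:00:00", 1.08730, 1.08730, 1.08702, 1.08714, 8.62),
    ("2016-01-03 22:01:00", 1.08718, 1.08718, 1.08713, 1.08713, 3.75),
    ("2016-01-03 22:02:00", 1.08714, 1.08721, 1.08714, 1.08720, 4.60),
    ("2016-01-03 22:03:00", 1.08717, 1.08721, 1.08714, 1.08721, 7.57),
    ("2016-01-03 22:04:00", 1.08718, 1.08718, 1.08711, 1.08711, 5.52),
    ("2016-01-03 22:05:00", 1.08711, 1.08714, 1.08711, 1.08711, 27.47),
    ("2016-01-03 22:06:00", 1.08717, 1.08720, 1.08711, 1.08711, 21.58),
    ("2016-01-03 22:07:00", 1.08713, 1.08718, 1.08712, 1.08715, 28.12),
    ("2016-01-03 22:08:00", 1.08714, 1.08723, 1.08712, 1.08718, 49.74),
    ("2016-01-03 22:09:00", 1.08722, 1.08727, 1.08709, 1.08709, 63.72),
]


def resample_ohlcv(rows, minutes=5):
    """Group rows into fixed-width time buckets and aggregate each bucket."""
    width = timedelta(minutes=minutes)
    epoch = datetime(1970, 1, 1)  # assumption: buckets aligned to this origin
    buckets = {}
    for ts, op, hi, lo, cl, vol in sorted(rows):  # sorted by timestamp string
        t = datetime.strptime(ts, FMT)
        start = t - (t - epoch) % width  # floor to the bucket boundary
        buckets.setdefault(start, []).append((op, hi, lo, cl, vol))

    bars = []
    for start in sorted(buckets):
        group = buckets[start]
        bars.append((
            start.strftime(FMT),
            group[0][0],                          # first Open
            max(g[1] for g in group),             # max High
            min(g[2] for g in group),             # min Low
            group[-1][3],                         # last Close
            round(sum(g[4] for g in group), 2),   # sum Volume
        ))
    return bars


bars = resample_ohlcv(ROWS)
for bar in bars:
    print(bar)
```

The two printed bars correspond to the two "expected query result" rows shown above.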

For more details, see the pandas documentation.

[Question comments]:

    Tags: python pandas google-bigquery


    [Solution 1]:

    Try the following:

    #standardSQL
    SELECT * EXCEPT(step) 
    FROM (
      SELECT *, TIMESTAMP_DIFF(TIMESTAMP(ts), 
                  TIMESTAMP(MIN(ts) OVER(ORDER BY ts)), MINUTE) AS step
      FROM yourTable
    )
    WHERE MOD(step, 5) = 0
    -- ORDER BY ts   
    

    You can control the sampling interval by changing the 5 in MOD(step, 5) and the MINUTE in TIMESTAMP_DIFF.
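
For readers who want to see what this first query does without running it, here is a rough Python sketch of the same filter. `every_nth_minute` is a hypothetical helper, and the whole-minute offset only approximates TIMESTAMP_DIFF(..., MINUTE). Note that it keeps every n-th row as-is and aggregates nothing.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"


def every_nth_minute(timestamps, n=5):
    """Keep timestamps whose whole-minute offset from the earliest one is a multiple of n."""
    parsed = sorted(datetime.strptime(ts, FMT) for ts in timestamps)
    if not parsed:
        return []
    first = parsed[0]  # plays the role of MIN(ts) OVER(ORDER BY ts)
    kept = []
    for t in parsed:
        step = int((t - first).total_seconds() // 60)  # minutes since the first row
        if step % n == 0:                              # WHERE MOD(step, 5) = 0
            kept.append(t.strftime(FMT))
    return kept


# Thirteen 1-minute timestamps, 22:00 through 22:12, like the dummy data
sample = ["2016-01-03 22:%02d:00" % m for m in range(13)]
kept = every_nth_minute(sample, n=5)
print(kept)
```

With the thirteen timestamps above, only 22:00, 22:05 and 22:10 survive; the other rows are simply dropped.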

    You can use the dummy data below to play with this:

    WITH yourTable AS (
      SELECT '2016-01-03 22:00:00' AS ts, 1.08730 AS Open, 1.08730 AS High, 1.08702 AS Low, 1.08714 AS Close, 8.62 AS Volume UNION ALL
      SELECT '2016-01-03 22:01:00', 1.08718, 1.08718, 1.08713, 1.08713, 3.75 UNION ALL
      SELECT '2016-01-03 22:02:00', 1.08714, 1.08721, 1.08714, 1.08720, 4.60 UNION ALL
      SELECT '2016-01-03 22:03:00', 1.08717, 1.08721, 1.08714, 1.08721, 7.57 UNION ALL
      SELECT '2016-01-03 22:04:00', 1.08718, 1.08718, 1.08711, 1.08711, 5.52 UNION ALL
      SELECT '2016-01-03 22:05:00', 1.08718, 1.08718, 1.08713, 1.08713, 3.75 UNION ALL
      SELECT '2016-01-03 22:06:00', 1.08714, 1.08721, 1.08714, 1.08720, 4.60 UNION ALL
      SELECT '2016-01-03 22:07:00', 1.08717, 1.08721, 1.08714, 1.08721, 7.57 UNION ALL
      SELECT '2016-01-03 22:08:00', 1.08718, 1.08718, 1.08711, 1.08711, 5.52 UNION ALL
      SELECT '2016-01-03 22:09:00', 1.08718, 1.08718, 1.08713, 1.08713, 3.75 UNION ALL
      SELECT '2016-01-03 22:10:00', 1.08714, 1.08721, 1.08714, 1.08720, 4.60 UNION ALL
      SELECT '2016-01-03 22:11:00', 1.08717, 1.08721, 1.08714, 1.08721, 7.57 UNION ALL
      SELECT '2016-01-03 22:12:00', 1.08718, 1.08718, 1.08711, 1.08711, 5.52 
    )
    

    The version below implements the pandas-style resample (following the logic in the updated question):

    #standardSQL
    SELECT 
      MIN(ts) AS ts,
      ARRAY_AGG(Open ORDER BY ts)[OFFSET (0)] AS Open,
      MAX(High) AS High,
      MIN(Low) AS Low,
      ARRAY_AGG(Close ORDER BY ts DESC)[OFFSET (0)] AS Close,
      SUM(Volume) AS Volume
    FROM (
      SELECT *, DIV(TIMESTAMP_DIFF(TIMESTAMP(ts), 
                  TIMESTAMP(MIN(ts) OVER(ORDER BY ts)), MINUTE), 5) AS grp
      FROM yourTable
    )
    GROUP BY grp
    -- ORDER BY ts
    

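    Or a further simplified version, with just one GROUP BY and no window function. It also assumes your data is later than '2000-01-01 00:00:00'; otherwise, adjust the anchor timestamp accordingly.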

    #standardSQL
    SELECT 
      MIN(ts) AS ts,
      ARRAY_AGG(Open ORDER BY ts)[OFFSET (0)] AS Open,
      MAX(High) AS High,
      MIN(Low) AS Low,
      ARRAY_AGG(Close ORDER BY ts DESC)[OFFSET (0)] AS Close,
      SUM(Volume) AS Volume
    FROM yourTable
    GROUP BY DIV(TIMESTAMP_DIFF(TIMESTAMP(ts), 
                 TIMESTAMP('2000-01-01 00:00:00'), MINUTE), 5)
    -- ORDER BY ts
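
The last two queries differ in where the buckets start: one anchors them at the earliest row, the other at a fixed timestamp. A small Python sketch of that difference, assuming data that begins at 22:02 instead of 22:00 (`bucket_ids` is a hypothetical helper that mimics DIV(TIMESTAMP_DIFF(...), 5) for timestamps at or after the anchor):

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"


def bucket_ids(timestamps, anchor, minutes=5):
    """Bucket number of each timestamp: whole minutes since the anchor, integer-divided by the width."""
    a = datetime.strptime(anchor, FMT)
    ids = []
    for ts in timestamps:
        diff = int((datetime.strptime(ts, FMT) - a).total_seconds() // 60)
        ids.append(diff // minutes)  # matches DIV only when diff >= 0
    return ids


# Assumed example: the series starts off the 5-minute grid
ts = ["2016-01-03 22:%02d:00" % m for m in range(2, 7)]  # 22:02 .. 22:06

by_min = bucket_ids(ts, anchor=min(ts))                    # anchored at the first row
by_epoch = bucket_ids(ts, anchor="2000-01-01 00:00:00")    # anchored at a fixed timestamp

print(by_min)
print([i - by_epoch[0] for i in by_epoch])  # relative ids, easier to read
```

Anchored at the first row, all five rows fall into one bucket (22:02 to 22:06). Anchored at the fixed timestamp, 22:02 to 22:04 and 22:05 to 22:06 land in separate buckets, which lines up with clock boundaries such as 22:00 and 22:05. In both queries MIN(ts) labels each group with its first row present, not with the start of the bucket.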
    

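    [Comments]:

    • Thanks, but this just drops the rows whose time does not match the desired frequency. What I showed is resample, which computes each column with a chosen formula. For example, the "Volume" column is resampled by sum, and the "High" column by max.
    • I see. Your question was not clear about that! I suggest giving a better explanation of the "resampling" logic you expect.
    • I get it now. I am surprised it is called resampling, because it does not look like that to me, but it may be domain-specific terminology. In any case, implementing this logic should be simple. I will only have a chance to look into it next week, but someone on SO may have time to answer :o).
    • Indeed, it sounds a bit confusing. Anyway, your answer is very useful for anyone who wants to play with time series in BigQuery. Hope you find time next week! :)
    • Thank you! This is exactly what I wanted, and I am surprised it can be this simple!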