How to pivot N observations of a time series column at a time

【Question title】: How to pivot N observations of a time series column at a time
【Posted】: 2020-11-15 03:17:55
【Question description】:

I have a DataFrame like this:

    date
2018-02-28 09:00:00    78700.0
2018-02-28 10:00:00    78900.0
2018-02-28 11:00:00    78100.0
2018-02-28 12:00:00    78100.0
2018-02-28 13:00:00    77500.0
                        ...
2018-11-30 11:00:00    70000.0
2018-11-30 12:00:00    69800.0
2018-11-30 13:00:00    69800.0
2018-11-30 14:00:00    69600.0
2018-11-30 15:00:00    69400.0

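I want to pivot the time series variable so that each row holds a fixed-length window of consecutive observations (here the window is 6 time steps, so I want 6 columns per row). The expected result below resembles a subset of a Toeplitz matrix.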

date                       0        1        2        3        4        5
2018-02-28 09:00:00  78700.0  78900.0  78100.0  78100.0  77500.0  77100.0
2018-02-28 10:00:00  78900.0  78100.0  78100.0  77500.0  77100.0  77100.0
2018-02-28 11:00:00  78100.0  78100.0  77500.0  77100.0  77100.0  76300.0
2018-02-28 12:00:00  78100.0  77500.0  77100.0  77100.0  76300.0  76200.0
2018-02-28 13:00:00  77500.0  77100.0  77100.0  76300.0  76200.0  76700.0
...                      ...      ...      ...      ...      ...      ...
2018-11-29 12:00:00  72000.0  72000.0  71800.0  71500.0  71500.0  70000.0
2018-11-29 13:00:00  72000.0  71800.0  71500.0  71500.0  70000.0  70000.0
2018-11-29 14:00:00  71800.0  71500.0  71500.0  70000.0  70000.0  69800.0
2018-11-29 15:00:00  71500.0  71500.0  70000.0  70000.0  69800.0  69800.0
2018-11-30 09:00:00  71500.0  70000.0  70000.0  69800.0  69800.0  69600.0

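I got this working by splitting the series into chunks and appending each chunk to a new DataFrame, but it is too slow. Is there an elegant way to perform this transformation?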

【Question comments】:

Tags: python pandas dataframe scipy

【Solution 1】:

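One way to get the output you want is to use a Hankel matrix plus some array manipulation. You can build the Hankel matrix with the scipy.linalg.hankel function.

Below I define a custom function, time_series_to_hankel(), which takes as input your pandas DataFrame, the name of the time series column you want to stack into rows, and the number of time steps.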

import numpy as np
import pandas as pd
from scipy.linalg import hankel

def time_series_to_hankel(data, ts_col, n_steps):
    
    # generate hankel dataframe for the time series column
    h = hankel(data[ts_col])[:(data.shape[0] - n_steps + 1), :n_steps]
    h_df = pd.DataFrame(h, columns=['t_' + str(i) for i in range(h.shape[1])])
    
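    # drop the time series column (ts_col, rather than a hard-coded name) and
    # keep only the rows that have a full window; assumes a default RangeIndex
    temp_df = data.drop(columns=[ts_col]).loc[:(h.shape[0] - 1)]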
    
    # concat the two dataframes
    return pd.concat([temp_df, h_df], axis=1)

If you want to understand the rationale behind each step, I suggest running the function one step at a time.

Example
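To make the two slicing steps concrete, here is a minimal sketch on a five-element series. The array `values` and the window length of 3 are made up for illustration and are not from the question's data. It shows that `hankel()` pads the lower-right corner with zeros, which is why the function keeps only the first `len(values) - n_steps + 1` rows.

```python
import numpy as np
from scipy.linalg import hankel

# Toy data, chosen only for illustration
values = np.array([10, 20, 30, 40, 50])
n_steps = 3

# Step 1: full Hankel matrix. Row i starts at values[i]; positions that
# would run past the end of the series are filled with zeros.
full = hankel(values)

# Step 2: keep the rows that contain a complete window, and only the
# first n_steps columns of each.
windows = full[: len(values) - n_steps + 1, :n_steps]

print(full)
print(windows)
```

Note that the intermediate matrix is square (length of the series on each side), so for long series it can take a lot of memory before the slice is applied.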

import numpy as np
import pandas as pd
from scipy.linalg import hankel

# similar to your sample dataset
df = pd.DataFrame({
    'date': pd.date_range('2018-02-28 09:00:00', '2018-11-30 15:00:00', freq='H'),
    'test_var': np.random.randint(1, 10, size=6607),
    'value': np.linspace(78700, 69400, num=6607).astype(int)
})

time_series_to_hankel(df, 'value', n_steps=6)
                    date  test_var    t_0    t_1    t_2    t_3    t_4    t_5
0    2018-02-28 09:00:00         7  78700  78698  78697  78695  78694  78692
1    2018-02-28 10:00:00         9  78698  78697  78695  78694  78692  78691
2    2018-02-28 11:00:00         2  78697  78695  78694  78692  78691  78690
3    2018-02-28 12:00:00         8  78695  78694  78692  78691  78690  78688
4    2018-02-28 13:00:00         1  78694  78692  78691  78690  78688  78687
...                  ...       ...    ...    ...    ...    ...    ...    ...
6597 2018-11-30 06:00:00         8  69412  69411  69409  69408  69407  69405
6598 2018-11-30 07:00:00         4  69411  69409  69408  69407  69405  69404
6599 2018-11-30 08:00:00         3  69409  69408  69407  69405  69404  69402
6600 2018-11-30 09:00:00         6  69408  69407  69405  69404  69402  69401
6601 2018-11-30 10:00:00         4  69407  69405  69404  69402  69401  69400
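As an alternative to the Hankel approach above (not part of the original answer), the same windows can be produced with NumPy's `sliding_window_view`, which avoids building the full square matrix. This is a sketch under a few assumptions: NumPy 1.20 or later is available, the DataFrame has a default RangeIndex, and the helper name `time_series_to_windows` and the toy frame `df_small` are hypothetical.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

def time_series_to_windows(data, ts_col, n_steps):
    # Hypothetical helper: each output row holds n_steps consecutive values
    windows = sliding_window_view(data[ts_col].to_numpy(), n_steps)
    w_df = pd.DataFrame(windows, columns=[f"t_{i}" for i in range(n_steps)])

    # Keep the other columns for the rows that have a full window
    rest = data.drop(columns=[ts_col]).iloc[: len(w_df)].reset_index(drop=True)
    return pd.concat([rest, w_df], axis=1)

# Toy data, chosen only for illustration
df_small = pd.DataFrame({
    "date": pd.date_range("2018-02-28 09:00:00", periods=5,
                          freq=pd.Timedelta(hours=1)),
    "value": [10, 20, 30, 40, 50],
})

result = time_series_to_windows(df_small, "value", n_steps=3)
print(result)
```

The result has `len(data) - n_steps + 1` rows, the same as the Hankel version, but the intermediate array is only as wide as the window.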

【Comments】: