【Question Title】: Finding length of max discontinuity
【Posted】: 2016-11-04 07:43:21
【Question】:

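I am migrating some code from Python list primitives to a pandas implementation. For some time series, I want to find all the discontinuous segments and their durations. Is there a clean way to do this in pandas?

My dataframe looks like this: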

In [23]: df
Out[23]:
2016-07-01 05:35:00    60.466667
2016-07-01 05:40:00          NaN
2016-07-01 05:45:00          NaN
2016-07-01 05:50:00          NaN
2016-07-01 05:55:00          NaN
2016-07-01 06:00:00          NaN
2016-07-01 06:05:00          NaN
2016-07-01 06:10:00          NaN
2016-07-01 06:15:00          NaN
2016-07-01 06:20:00          NaN
2016-07-01 06:25:00          NaN
2016-07-01 06:30:00          NaN
2016-07-01 06:35:00          NaN
2016-07-01 06:40:00          NaN
2016-07-01 06:45:00          NaN
2016-07-01 06:50:00          NaN
2016-07-01 06:55:00          NaN
2016-07-01 07:00:00          NaN
2016-07-01 07:05:00          NaN
2016-07-01 07:10:00          NaN
2016-07-01 07:15:00          NaN
2016-07-01 07:20:00          NaN
2016-07-01 07:25:00          NaN
2016-07-01 07:30:00          NaN
2016-07-01 07:35:00          NaN
2016-07-01 07:40:00          NaN
2016-07-01 07:45:00    63.500000
2016-07-01 07:50:00    67.293333
2016-07-01 07:55:00    67.633333
2016-07-01 08:00:00    68.306667
                         ...
2016-07-01 11:20:00          NaN
2016-07-01 11:25:00          NaN
2016-07-01 11:30:00    62.000000
2016-07-01 11:35:00    69.513333
2016-07-01 11:40:00    64.931298
2016-07-01 11:45:00    51.980000
2016-07-01 11:50:00    55.253333
2016-07-01 11:55:00    51.273333
2016-07-01 12:00:00    52.080000
2016-07-01 12:05:00    54.580000
2016-07-01 12:10:00    55.306667
2016-07-01 12:15:00    55.200000
2016-07-01 12:20:00    57.140000
2016-07-01 12:25:00    57.020000
2016-07-01 12:30:00    57.526667
2016-07-01 12:35:00    57.880000
2016-07-01 12:40:00    67.286667
2016-07-01 12:45:00    58.153333
2016-07-01 12:50:00    57.460000
2016-07-01 12:55:00    54.413333
2016-07-01 13:00:00    55.526667
2016-07-01 13:05:00    56.120000
2016-07-01 13:10:00    55.620000
2016-07-01 13:15:00    56.420000
2016-07-01 13:20:00    51.893333
2016-07-01 13:25:00    74.451613
2016-07-01 13:30:00    54.898551
2016-07-01 13:35:00          NaN
2016-07-01 13:40:00    63.355140
2016-07-01 13:45:00    61.000000
Freq: 5T, dtype: float64

For example, the first discontinuity runs from 5:40 to 7:40.

【Comments】:

  • This looks like a Series, not a DataFrame.

Tags: python numpy pandas time-series


【Solution 1】:

This should work as long as you have a Series or a single-column DataFrame.

>>> pd.Series(df.dropna().index).diff()

It can be refined as follows to get more useful output:

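import pandas as pd

MIN_GAP_TIMEDELTA = pd.Timedelta(minutes=30)
# Time differences between consecutive non-null timestamps
discontinuities = pd.Series(df.dropna().index).diff()
# Series.sort() was removed from pandas; sort_values() returns a sorted copy
discontinuities = discontinuities.sort_values(ascending=False)
discontinuities[discontinuities > MIN_GAP_TIMEDELTA].size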
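If you also want the start, end, and duration of every gap, as asked in the question, one possible sketch is to label consecutive runs of NaNs and aggregate each run. The sample series `s` and the names `is_gap`, `run_id`, and `gaps` are made up for illustration, and the regular 5-minute step is an assumption taken from the `Freq: 5T` line in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: a 5-minute series containing two gaps.
idx = pd.date_range("2016-07-01 05:35", periods=10, freq="5min")
s = pd.Series(
    [60.5, np.nan, np.nan, np.nan, 63.5, 67.3, np.nan, 62.0, 69.5, 64.9],
    index=idx,
)

is_gap = s.isna()
# A new run starts whenever the NaN flag changes; cumsum gives each run an id.
run_id = (is_gap != is_gap.shift()).cumsum()

# Keep only the timestamps that fall inside NaN runs and summarise each run.
ts = s.index.to_series()
gaps = ts[is_gap].groupby(run_id[is_gap]).agg(start="min", end="max", steps="count")

# Assumes a regular 5-minute frequency, so duration = number of missing samples * 5 min.
gaps["duration"] = gaps["steps"] * pd.Timedelta(minutes=5)

longest = gaps.loc[gaps["duration"].idxmax()]
print(gaps)
print(longest)
```

Here `start` and `end` are the first and last missing timestamps of each run, which matches the "5:40 to 7:40" convention used in the question.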

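【Comments】:

    【Solution 2】:

    This is not as elegant or short as the pandas-based solution, but if performance matters, consider using NumPy arrays and functions. To solve this case, and assuming the datetimes have a regular frequency, here is a NumPy-based approach that gets the discontinuity lengths, the maximum length, and the thresholded count: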

    # Get the start and stop indices of the discontinuities (runs of NaNs)
    idx = np.where(np.diff(np.hstack(([False],np.isnan(df[0]),[False]))))[0]
    
    # Do differentiation on those indices which would give us the length of 
    # intervals of discontinuities. These could be used in various ways.
    discontinuity_lens = np.diff(idx.reshape(-1,2),axis=1)
    
    # Max discontinuity length
    discontinuity_maxlen = discontinuity_lens.max()
    
    # Count of discontinuities that are greater than a threshold of 30 mins as
    # listed with threshold parameter : MIN_GAP_TIMEDELTA = Timedelta(minutes=30)
    # (in terms of steps that would be 6 because freq of input dataframe is 5 mins)
    thresholded_count = (discontinuity_lens>=6).sum()
    

    Note that this is largely based on another NumPy solution to: Longest run/island of a number in Python.
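To see what the edge indices look like, here is a small sketch of the same trick on a toy array. The array `values` and the names `edges`, `starts`, `stops`, and `run_lengths` are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical toy input: NaN runs of length 3, 1 and 2 (the last one touches the end).
values = np.array([1.0, np.nan, np.nan, np.nan, 2.0, np.nan, 3.0, np.nan, np.nan])

mask = np.isnan(values)
# Pad with False on both sides so every run has both a start edge and a stop edge.
padded = np.hstack(([False], mask, [False]))
# np.diff on booleans marks positions where the flag flips.
edges = np.where(np.diff(padded))[0]

# Edges alternate start, stop, start, stop, ... (stop is exclusive).
starts, stops = edges.reshape(-1, 2).T
run_lengths = stops - starts

print(starts)             # index of the first NaN in each run
print(run_lengths)        # number of consecutive NaNs in each run
print(run_lengths.max())  # longest run, in steps
```

Multiplying `run_lengths` by the sampling interval converts the step counts into durations, which only makes sense under the regular-frequency assumption stated above.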

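    Runtime test

    I'll time @ilmarinen's pandas based solution against the NumPy-based approach posted earlier in this post, on a reasonably large dataframe filled with random elements and with 50% NaNs placed at random.

    Function definitions: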

    def thresholdedcount_pandas(df):
        MIN_GAP_TIMEDELTA = pd.Timedelta(minutes=30)
        discontinuities = df.dropna().reset_index()['index'].diff()
        return (discontinuities > MIN_GAP_TIMEDELTA).sum()
    
    def thresholdedcount_numpy(df):
        idx = np.where(np.diff(np.hstack(([False],np.isnan(df[0]),[False]))))[0]
        nan_interval_lens = np.diff(idx.reshape(-1,2),axis=1)
        return (nan_interval_lens>=6).sum()
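The two thresholds look different (`> 30 minutes` versus `>= 6` steps) but are meant to agree: with 5-minute data, a run of n NaNs between two valid samples produces a jump of (n + 1) * 5 minutes, so a jump longer than 30 minutes corresponds to n >= 6. The following is a minimal sketch of that check on a hypothetical series `s`; it uses a Series rather than the single-column dataframe from the functions above, so it is an illustration and not the benchmarked code.

```python
import numpy as np
import pandas as pd

# Hypothetical 5-minute series with one run of 5 NaNs and one run of 6 NaNs.
rng = pd.date_range("2011-01-01", periods=20, freq="5min")
s = pd.Series(np.arange(20, dtype=float), index=rng)
s.iloc[2:7] = np.nan    # 5 NaNs -> 30-minute jump between valid samples
s.iloc[10:16] = np.nan  # 6 NaNs -> 35-minute jump between valid samples

# pandas-style count: jumps between consecutive valid timestamps.
jumps = s.dropna().index.to_series().diff()
pandas_count = int((jumps > pd.Timedelta(minutes=30)).sum())

# NumPy-style count: lengths of NaN runs, in steps.
edges = np.where(np.diff(np.hstack(([False], np.isnan(s.to_numpy()), [False]))))[0]
run_lengths = np.diff(edges.reshape(-1, 2), axis=1).ravel()
numpy_count = int((run_lengths >= 6).sum())

print(pandas_count, numpy_count)
```

One caveat: a NaN run at the very start or end of the data has no valid sample on one side, so the timestamp-difference approach cannot see it while the run-length approach still counts it. The two counts can therefore differ on such inputs.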
    

    Timings:

    In [325]: # Random dataframe with 5 min interval data and filled with 50% NaNs
         ...: rng = pd.date_range('1/1/2011', periods=10000, freq='5Min')
         ...: df = pd.DataFrame(np.random.randn(len(rng)), index=rng)
         ...: df[0][np.random.randint(0,df.shape[0],(int(df.shape[0]/2)))] = np.nan
         ...: 
    
    In [326]: np.allclose(thresholdedcount_pandas(df),thresholdedcount_numpy(df))
    Out[326]: True
    
    In [327]: %timeit thresholdedcount_pandas(df)
    100 loops, best of 3: 3 ms per loop
    
    In [328]: %timeit thresholdedcount_numpy(df)
    1000 loops, best of 3: 318 µs per loop
    

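    【Comments】:

    • Will look into this solution. For now I'd rather go with a simpler solution, since performance isn't a big concern here; it mostly runs in background jobs.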