【Question title】: Most efficient way to determine overlapping timeseries in Python
【Posted】: 2017-02-14 14:17:15
【Question】:

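I am trying to determine the percentage of time that two time series overlap, using Python's pandas library. The data is asynchronous, so the timestamps of the data points do not line up. Here is an example:

Time series 1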

2016-10-05 11:50:02.000734    0.50
2016-10-05 11:50:03.000033    0.25
2016-10-05 11:50:10.000479    0.50
2016-10-05 11:50:15.000234    0.25
2016-10-05 11:50:37.000199    0.50
2016-10-05 11:50:49.000401    0.50
2016-10-05 11:50:51.000362    0.25
2016-10-05 11:50:53.000424    0.75
2016-10-05 11:50:53.000982    0.25
2016-10-05 11:50:58.000606    0.75

Time series 2

2016-10-05 11:50:07.000537    0.50
2016-10-05 11:50:11.000994    0.50
2016-10-05 11:50:19.000181    0.50
2016-10-05 11:50:35.000578    0.50
2016-10-05 11:50:46.000761    0.50
2016-10-05 11:50:49.000295    0.75
2016-10-05 11:50:51.000835    0.75
2016-10-05 11:50:55.000792    0.25
2016-10-05 11:50:55.000904    0.75
2016-10-05 11:50:57.000444    0.75

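Assuming each series holds its value until the next change, what is the most efficient way to determine the percentage of time during which they have the same value?

Example

Let's compute the time these series overlap, starting at 11:50:07.000537 and ending at 11:50:57.000444, since that is the period for which we have data for both series. Times with overlap: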

  • 11:50:10.000479 - 11:50:15.000234 (both 0.5): 4.999755 seconds
  • 11:50:37.000199 - 11:50:49.000295 (both 0.5): 12.000096 seconds
  • 11:50:53.000424 - 11:50:53.000982 (both 0.75): 0.000558 seconds
  • 11:50:55.000792 - 11:50:55.000904 (both 0.25): 0.000112 seconds

Result: (4.999755 + 12.000096 + 0.000558 + 0.000112) / 49.999907 = 34%
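
The arithmetic in this example can be double-checked with a few lines of standard-library Python. This is only a sketch: the interval list is copied by hand from the bullets, and the names `ts`, `overlaps`, `window_start`, and `window_end` are illustrative, not part of the original post.

```python
from datetime import datetime


def ts(clock):
    # Parse a time of day on 2016-10-05 with microsecond precision.
    return datetime.strptime("2016-10-05 " + clock, "%Y-%m-%d %H:%M:%S.%f")


# Overlapping intervals, copied from the list in the example.
overlaps = [
    ("11:50:10.000479", "11:50:15.000234"),
    ("11:50:37.000199", "11:50:49.000295"),
    ("11:50:53.000424", "11:50:53.000982"),
    ("11:50:55.000792", "11:50:55.000904"),
]
window_start, window_end = ts("11:50:07.000537"), ts("11:50:57.000444")

# Total matching time and total length of the shared window, in seconds.
shared = sum((ts(b) - ts(a)).total_seconds() for a, b in overlaps)
total = (window_end - window_start).total_seconds()
print(round(shared, 6), round(total, 6), round(100 * shared / total, 2))
```

The ratio comes out at roughly 34%, matching the figure quoted in the example.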

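One problem is that my real time series contain much more data, e.g. 1,000 to 10,000 observations, and I need to run many more pairs. I considered forward-filling one series, then simply comparing rows and dividing the number of matches by the total number of rows, but I don't think that would be very efficient.

【Comments】:

    Tags: python performance pandas time-series pandas-groupby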


    【Answer 1】:

    Setup
    Create the two time series

    from io import StringIO
    import pandas as pd
    
    
    txt1 = """2016-10-05 11:50:02.000734    0.50
    2016-10-05 11:50:03.000033    0.25
    2016-10-05 11:50:10.000479    0.50
    2016-10-05 11:50:15.000234    0.25
    2016-10-05 11:50:37.000199    0.50
    2016-10-05 11:50:49.000401    0.50
    2016-10-05 11:50:51.000362    0.25
    2016-10-05 11:50:53.000424    0.75
    2016-10-05 11:50:53.000982    0.25
    2016-10-05 11:50:58.000606    0.75"""
    
    # squeeze('columns') turns the single remaining column into a Series
    s1 = pd.read_csv(StringIO(txt1), sep=r'\s{2,}', engine='python',
                     parse_dates=[0], index_col=0, header=None
                     ).squeeze('columns').rename('s1').rename_axis(None)
    
    txt2 = """2016-10-05 11:50:07.000537    0.50
    2016-10-05 11:50:11.000994    0.50
    2016-10-05 11:50:19.000181    0.50
    2016-10-05 11:50:35.000578    0.50
    2016-10-05 11:50:46.000761    0.50
    2016-10-05 11:50:49.000295    0.75
    2016-10-05 11:50:51.000835    0.75
    2016-10-05 11:50:55.000792    0.25
    2016-10-05 11:50:55.000904    0.75
    2016-10-05 11:50:57.000444    0.75"""
    
    s2 = pd.read_csv(StringIO(txt2), sep=r'\s{2,}', engine='python',
                     parse_dates=[0], index_col=0, header=None
                     ).squeeze('columns').rename('s2').rename_axis(None)
    

    TL;DR

    df = pd.concat([s1, s2], axis=1).ffill().dropna()
    overlap = df.index.to_series().diff().shift(-1) \
                .fillna(pd.Timedelta(0)).groupby(df.s1.eq(df.s2)).sum()
    overlap.div(overlap.sum())
    
    False    0.666657
    True     0.333343
    Name: duration, dtype: float64
    

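    Explanation

    Build the base pd.DataFrame df

    • Use pd.concat to align the indexes
    • Use ffill to propagate values forward
    • Use dropna to drop one series' values from before the other series starts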

    df = pd.concat([s1, s2], axis=1).ffill().dropna()
    df
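
The three alignment steps can be seen in isolation on a tiny, made-up pair of series. This is a sketch with illustrative data and names (`a`, `b`, `aligned`, `filled`, `toy`); the extra `sort_index()` call is a defensive assumption to guarantee rows are in time order and is not part of the answer's code.

```python
import pandas as pd

# Two tiny, made-up series with non-aligned timestamps (illustrative data only).
a = pd.Series(
    [1.0, 2.0],
    index=pd.to_datetime(["2020-01-01 00:00:00", "2020-01-01 00:00:04"]),
    name="a",
)
b = pd.Series(
    [1.0, 2.0],
    index=pd.to_datetime(["2020-01-01 00:00:02", "2020-01-01 00:00:06"]),
    name="b",
)

# Union of both indexes; NaN wherever a series has no point at that timestamp.
aligned = pd.concat([a, b], axis=1).sort_index()
# Carry each series' last known value forward.
filled = aligned.ffill()
# Drop the rows before both series have started.
toy = filled.dropna()
print(toy)
```

Only the row at 00:00:00 is dropped here, because `b` has no value yet at that point.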
    

    Compute 'duration'
    From the current timestamp to the next one

    df['duration'] = df.index.to_series().diff().shift(-1).fillna(pd.Timedelta(0))
    df
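
What `diff().shift(-1)` does is easier to see on three made-up timestamps. A minimal sketch, with illustrative names (`idx`, `stamps`, `step_back`, `duration`):

```python
import pandas as pd

idx = pd.to_datetime(
    ["2020-01-01 00:00:02", "2020-01-01 00:00:04", "2020-01-01 00:00:09"]
)
stamps = idx.to_series()

# Time since the previous row (NaT on the first row).
step_back = stamps.diff()
# Shift up by one so each row holds the time until the next row (NaT on the last row).
duration = step_back.shift(-1)
# The last row has no known end, so it is counted as zero duration.
duration = duration.fillna(pd.Timedelta(0))
print(duration)
```

Each row ends up holding how long its value stays in effect, which is what the later groupby sums.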
    

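    Compute the overlap

    • df.s1.eq(df.s2) gives a boolean Series that is True where s1 and s2 overlap
    • Use groupby on that boolean Series to total the duration for True and for False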

    overlap = df.groupby(df.s1.eq(df.s2)).duration.sum()
    overlap
    
    False   00:00:33.999548
    True    00:00:17.000521
    Name: duration, dtype: timedelta64[ns]
    

    Percentage of time with the same value

    overlap.div(overlap.sum())
    
    False    0.666657
    True     0.333343
    Name: duration, dtype: float64
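
The 0.333343 here is measured over 51.000069 seconds (33.999548 + 17.000521), because the concatenated index runs up to the last timestamp of s1 (11:50:58.000606). The question's 34% uses a window that ends at the last timestamp of s2 (11:50:57.000444). A sketch of one way to restrict the calculation to the span covered by both series follows; the clipping step, the `sort_index()` call, and the names `day`, `window_end`, and `share` are assumptions added for illustration, not part of the answer.

```python
import pandas as pd

day = "2016-10-05 11:50:"
s1 = pd.Series(
    [0.50, 0.25, 0.50, 0.25, 0.50, 0.50, 0.25, 0.75, 0.25, 0.75],
    index=pd.to_datetime([day + t for t in (
        "02.000734", "03.000033", "10.000479", "15.000234", "37.000199",
        "49.000401", "51.000362", "53.000424", "53.000982", "58.000606")]),
    name="s1",
)
s2 = pd.Series(
    [0.50, 0.50, 0.50, 0.50, 0.50, 0.75, 0.75, 0.25, 0.75, 0.75],
    index=pd.to_datetime([day + t for t in (
        "07.000537", "11.000994", "19.000181", "35.000578", "46.000761",
        "49.000295", "51.000835", "55.000792", "55.000904", "57.000444")]),
    name="s2",
)

df = pd.concat([s1, s2], axis=1).sort_index().ffill().dropna()

# Assumption: only the span where both series have data should count,
# i.e. up to the earlier of the two final timestamps.
window_end = min(s1.index[-1], s2.index[-1])
df = df.loc[:window_end]

duration = df.index.to_series().diff().shift(-1).fillna(pd.Timedelta(0))
overlap = duration.groupby(df.s1.eq(df.s2)).sum()
share = overlap / overlap.sum()
print(share)
```

With the window clipped this way, the True share should come out near 0.34, in line with the hand calculation in the question.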
    

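    【Comments】:

    • Neat! Cosmetic question: is there any reason to use .eq instead of == here, other than coding style?
    • @Boud Coding style. I don't like wrapping the whole expression in parentheses when I want to chain something else onto it. I think I even tested it as faster in some cases.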
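
The stylistic point in this exchange can be illustrated with a short sketch on made-up values (`left`, `right` are illustrative names). It only shows that the two spellings give the same result and how each one chains; it makes no claim about speed.

```python
import pandas as pd

left = pd.Series([0.5, 0.25, 0.75])
right = pd.Series([0.5, 0.50, 0.75])

# Operator form: needs parentheses before another method can be chained.
with_operator = (left == right).sum()

# Method form: chains directly and performs the same element-wise comparison.
with_method = left.eq(right).sum()

print(with_operator, with_method)
```
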
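    【Answer 2】:

    Cool problem. I brute-forced this rather than using pandas or numpy operations, but I got your answer (thanks for working it out). I haven't tested it on anything else. I also don't know how fast it is: it only walks through each dataframe once, but it does no vectorization.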

    import pandas as pd
    #############################################################################
    #Preparing the dataframes
    times_1 = ["2016-10-05 11:50:02.000734","2016-10-05 11:50:03.000033",
               "2016-10-05 11:50:10.000479","2016-10-05 11:50:15.000234",
               "2016-10-05 11:50:37.000199","2016-10-05 11:50:49.000401",
               "2016-10-05 11:50:51.000362","2016-10-05 11:50:53.000424",
               "2016-10-05 11:50:53.000982","2016-10-05 11:50:58.000606"]
    times_1 = [pd.Timestamp(t) for t in times_1]
    vals_1 = [0.50,0.25,0.50,0.25,0.50,0.50,0.25,0.75,0.25,0.75]
    
    times_2 = ["2016-10-05 11:50:07.000537","2016-10-05 11:50:11.000994",
               "2016-10-05 11:50:19.000181","2016-10-05 11:50:35.000578",
               "2016-10-05 11:50:46.000761","2016-10-05 11:50:49.000295",
               "2016-10-05 11:50:51.000835","2016-10-05 11:50:55.000792",
               "2016-10-05 11:50:55.000904","2016-10-05 11:50:57.000444"]
    times_2 = [pd.Timestamp(t) for t in times_2]
    vals_2 = [0.50,0.50,0.50,0.50,0.50,0.75,0.75,0.25,0.75,0.75]
    
    data_1 = pd.DataFrame({"time":times_1,"vals":vals_1})
    data_2 = pd.DataFrame({"time":times_2,"vals":vals_2})
    #############################################################################
    
    shared_time = 0      #Keep running tally of shared time
    t1_ind = 0           #Pointer to row in data_1 dataframe
    t2_ind = 0           #Pointer to row in data_2 dataframe
    
    #Loop through both dataframes once, incrementing either the t1 or t2 index
    #Stop one before the end of both since do +1 indexing in loop
    while t1_ind < len(data_1.time)-1 and t2_ind < len(data_2.time)-1:
        #Get val1 and val2
        val1,val2 = data_1.vals[t1_ind], data_2.vals[t2_ind]
    
        #Get the start and stop of the current time window
        t1_start,t1_stop = data_1.time[t1_ind], data_1.time[t1_ind+1]
        t2_start,t2_stop = data_2.time[t2_ind], data_2.time[t2_ind+1]
    
        #If the start of time window 2 is in time window 1
        if val1 == val2 and (t1_start <= t2_start <= t1_stop):
            shared_time += (min(t1_stop,t2_stop)-t2_start).total_seconds()
            t1_ind += 1
        #If the start of time window 1 is in time window 2
        elif val1 == val2 and t2_start <= t1_start <= t2_stop:
            shared_time += (min(t1_stop,t2_stop)-t1_start).total_seconds()
            t2_ind += 1
        #If there is no time window overlap and time window 2 is larger
        elif t1_start < t2_start:
            t1_ind += 1
        #If there is no time window overlap and time window 1 is larger
        else:
            t2_ind += 1
    
    #How I calculated the maximum possible shared time (not pretty)
    shared_start = max(data_1.time[0],data_2.time[0])
    shared_stop = min(data_1.time.iloc[-1],data_2.time.iloc[-1])
    max_possible_shared = (shared_stop-shared_start).total_seconds()
    
    #Print output
    print("Shared time:", round(shared_time, 6))
    print("Total possible shared:", round(max_possible_shared, 6))
    print("Percent shared:", round(shared_time*100/max_possible_shared, 10), "%")
    

    Output:

    Shared time: 17.000521
    Total possible shared: 49.999907
    Percent shared: 34.0011052421 %
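
Since the answer notes that the loop was not tested on other inputs, here is a sketch of a variant of the same single-pass idea that intersects the two current windows and then advances whichever one ends first. It is an alternative written for illustration, not the answer's code; `shared_seconds`, `at`, and the toy data are hypothetical names and values.

```python
from datetime import datetime


def shared_seconds(times_a, vals_a, times_b, vals_b):
    """Seconds during which two step functions hold the same value.

    Each series keeps its value until its next timestamp. Assumption: the
    last point of each series marks the end of its data, so only the span
    covered by both series is considered.
    """
    i = j = 0
    shared = 0.0
    while i < len(times_a) - 1 and j < len(times_b) - 1:
        # Intersection of the current window of each series.
        start = max(times_a[i], times_b[j])
        stop = min(times_a[i + 1], times_b[j + 1])
        if start < stop and vals_a[i] == vals_b[j]:
            shared += (stop - start).total_seconds()
        # Advance whichever window ends first (both on a tie).
        end_a, end_b = times_a[i + 1], times_b[j + 1]
        if end_a <= end_b:
            i += 1
        if end_b <= end_a:
            j += 1
    return shared


def at(second):
    # Helper for the toy data: a timestamp `second` seconds past midnight.
    return datetime(2020, 1, 1, 0, 0, second)


# Series a: value 1 on [0 s, 4 s), value 2 on [4 s, 10 s).
# Series b: value 1 on [2 s, 6 s), value 2 on [6 s, 8 s).
toy_shared = shared_seconds(
    [at(0), at(4), at(10)], [1, 2, 2],
    [at(2), at(6), at(8)], [1, 2, 2],
)
print(toy_shared)
```

In the toy data the values agree on [2 s, 4 s) and [6 s, 8 s), so the function should report 4 seconds of shared time out of the 6-second common window.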
    

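    【Comments】:

    • Cool, this is what I was looking for, though I was hoping for a faster solution. I did need to change the logic under if val1 != val2: to use the same index update as when there is a match, because you can't assume that only the first index should advance.
    • @klib Yes, I also suspect there is a pandas or numpy solution. You're absolutely right that you can't assume the first index; I've edited my answer.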