【Question title】: Fancy time series grouping operations with Pandas dataframe
【Posted】: 2016-10-02 21:52:32
【Question】:

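I am updating an implementation in which I have to use pandas and take advantage of its features, and I would appreciate some help. I have a pandas dataframe of events that looks like this: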

      ID               Start                 End
0 243552 2010-12-12 23:00:53 2010-12-12 23:37:14
1 243621 2010-12-12 23:25:58 2010-12-13 02:20:40
2 243580 2010-12-12 23:39:19 2010-12-13 07:22:39
3 243579 2010-12-12 23:42:53 2010-12-13 05:40:14
4 243491 2010-12-12 23:43:53 2010-12-13 07:48:14
...
...

The dtype of ID is int64, and the dtype of Start and End is datetime64[ns]. Note that the dataframe is sorted by the Start column, but not necessarily by the End column.

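I want to analyze this data over a time range between two user-supplied timestamps t1 and t2, split into periods of equal, user-supplied length, and generate a new dataframe indexed by the timestamps of those periods.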

What I want to do is group the data by period and produce 5 columns: Start_count, End_count, Span_avg, Start_inter_avg and End_inter_avg. For example, grouping by 10-minute periods, I would like to get this:

                     Start_count  End_count      Span_avg  Start_inter_avg  End_inter_avg
Period
2010-12-12 23:10:00            1          0      00:36:21         00:00:00       00:00:00
2010-12-12 23:20:00            0          0             0         00:00:00       00:00:00
2010-12-12 23:30:00            1          0      02:54:42         00:00:00       00:00:00
2010-12-12 23:40:00            1          1      07:43:20         00:00:00       00:00:00
2010-12-12 23:50:00            2          0      07:00:51         00:01:00       00:00:00
...
...

The dtypes would be int64 for Start_count and End_count, and timedelta64[ns] for Span_avg, Start_inter_avg and End_inter_avg. The columns of the dataframe I want to generate are:

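  • Start_count: the number of timestamps in the Start column of the original dataframe that fall within the time span ]Period - 10 min, Period];
  • End_count: the same as Start_count, but using the End column;
  • Span_avg: computed as follows: 1) take the entries of the dataframe whose Start value lies in ]Period - 10 min, Period]; 2) compute the difference End - Start for each of them; 3) average these values.
  • Start_inter_avg: computed as follows: 1) take the entries whose Start value lies in ]Period - 10 min, Period] and sort them (well, they are already sorted); 2) compute the time differences between consecutive timestamps; 3) average these differences. (So if a period contains 3 start timestamps [a, b, c], there are 2 timedelta differences, [b-a, c-b], and the final value is ((b-a)+(c-b))/2.)
  • End_inter_avg: computed in the same way as Start_inter_avg, but using the data from the End column. (Note that the values must now be sorted first.)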
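
The five definitions above can be sketched roughly as follows. This is only one possible reading, assuming `dt.ceil` is an acceptable way to map each timestamp to the right edge of its ]Period - 10 min, Period] bin and that empty periods should be filled with zeros. The helper names `inter_avg` and `summarize` are made up for illustration, and the sample rows are the five shown at the top of the question.

```python
import pandas as pd


def inter_avg(ts: pd.Series) -> pd.Timedelta:
    """Mean gap between consecutive timestamps (zero if fewer than two)."""
    gaps = ts.sort_values().diff().dropna()
    return gaps.mean() if len(gaps) else pd.Timedelta(0)


def summarize(df: pd.DataFrame, period: str = "10min") -> pd.DataFrame:
    # ceil() maps a timestamp to the right edge of ]Period - period, Period]
    start_bin = df["Start"].dt.ceil(period).rename("Period")
    end_bin = df["End"].dt.ceil(period).rename("Period")
    span = df["End"] - df["Start"]

    out = pd.DataFrame({
        "Start_count": df["Start"].groupby(start_bin).size(),
        "End_count": df["End"].groupby(end_bin).size(),
        "Span_avg": span.groupby(start_bin).mean(),
        "Start_inter_avg": df["Start"].groupby(start_bin).apply(inter_avg),
        "End_inter_avg": df["End"].groupby(end_bin).apply(inter_avg),
    })

    # Insert the periods in which nothing happened, then fill the gaps
    full = pd.date_range(out.index.min(), out.index.max(),
                         freq=period, name="Period")
    out = out.reindex(full)
    for col in ("Start_count", "End_count"):
        out[col] = out[col].fillna(0).astype("int64")
    for col in ("Span_avg", "Start_inter_avg", "End_inter_avg"):
        out[col] = out[col].fillna(pd.Timedelta(0))
    return out


# The first five events from the question
events = pd.DataFrame({
    "Start": pd.to_datetime([
        "2010-12-12 23:00:53", "2010-12-12 23:25:58", "2010-12-12 23:39:19",
        "2010-12-12 23:42:53", "2010-12-12 23:43:53",
    ]),
    "End": pd.to_datetime([
        "2010-12-12 23:37:14", "2010-12-13 02:20:40", "2010-12-13 07:22:39",
        "2010-12-13 05:40:14", "2010-12-13 07:48:14",
    ]),
})
print(summarize(events, "10min").head())
```

Filling the count columns separately from the timedelta columns is what lets the counts end up as int64 while the averages stay timedelta values.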

For example, the resulting table when grouping by 30 minutes should be:

                     Start_count  End_count      Span_avg  Start_inter_avg  End_inter_avg
Period
2010-12-12 23:30:00            2          0  01:45:31.500         00:25:05       00:00:00
2010-12-13 00:00:00            3          1  07:15:00.666         00:02:17       00:00:00
...
...
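
The numbers in this 30-minute example can be checked against the five rows shown at the top of the question. The sketch below assumes `resample` with `closed="right"` and `label="right"` is a fair stand-in for the ]Period - 30 min, Period] convention; the variable names are illustrative only.

```python
import pandas as pd

starts = pd.to_datetime([
    "2010-12-12 23:00:53", "2010-12-12 23:25:58", "2010-12-12 23:39:19",
    "2010-12-12 23:42:53", "2010-12-12 23:43:53",
])
ends = pd.to_datetime([
    "2010-12-12 23:37:14", "2010-12-13 02:20:40", "2010-12-13 07:22:39",
    "2010-12-13 05:40:14", "2010-12-13 07:48:14",
])

# Spans indexed by Start; bins are right-closed and labelled by their right edge
span = pd.Series(ends - starts, index=starts)
binned = span.resample("30min", closed="right", label="right")
start_count = binned.count()
span_avg = binned.mean()

# Mean gap between the consecutive starts that fall inside ]23:30, 00:00]
inter_avg = pd.Series(starts[2:]).diff().mean()

print(start_count)
print(span_avg)
print(inter_avg)
```

Only the first five events are used here, which is why the second period counts 3 starts rather than the 4 that the full test.csv would give.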

You can experiment with this test.csv file:

ID,Start,End
243552,2010-12-12 23:00:53,2010-12-12 23:37:14
243621,2010-12-12 23:25:58,2010-12-13 02:20:40
243580,2010-12-12 23:39:19,2010-12-13 07:22:39
243579,2010-12-12 23:42:53,2010-12-13 05:40:14
243491,2010-12-12 23:43:53,2010-12-13 07:48:14
243490,2010-12-12 23:43:58,2010-12-13 01:18:40
243465,2010-12-13 00:07:53,2010-12-13 07:26:14
243515,2010-12-13 00:35:58,2010-12-13 03:41:40
243572,2010-12-13 00:46:58,2010-12-13 03:47:40
243520,2010-12-13 01:15:53,2010-12-13 05:14:14
243609,2010-12-13 01:29:53,2010-12-13 08:10:14
243482,2010-12-13 01:44:19,2010-12-13 05:57:39
243563,2010-12-13 01:49:53,2010-12-13 06:04:14
243414,2010-12-13 02:06:16,2010-12-13 02:46:48
243441,2010-12-13 02:15:16,2010-12-13 03:11:48
243548,2010-12-13 02:33:58,2010-12-13 02:49:40
243447,2010-12-13 05:01:42,2010-12-13 21:55:21
243531,2010-12-13 05:53:25,2010-12-13 07:49:59
243583,2010-12-13 05:53:25,2010-12-13 09:00:59
243593,2010-12-13 06:06:25,2010-12-13 09:50:59
243460,2010-12-13 06:14:42,2010-12-13 18:14:44
243596,2010-12-13 06:15:10,2010-12-13 21:47:25
243575,2010-12-13 06:22:42,2010-12-13 20:51:21
243514,2010-12-13 06:24:14,2010-12-13 08:34:07
243421,2010-12-13 06:31:14,2010-12-13 10:57:07
243471,2010-12-13 06:35:23,2010-12-13 14:11:13
243518,2010-12-13 06:36:48,2010-12-13 17:35:39
243565,2010-12-13 06:37:43,2010-12-13 17:16:22
243564,2010-12-13 06:48:16,2010-12-13 16:18:15
243424,2010-12-13 06:48:48,2010-12-13 16:19:39
243437,2010-12-13 06:58:46,2010-12-13 17:11:30
243573,2010-12-13 07:00:14,2010-12-13 09:46:07
243585,2010-12-13 07:01:35,2010-12-13 09:01:38
243483,2010-12-13 07:02:16,2010-12-13 16:36:15
243425,2010-12-13 07:04:21,2010-12-13 16:03:50
243570,2010-12-13 07:07:48,2010-12-13 08:51:04
243507,2010-12-13 07:10:03,2010-12-13 15:58:48
243535,2010-12-13 07:10:23,2010-12-13 11:31:13
243502,2010-12-13 07:13:21,2010-12-13 19:06:50
243525,2010-12-13 07:13:21,2010-12-13 19:34:50
243486,2010-12-13 07:13:56,2010-12-13 17:49:38
243451,2010-12-13 07:15:58,2010-12-13 17:34:03
243485,2010-12-13 07:17:35,2010-12-13 09:40:38
243487,2010-12-13 07:19:01,2010-12-13 10:39:35
243522,2010-12-13 07:19:25,2010-12-13 18:03:02
243481,2010-12-13 07:19:48,2010-12-13 11:08:04
243545,2010-12-13 07:20:42,2010-12-13 20:38:44
243492,2010-12-13 07:23:07,2010-12-13 17:38:42
243611,2010-12-13 07:23:23,2010-12-13 12:58:13
243508,2010-12-13 07:25:25,2010-12-13 18:29:02
243620,2010-12-13 07:25:46,2010-12-13 17:51:30
243466,2010-12-13 07:27:40,2010-12-13 19:05:58
243582,2010-12-13 07:29:29,2010-12-13 20:08:10
243568,2010-12-13 07:31:17,2010-12-13 15:30:37
243461,2010-12-13 07:32:24,2010-12-13 20:47:52
243623,2010-12-13 07:33:10,2010-12-13 10:34:20
243498,2010-12-13 07:33:25,2010-12-13 16:22:02
243427,2010-12-13 07:33:48,2010-12-13 20:00:39
243526,2010-12-13 07:34:10,2010-12-13 09:46:20
243472,2010-12-13 07:36:10,2010-12-13 20:36:25
243479,2010-12-13 07:36:48,2010-12-13 19:30:39
243494,2010-12-13 07:39:07,2010-12-13 17:03:42
243433,2010-12-13 07:39:35,2010-12-13 09:19:38
243503,2010-12-13 07:40:06,2010-12-13 13:53:08
243429,2010-12-13 07:40:35,2010-12-13 10:54:38
243422,2010-12-13 07:43:23,2010-12-13 10:35:10
243618,2010-12-13 07:46:19,2010-12-13 11:56:40
243445,2010-12-13 07:48:14,2010-12-13 10:15:07
243554,2010-12-13 07:49:14,2010-12-13 09:11:57
243542,2010-12-13 07:49:17,2010-12-13 18:53:37
243501,2010-12-13 07:50:40,2010-12-13 19:29:58
243529,2010-12-13 07:51:18,2010-12-13 17:14:15
243457,2010-12-13 07:53:55,2010-12-13 15:33:27
243613,2010-12-13 07:53:58,2010-12-13 17:00:03
243562,2010-12-13 07:54:01,2010-12-13 14:17:09
243571,2010-12-13 07:54:48,2010-12-13 18:39:39
243541,2010-12-13 07:58:53,2010-12-13 16:02:23
243510,2010-12-13 07:59:10,2010-12-13 19:04:51
243470,2010-12-13 07:59:46,2010-12-13 17:06:30
243448,2010-12-13 07:59:48,2010-12-13 18:38:39
243606,2010-12-13 08:03:21,2010-12-13 18:07:50
243430,2010-12-13 08:04:08,2010-12-13 17:49:41
243495,2010-12-13 08:04:25,2010-12-13 18:15:02
243591,2010-12-13 08:07:08,2010-12-13 17:33:54
243551,2010-12-13 08:07:10,2010-12-13 18:18:25
243459,2010-12-13 08:10:14,2010-12-13 10:53:07
243558,2010-12-13 08:11:00,2010-12-13 11:56:01
243605,2010-12-13 08:13:20,2010-12-13 16:38:14
243452,2010-12-13 08:15:23,2010-12-13 13:50:13
243446,2010-12-13 08:17:06,2010-12-13 14:00:08
243516,2010-12-13 08:17:20,2010-12-13 15:03:14
243450,2010-12-13 08:18:17,2010-12-13 16:21:37
243473,2010-12-13 08:19:22,2010-12-13 12:07:49
243438,2010-12-13 08:20:10,2010-12-13 19:34:25
243464,2010-12-13 08:21:03,2010-12-13 14:44:48
243536,2010-12-13 08:21:29,2010-12-13 17:32:15
243476,2010-12-13 08:21:58,2010-12-13 17:34:03
243595,2010-12-13 08:24:19,2010-12-13 11:38:40
243532,2010-12-13 08:27:10,2010-12-13 20:28:25
243497,2010-12-13 08:27:20,2010-12-13 14:12:14

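Attempted solution (answers part of the question)

Here is my attempt at a solution. It only produces the first 3 columns, Start_count and End_count come out with float64 dtype, and the data is indexed by the left boundary of each period (different from what I asked for, but that is fine). Overall, I would like to know whether this can be done in a simpler, shorter and more elegant way.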

import numpy as np
import pandas as pd

# Loading and parsing
data = pd.read_csv('test.csv')
data.Start = pd.to_datetime(data.Start, format='%Y-%m-%d %H:%M:%S')
data.End = pd.to_datetime(data.End, format='%Y-%m-%d %H:%M:%S')


interval = 10  # minutes

Start_count = pd.Series(1, index=data.Start)
Start_count = Start_count.resample(str(interval)+'t').count()

# End_count series doesn't have the same length as Start_count
End_count = pd.Series(1, index=data.End)
End_count = End_count.resample(str(interval)+'t').count()

# This is an ugly way of going around encountered issues and doing what I wanted
Span = pd.Series(np.float64( (data.End - data.Start) / np.timedelta64(1,'s') ), index=data.Start)
Span_mean = Span.resample(str(interval)+'t').mean()
Span_mean = pd.to_timedelta(Span_mean, unit='s')

# When merging all series in a dataframe it seems that alignment is properly done
new_dataframe = pd.DataFrame(({'Start_count' : Start_count, 'End_count' : End_count, 'Span_avg' : Span_mean}))
new_dataframe.fillna(0,inplace=True)
new_dataframe.index.rename('Periods',inplace=True)

new_dataframe.head()  # Shows:

                     End_count  Span_avg  Start_count
Periods                                              
2010-12-12 23:00:00        0.0  00:36:21          1.0
2010-12-12 23:10:00        0.0  00:00:00          0.0
2010-12-12 23:20:00        0.0  02:54:42          1.0
2010-12-12 23:30:00        1.0  07:43:20          1.0
2010-12-12 23:40:00        0.0  05:12:08          3.0
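
The float64 counts most likely come from alignment: the two count series cover different periods, so combining them introduces NaN, which forces a float dtype. A minimal sketch with made-up toy series (not the real data) shows the effect and one way to get int64 back after filling:

```python
import pandas as pd

# Two count series whose indexes only partly overlap (toy values)
a = pd.Series([1, 2], index=pd.to_datetime(["2010-12-12 23:00", "2010-12-12 23:10"]))
b = pd.Series([3], index=pd.to_datetime(["2010-12-12 23:20"]))

# Alignment on the union of the indexes introduces NaN, hence float64
merged = pd.DataFrame({"Start_count": a, "End_count": b})
print(merged.dtypes)

# Fill the gaps, then cast only the count columns back to integers
fixed = merged.fillna(0).astype({"Start_count": "int64", "End_count": "int64"})
print(fixed.dtypes)
```

Passing a dict to `astype` restricts the cast to the named columns, so a timedelta column such as Span_avg could sit in the same frame without being converted.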

【Question discussion】:

    Tags: python pandas


    【Answer 1】:

    This is a tricky problem, but here is a solution:

    import pandas as pd
    
    period = "10min"
    
    df = pd.read_csv("test.csv", parse_dates=[1, 2])
    span = df.End - df.Start
    start_period = df.Start.dt.floor(period)
    end_period = df.End.dt.floor(period)
    
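    # rename() keeps the column labels "Start" and "End" on pandas >= 2.0,
    # where value_counts() names its result "count"
    start_count = start_period.value_counts(sort=False).rename("Start")
    end_count = end_period.value_counts(sort=False).rename("End")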
    span_average = pd.to_timedelta(
        span.dt.total_seconds().groupby(start_period).mean().round(), 
        unit="s").rename("Span_average")
    
    def average_span(s):
        if len(s) > 1:
            return (s.max() - s.min()).total_seconds() / (len(s) - 1)
        else:
            return 0
    
    start_inter_avg = pd.to_timedelta(
        df.Start.groupby(start_period).agg(average_span).round(),
        unit="s").rename("Start_inter_avg")
    
    end_inter_avg = pd.to_timedelta(
        df.End.groupby(end_period).agg(average_span).round(),
        unit="s").rename("End_inter_avg")
    
    res = pd.concat([start_count, end_count, span_average, start_inter_avg, end_inter_avg], 
                    axis=1).resample(period).asfreq().fillna(0)
    

    Output:

                         Start  End  Span_average  Start_inter_avg  End_inter_avg
    2010-12-12 23:00:00    1.0  0.0      00:36:21         00:00:00       00:00:00
    2010-12-12 23:10:00    0.0  0.0      00:00:00         00:00:00       00:00:00
    2010-12-12 23:20:00    1.0  0.0      02:54:42         00:00:00       00:00:00
    2010-12-12 23:30:00    1.0  1.0      07:43:20         00:00:00       00:00:00
    2010-12-12 23:40:00    3.0  0.0      05:12:08         00:00:32       00:00:00
    2010-12-12 23:50:00    0.0  0.0      00:00:00         00:00:00       00:00:00
    2010-12-13 00:00:00    1.0  0.0      07:18:21         00:00:00       00:00:00
    2010-12-13 00:10:00    0.0  0.0      00:00:00         00:00:00       00:00:00
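
The `average_span` helper in the code above appears to rely on the consecutive differences telescoping: (b-a)+(c-b) = c-a, so the mean gap equals (max - min)/(n - 1) and no explicit sort or diff is needed. A quick check, using the three Start values from test.csv that fall in the 23:40:00 bin:

```python
import pandas as pd

s = pd.Series(pd.to_datetime([
    "2010-12-12 23:42:53", "2010-12-12 23:43:53", "2010-12-12 23:43:58",
]))

# Definition from the question: average of the consecutive differences
mean_of_gaps = s.sort_values().diff().dropna().mean()

# Shortcut: the differences telescope to max - min
shortcut = (s.max() - s.min()) / (len(s) - 1)

print(mean_of_gaps, shortcut)
```

Both expressions give 32.5 seconds, which is consistent with the 00:00:32 shown for 23:40:00 once the result is rounded to whole seconds.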
    

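    【Discussion】:

    • Thanks, great job! I only started using pandas recently and I need to improve my understanding of some of its features. I really like your solution.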