【Title】: pandas groupby hour and calculate stockout time
【Posted】: 2018-12-20 07:54:20
【Question】:

I have a time series like the following:

| datetime_create         | quantity_old | quantity_new | quantity_diff | is_stockout |
| 2018-02-15 08:12:54.289 | 16           | 15           | -1            | False       |
| 2018-02-15 08:14:10.619 | 15           | 13           | -2            | False       |
| 2018-02-15 08:49:15.962 | 13           | 9            | -4            | False       |
| 2018-02-15 08:51:04.740 | 9            | 8            | -1            | False       |
| 2018-02-15 08:56:37.086 | 8            | 7            | -1            | False       |
| 2018-02-15 09:23:22.858 | 7            | 5            | -2            | False       |
| 2018-02-15 10:16:50.324 | 5            | 4            | -1            | False       |
| 2018-02-15 10:19:25.071 | 4            | 3            | -1            | False       |
| 2018-02-15 10:33:22.788 | 3            | 2            | -1            | False       |
| 2018-02-15 10:33:34.125 | 2            | 0            | -2            | True        |
| 2018-02-15 16:45:24.747 | 0            | 1            | 1             | False       |
| 2018-02-15 16:48:29.996 | 1            | 0            | -1            | True        |
| 2018-02-17 10:42:58.325 | 0            | 42           | 42            | False       |
| 2018-02-17 10:47:07.380 | 42           | 41           | -1            | False       |
| 2018-02-17 11:42:31.008 | 41           | 40           | -1            | False       |
| 2018-02-17 11:48:31.070 | 40           | 39           | -1            | False       |
| 2018-02-17 12:39:13.681 | 39           | 38           | -1            | False       |
| 2018-02-17 12:48:00.286 | 38           | 37           | -1            | False       |
| 2018-02-17 12:56:59.203 | 37           | 36           | -1            | False       |
| 2018-02-17 13:18:12.285 | 36           | 35           | -1            | False       |
| 2018-02-17 13:29:53.465 | 35           | 34           | -1            | False       |
| 2018-02-17 14:54:55.810 | 34           | 33           | -1            | False       |
| 2018-02-17 15:53:38.816 | 33           | 32           | -1            | False       |
| 2018-02-17 16:28:08.076 | 32           | 31           | -1            | False       |
| 2018-02-17 16:45:18.965 | 31           | 30           | -1            | False       |
| 2018-02-17 16:59:11.111 | 30           | 29           | -1            | False       |
| 2018-02-17 17:18:53.646 | 29           | 27           | -2            | False       |
| 2018-02-17 17:44:43.508 | 27           | 26           | -1            | False       |
| 2018-02-17 19:34:49.701 | 26           | 25           | -1            | False       |
| 2018-02-17 20:49:00.205 | 25           | 24           | -1            | False       |
| 2018-02-18 07:14:22.207 | 24           | 22           | -2            | False       |
| 2018-02-18 08:35:41.560 | 22           | 20           | -2            | False       |
| 2018-02-18 10:22:18.825 | 20           | 19           | -1            | False       |
| 2018-02-18 10:28:33.909 | 19           | 18           | -1            | False       |
| 2018-02-18 10:37:30.427 | 18           | 17           | -1            | False       |
| 2018-02-18 10:50:55.265 | 17           | 16           | -1            | False       |
| 2018-02-18 11:17:53.359 | 16           | 15           | -1            | False       |
| 2018-02-18 11:42:29.214 | 0            | 30           | 30            | False       |
| 2018-02-18 11:58:19.113 | 15           | 14           | -1            | False       |
| 2018-02-18 11:58:56.432 | 14           | 13           | -1            | False       |
| 2018-02-18 12:06:48.438 | 13           | 12           | -1            | False       |
| 2018-02-18 12:21:43.634 | 12           | 11           | -1            | False       |
| 2018-02-18 12:44:46.288 | 11           | 9            | -2            | False       |
| 2018-02-18 13:26:01.952 | 9            | 8            | -1            | False       |
| 2018-02-18 13:26:40.940 | 8            | 9            | 1             | False       |
| 2018-02-18 13:27:34.090 | 9            | 8            | -1            | False       |
| 2018-02-18 13:27:52.443 | 8            | 9            | 1             | False       |
| 2018-02-18 13:28:58.832 | 9            | 8            | -1            | False       |
| 2018-02-18 14:56:49.105 | 8            | 7            | -1            | False       |
| 2018-02-18 16:00:32.212 | 7            | 6            | -1            | False       |
| 2018-02-18 16:28:20.175 | 6            | 5            | -1            | False       |
| 2018-02-18 16:31:48.741 | 5            | 3            | -2            | False       |
| 2018-02-18 16:40:33.922 | 3            | 2            | -1            | False       |
| 2018-02-18 16:56:17.864 | 2            | 1            | -1            | False       |
| 2018-02-18 17:15:01.065 | 1            | 2            | 1             | False       |
| 2018-02-18 17:40:43.062 | 2            | 1            | -1            | False       |
| 2018-02-18 17:55:50.520 | 1            | 0            | -1            | True        |
| 2018-02-18 18:20:21.664 | 30           | 29           | -1            | False       |
| 2018-02-18 21:38:10.645 | 29           | 28           | -1            | False       |
| 2018-02-19 06:36:04.564 | 28           | 27           | -1            | False       |
| 2018-02-19 08:49:23.080 | 27           | 26           | -1            | False       |

I want to calculate the total stockout time for each hour of each day, for example:

|    date    |  0  |  1  |  2  |  3  | ... | 23  |
| ---------- | --- | --- | --- | --- | --- | --- |
| 2018-02-15 | 10  | 0   | 0   | 10  | ... | 13  |
| 2018-02-16 | 6   | 0   | 7   | 10  | ... | 20  |
| 2018-02-17 | 6   | 0   | 0   | 10  | ... | 20  |

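Rules:

  1. Group by hour.
  2. I can access all rows within one hour.
  3. Calculate the time intervals:

    • Start point: is_stockout changes from False to True
    • End point: is_stockout changes from True to False

    Within one hour there may be many start point / end point pairs.

  4. Change the index to days and the columns to the 24 hours.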
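
The start point / end point rule above can be sketched by comparing each flag with the previous one. This is only an illustration: the `flags` series is made-up data rather than the sample above, the names `prev`, `starts` and `ends` are chosen here, and the row before the first one is assumed to be "in stock".

```python
import pandas as pd

# Hypothetical stockout flags, indexed by the time of each stock change.
flags = pd.Series(
    [False, True, True, False, True],
    index=pd.to_datetime([
        "2018-02-15 10:00", "2018-02-15 10:33", "2018-02-15 10:40",
        "2018-02-15 16:45", "2018-02-15 16:48",
    ]),
)

# Flag of the previous row; the first row has no predecessor, so assume False.
prev = flags.shift(fill_value=False)

starts = flags & ~prev   # False -> True: a stockout begins
ends = ~flags & prev     # True -> False: a stockout ends

print(flags.index[starts].tolist())
print(flags.index[ends].tolist())
```

Each start timestamp pairs with the next end timestamp; a start without a later end means the stockout is still open at the end of the data.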

It is somewhat like new-syntax-to-window-and-resample-operations.

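I think I need to use

df.resample('H').apply(caluclate_time_in_hour)

but that does not seem to be enough:

  1. The result of df.resample('H') has the hours as its index, not as columns.
  2. How do I write a correct caluclate_time_in_hour? I don't think apply can do this.

    I wrote some pseudocode: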

    def caluclate_time_in_hour(item):
        # note: item here is the stock count, not just True or False

        global last_time
        global is_stockout
        global data

        cur_time = item.name

        # I need pandas to return every row, even for hours that have no data,
        # so that there is no need to check how many hours have elapsed.

        if np.isnan(item):
            if is_stockout:
                data[cur_time.hour] = 60*60
            else:
                data[cur_time.hour] = 0

        if is_stockout:
            if item > 0:
                data[cur_time.hour] += cur_time - last_time
            else:
                is_stockout = False
        else:
            if item == 0:
                is_stockout = True

        last_time = item.name

        return data.copy()
    

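    How do I know that this item is the last one in its hour, so that I can return data? That is the problem with apply. Maybe I need pandas to hand me all the rows of an hour at once.

I just want to know whether I can do the above with pandas built-in functions, without looping over all rows to construct a new DataFrame.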


For example, around 2018-02-15 ~ 2018-02-16 there are the following records:

| datetime_create     | quantity_old | quantity_new | quantity_diff | is_stockout |
| 2018-02-14 00:45:00 | 40           | 10           | -30           | False       |
| 2018-02-15 12:45:00 | 10           | 2            | -8            | False       |
| 2018-02-15 13:45:00 | 2            | 1            | -1            | False       |
| 2018-02-15 16:45:00 | 1            | 0            | -1            | True        |
| 2018-02-16 10:42:00 | 0            | 42           | 42            | False       |
| 2018-02-16 13:42:00 | 42           | 40           | -2            | False       |
| 2018-02-16 19:42:00 | 40           | 38           | -2            | False       |
| 2018-02-17 20:42:00 | 38           | 40           | 2             | False       |
# duplicate above 
| 2018-02-18 00:45:00 | 40           | 10           | -30           | False       |
| 2018-02-19 12:45:00 | 10           | 2            | -8            | False       |
| 2018-02-19 13:45:00 | 2            | 1            | -1            | False       |
| 2018-02-19 16:45:00 | 1            | 0            | -1            | True        |
| 2018-02-20 10:42:00 | 0            | 42           | 42            | False       |
| 2018-02-20 13:42:00 | 42           | 40           | -2            | False       |
| 2018-02-20 19:42:00 | 40           | 38           | -2            | False       |
| 2018-02-21 20:42:00 | 38           | 40           | 2             | False       |

csv:

datetime_create,quantity_old,quantity_new,quantity_diff,is_stockout
2018-02-14 00:45:00,40,10,-30,False
2018-02-15 12:45:00,10,2,-8,False
2018-02-15 13:45:00,2,1,-1,False
2018-02-15 16:45:00,1,0,-1,True
2018-02-16 10:42:00,0,42,42,False
2018-02-16 13:42:00,42,40,-2,False
2018-02-16 19:42:00,40,38,-2,False
2018-02-17 20:42:00,38,40,2,False
2018-02-18 00:45:00,40,10,-30,False
2018-02-19 12:45:00,10,2,-8,False
2018-02-19 13:45:00,2,1,-1,False
2018-02-19 16:45:00,1,0,-1,True
2018-02-20 10:42:00,0,42,42,False
2018-02-20 13:42:00,42,40,-2,False
2018-02-20 19:42:00,40,38,-2,False
2018-02-21 20:42:00,38,40,2,False

Result (the time unit here is minutes, for readability):

date,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23
2018-02-14,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2018-02-15,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,15.0,60.0,60.0,60.0,60.0,60.0,60.0,60.0,60.0,60.0
2018-02-16,60.0,60.0,60.0,60.0,60.0,60.0,60.0,60.0,60.0,60.0,42.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2018-02-17,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2018-02-18,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2018-02-19,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,15.0,60.0,60.0,60.0,60.0,60.0,60.0,60.0,60.0,60.0
2018-02-20,60.0,60.0,60.0,60.0,60.0,60.0,60.0,60.0,60.0,60.0,42.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2018-02-21,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0

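【Comments】:

  • I am not sure about the start and end of the time intervals. Is df['s'] = df['is_stockout'].cumsum() correct?
  • @jezrael Sorry, I did not post the whole data structure at first. There is a record of the stock quantity. df['is_stockout'] comes from df['quantity_new'] == 0: if quantity_new is 0, the item is out of stock. At first I thought posting only those two columns would make the question simpler.
  • Thanks, could you add some expected output? The diff is not easy to count, but would the is_stockout column be enough?
  • @jezrael I have updated the question. My original data already has the diff, so there is no need to calculate it.
  • Sorry, would it be possible to add another 2-3 rows to the sample DataFrame, together with the expected output?

Tags: python pandas pandas-groupby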


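【Solution 1】:

I think you first need to resample to minutes with forward filling of the NaNs, convert to integers, and add DataFrame.squeeze to get a Series.

Then aggregate by dates and hours with sum, and finally reshape with unstack: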

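# df is assumed to have datetime_create as a DatetimeIndex.
# One row per minute, carrying the last known flag forward; 'min' is the
# one-minute alias ('T' in the original post, deprecated in newer pandas).
s = df[['is_stockout']].resample('min').ffill().astype(int).squeeze()
df1 = s.groupby([s.index.date, s.index.hour]).sum().unstack(fill_value=0)
print(df1)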
datetime_create  0   1   2   3   4   5   6   7   8   9  ...  14  15  16  17  \
2018-02-14        0   0   0   0   0   0   0   0   0   0 ...   0   0   0   0   
2018-02-15        0   0   0   0   0   0   0   0   0   0 ...   0   0  15  60   
2018-02-16       60  60  60  60  60  60  60  60  60  60 ...   0   0   0   0   
2018-02-17        0   0   0   0   0   0   0   0   0   0 ...   0   0   0   0   
2018-02-18        0   0   0   0   0   0   0   0   0   0 ...   0   0   0   0   
2018-02-19        0   0   0   0   0   0   0   0   0   0 ...   0   0  15  60   
2018-02-20       60  60  60  60  60  60  60  60  60  60 ...   0   0   0   0   
2018-02-21        0   0   0   0   0   0   0   0   0   0 ...   0   0   0   0   

datetime_create  18  19  20  21  22  23  
2018-02-14        0   0   0   0   0   0  
2018-02-15       60  60  60  60  60  60  
2018-02-16        0   0   0   0   0   0  
2018-02-17        0   0   0   0   0   0  
2018-02-18        0   0   0   0   0   0  
2018-02-19       60  60  60  60  60  60  
2018-02-20        0   0   0   0   0   0  
2018-02-21        0   0   0   0   0   0  
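
For reference, here is a self-contained sketch that applies the same steps to the first eight rows of the CSV sample from the question. It assumes `datetime_create` is parsed into a DatetimeIndex, and it uses the `'min'` alias for one-minute bins (equivalent to `'T'`, which newer pandas versions deprecate). The variable names `csv_text`, `s` and `df1` are chosen here for illustration.

```python
import io
from datetime import date

import pandas as pd

csv_text = """datetime_create,quantity_old,quantity_new,quantity_diff,is_stockout
2018-02-14 00:45:00,40,10,-30,False
2018-02-15 12:45:00,10,2,-8,False
2018-02-15 13:45:00,2,1,-1,False
2018-02-15 16:45:00,1,0,-1,True
2018-02-16 10:42:00,0,42,42,False
2018-02-16 13:42:00,42,40,-2,False
2018-02-16 19:42:00,40,38,-2,False
2018-02-17 20:42:00,38,40,2,False
"""

df = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["datetime_create"],
    index_col="datetime_create",
)

# One row per minute; each minute carries the most recent flag (1 = stockout).
s = df["is_stockout"].resample("min").ffill().astype(int)

# Count the stockout minutes per (date, hour), then move the hours to columns.
df1 = s.groupby([s.index.date, s.index.hour]).sum().unstack(fill_value=0)

print(df1.loc[date(2018, 2, 15), 16])  # stockout minutes in hour 16 on 2018-02-15
print(df1.loc[date(2018, 2, 16), 10])  # stockout minutes in hour 10 on 2018-02-16
```

Because every minute counts as either fully in stock or fully out of stock, the result is in whole minutes.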

【Discussion】:

  • Wow! I never thought of resampling to minutes with ffill. That is a clever approach. Thanks for your help.
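
Resampling to minutes rounds every change of state to a whole minute. When the timestamps carry seconds or milliseconds, as in the first table of the question, one possible alternative is to measure the intervals between exact timestamps and cut them at every full hour. The following is a sketch under stated assumptions, not part of the answer above: the function name `stockout_minutes_exact` and the sample `flags` are hypothetical, the input is assumed to be a boolean Series with a sorted, unique DatetimeIndex, and time before the first record is counted as "in stock".

```python
from datetime import date

import pandas as pd


def stockout_minutes_exact(flags: pd.Series) -> pd.DataFrame:
    """Minutes of stockout per (date, hour), measured between exact timestamps."""
    # Cut points: every observed timestamp plus every full hour in between,
    # so that no segment crosses an hour boundary.
    hours = pd.date_range(flags.index[0].floor("h"), flags.index[-1].floor("h"), freq="h")
    grid = flags.index.union(hours)

    # State that holds from each cut point until the next one (1.0 = stockout).
    state = flags.astype(float).reindex(grid).ffill().fillna(0.0)

    # Length of each segment in minutes; the last cut point starts no segment.
    lengths = (grid[1:] - grid[:-1]).total_seconds().to_numpy() / 60
    minutes = pd.Series(lengths, index=grid[:-1]) * state.iloc[:-1]

    out = minutes.groupby([minutes.index.date, minutes.index.hour]).sum()
    return out.unstack(fill_value=0.0)


# Hypothetical flags with second-level timestamps.
flags = pd.Series(
    [False, True, False],
    index=pd.to_datetime([
        "2018-02-15 12:00:00", "2018-02-15 16:45:30", "2018-02-16 10:42:15",
    ]),
)
result = stockout_minutes_exact(flags)
print(result.loc[date(2018, 2, 15), 16])
print(result.loc[date(2018, 2, 16), 10])
```

Nothing is counted after the last record, because the state beyond it is unknown.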