Pandas: Count like a stack using rolling

【Question title】: Pandas: Count like a stack using rolling
【Posted】: 2018-11-14 15:22:16
【Question description】:

I have a table like this (the emails are simplified to a single letter here):

timestamp                  | email
2018-10-17 13:00:00+00:00  | m
2018-10-17 13:00:00+00:00  | m
2018-10-17 13:00:10+00:00  | 
2018-10-17 13:00:10+00:00  | v
2018-10-17 13:00:30+00:00  |  
2018-10-17 13:00:30+00:00  | c
2018-10-17 13:00:50+00:00  | p
2018-10-17 13:01:00+00:00  |  
2018-10-17 13:01:00+00:00  | m
2018-10-17 13:01:00+00:00  | s
2018-10-17 13:01:00+00:00  | b

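Now I want to create a new column that counts, for example, how many times the same email appeared in the last 30 seconds up to each entry.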

timestamp                  | email | count | comment
2018-10-17 13:00:00+00:00  | m     |   1   |
2018-10-17 13:00:00+00:00  | m     |   2   | (there were 2 entries in the last 30s)
2018-10-17 13:00:10+00:00  |       |   1   | (empty we count as well)
2018-10-17 13:00:10+00:00  | v     |   1   |
2018-10-17 13:00:30+00:00  |       |   2   | (counting the empty like emails)
2018-10-17 13:00:30+00:00  | c     |   1   | 
2018-10-17 13:00:50+00:00  | p     |   1   |
2018-10-17 13:01:00+00:00  |       |   2   | (in the last 30s from this ts, we have 2)
2018-10-17 13:01:00+00:00  | m     |   1   | (the first 2 m happened before the last 30s)
2018-10-17 13:01:00+00:00  | s     |   1   |
2018-10-17 13:01:00+00:00  | b     |   1   |
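
The rule behind the expected column can be pinned down with a brute-force sketch in plain Python. It assumes that only the current row and the rows above it are counted (hence the stack-like behaviour for identical timestamps) and that a row exactly 30 seconds older still counts, which is what the count of 2 for the empty email at 13:01:00 implies. The names `rows`, `ts` and `expected_counts` are illustrative, and the empty email is written as an empty string.

```python
from datetime import datetime, timedelta

# Sample rows from the table above (time of day only; all on 2018-10-17 UTC).
rows = [
    ("13:00:00", "m"), ("13:00:00", "m"),
    ("13:00:10", ""), ("13:00:10", "v"),
    ("13:00:30", ""), ("13:00:30", "c"),
    ("13:00:50", "p"),
    ("13:01:00", ""), ("13:01:00", "m"), ("13:01:00", "s"), ("13:01:00", "b"),
]


def ts(text):
    """Parse an HH:MM:SS string into a datetime (the date part is irrelevant here)."""
    return datetime.strptime(text, "%H:%M:%S")


def expected_counts(rows, window=timedelta(seconds=30)):
    """For each row, count equal emails among that row and the rows above it
    whose timestamp is at most `window` older (boundary assumed inclusive)."""
    counts = []
    for i, (t_i, email_i) in enumerate(rows):
        n = 0
        for t_j, email_j in rows[: i + 1]:
            if email_j == email_i and ts(t_i) - ts(t_j) <= window:
                n += 1
        counts.append(n)
    return counts


print(expected_counts(rows))
```

This is quadratic in the number of rows, so it is only useful as a reference to check a faster rolling-based solution against.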

The timestamp is a datetime object:

timestamp          datetime64[ns, UTC]

It is also the index, and it is sorted. My first attempt was this command:

df['email'].groupby(df.email).rolling('120s').count().values

But it does not work on strings, so I converted the emails to unique numbers using:

full_df['email'].factorize()

But the result does not look right:

timestamp                  | email | count | comment
2018-10-17 13:00:00+00:00  | m     |   1   |  
2018-10-17 13:00:00+00:00  | m     |   2   | 
2018-10-17 13:00:10+00:00  |       |   1   | 
2018-10-17 13:00:10+00:00  | v     |   2   |  (No idea about this result)
2018-10-17 13:00:30+00:00  |       |   3   | (Appears to just keep counting)
2018-10-17 13:00:30+00:00  | c     |   1   |  (Then it just goes back to 1 again...)
2018-10-17 13:00:50+00:00  | p     |   2   |
2018-10-17 13:01:00+00:00  |       |   3   | 
2018-10-17 13:01:00+00:00  | m     |   4   | 
2018-10-17 13:01:00+00:00  | s     |   1   |
2018-10-17 13:01:00+00:00  | b     |   1   |

Any idea what I am doing wrong, and how I can get what I want?

Many thanks, João

【Question comments】:

Tags: python pandas jupyter

【Solution 1】:

You can use apply after rolling to count how many times the last element of the window appears in that window, like this:

df['count'] = df['email'].astype('category').cat.codes.rolling('30s').apply(lambda x: (x == x[-1]).sum(), raw=True)
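
A self-contained sketch of this rolling-apply approach on the sample data from the question may help to check it. It rests on a few assumptions that are not in the original answer: the empty emails are represented as empty strings, `raw=True` is passed so that each window arrives as a NumPy array (which makes `window[-1]` the current row), and `count_last` is an illustrative helper name. It also compares the default window, which excludes a row exactly 30 seconds older, with `closed="both"`, which includes it.

```python
import pandas as pd

# Sample data from the question; empty emails are assumed to be empty strings.
index = pd.to_datetime(
    [
        "2018-10-17 13:00:00+00:00", "2018-10-17 13:00:00+00:00",
        "2018-10-17 13:00:10+00:00", "2018-10-17 13:00:10+00:00",
        "2018-10-17 13:00:30+00:00", "2018-10-17 13:00:30+00:00",
        "2018-10-17 13:00:50+00:00",
        "2018-10-17 13:01:00+00:00", "2018-10-17 13:01:00+00:00",
        "2018-10-17 13:01:00+00:00", "2018-10-17 13:01:00+00:00",
    ],
    utc=True,
)
emails = ["m", "m", "", "v", "", "c", "p", "", "m", "s", "b"]
df = pd.DataFrame({"email": emails}, index=index)

# Rolling windows need numeric data, so map each distinct email to an integer code.
codes = df["email"].astype("category").cat.codes


def count_last(window):
    """Count how often the last (current) value occurs in the window array."""
    return (window == window[-1]).sum()


# Default time window is (t - 30s, t]: a row exactly 30 s older is left out.
default_counts = codes.rolling("30s").apply(count_last, raw=True).astype(int)

# closed="both" makes the window [t - 30s, t], so that boundary row is counted.
inclusive_counts = (
    codes.rolling("30s", closed="both").apply(count_last, raw=True).astype(int)
)

# Assign by position, since the timestamp index contains duplicates.
df["count"] = inclusive_counts.to_numpy()
print(df)
```

On this sample the two variants should differ only for the empty email at 13:01:00, where the earlier empty row at 13:00:30 sits exactly on the window boundary. Which one is wanted depends on whether "the last 30 seconds" is meant to include that boundary, as the expected table in the question suggests.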

【Comments】: