【发布时间】:2018-11-14 15:22:16
【问题描述】:
我有一张这样的表格(电子邮件在这里被简化为一个字母):
timestamp | email
2018-10-17 13:00:00+00:00 | m
2018-10-17 13:00:00+00:00 | m
2018-10-17 13:00:10+00:00 |
2018-10-17 13:00:10+00:00 | v
2018-10-17 13:00:30+00:00 |
2018-10-17 13:00:30+00:00 | c
2018-10-17 13:00:50+00:00 | p
2018-10-17 13:01:00+00:00 |
2018-10-17 13:01:00+00:00 | m
2018-10-17 13:01:00+00:00 | s
2018-10-17 13:01:00+00:00 | b
现在,我想创建一个新列,例如计算邮件在输入前最后 30 秒内重复的次数。
timestamp | email | count | comment
2018-10-17 13:00:00+00:00 | m | 1 |
2018-10-17 13:00:00+00:00 | m | 2 | (there were 2 entries in the last 30s)
2018-10-17 13:00:10+00:00 | | 1 | (empty we count as well)
2018-10-17 13:00:10+00:00 | v | 1 |
2018-10-17 13:00:30+00:00 | | 2 | (counting the empty like emails)
2018-10-17 13:00:30+00:00 | c | 1 |
2018-10-17 13:00:50+00:00 | p | 1 |
2018-10-17 13:01:00+00:00 | | 2 | (in the last 30s from this ts, we have 2)
2018-10-17 13:01:00+00:00 | m | 1 | (the first 2 m happened before the last 30s)
2018-10-17 13:01:00+00:00 | s | 1 |
2018-10-17 13:01:00+00:00 | b | 1 |
时间戳是一个 dateTime 对象
timestamp datetime64[ns, UTC]
此外,它是索引并且已排序。 我第一次尝试,这个命令:
df['email'].groupby(df.email).rolling('120s').count().values
但它不适用于字符串,所以我将它转换为唯一的数字,使用:
full_df['email'].factorize()
但结果似乎不对:
timestamp | email | count | comment
2018-10-17 13:00:00+00:00 | m | 1 |
2018-10-17 13:00:00+00:00 | m | 2 |
2018-10-17 13:00:10+00:00 | | 1 |
2018-10-17 13:00:10+00:00 | v | 2 | (No ideia about this result)
2018-10-17 13:00:30+00:00 | | 3 | (Appears to just keeping count)
2018-10-17 13:00:30+00:00 | c | 1 | (Then just go back to 1 again... )
2018-10-17 13:00:50+00:00 | p | 2 |
2018-10-17 13:01:00+00:00 | | 3 |
2018-10-17 13:01:00+00:00 | m | 4 |
2018-10-17 13:01:00+00:00 | s | 1 |
2018-10-17 13:01:00+00:00 | b | 1 |
任何想法我做错了什么,我怎样才能得到我想要的?
非常感谢, 若昂
【问题讨论】: