【Question title】: Count number of times an element in column appears over timestamp
【Posted】: 2019-09-17 20:46:16
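【Question description】:
For a given row in a Pandas DataFrame, I need to count how many times that row's value in a column, e.g. "destination_address_IP", has occurred within the past (say) 2 seconds according to the "Time_stamp" column, and put that number into a new column "count".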
【Discussion】:
Tags:
pandas
dataframe
time
count
【Solution 1】:
You can do this by repeatedly shifting the dataframe. The basic assumption is that the dataframe is sorted by the timestamp column:
import pandas as pd

# define the threshold (2 seconds)
time_threshold = pd.Timedelta(seconds=2)
df['ip_count'] = 1
df_shifted = df
# loop over the dataframe and shift it by one row
# until the time_threshold is violated for all rows
while True:
    # shift the copy of the dataframe
    df_shifted = df_shifted.shift(1)
    # check if time range is ok; comparing whole Timedeltas also
    # handles gaps longer than a day, and NaT compares as False
    ser_time_diff = df['Time_stamp'] - df_shifted['Time_stamp']
    ser_in_time = ser_time_diff < time_threshold
    if ser_in_time.any():
        # there are still rows left, where the shifted
        # frame's timestamp lies within the threshold
        # so we need to count the matches for those rows
        # if there are any
        ser_match = ser_in_time & (df['destination_address_IP'] == df_shifted['destination_address_IP'])
        df['ip_count'] += ser_match.astype('int')
    else:
        # none of the rows of the shifted df was within
        # the threshold of the original df
        # so further shifts will not change the result
        # anymore
        break
df
The result for my test data looks like this:
                    Time_stamp destination_address_IP  ip_count
0   2019-09-17 19:20:45.093209          157.111.73.31         1
1   2019-09-17 19:20:45.297932              127.0.0.1         1
2   2019-09-17 19:20:45.750725          157.111.73.31         2
3   2019-09-17 19:20:46.787009          192.168.21.15         1
4   2019-09-17 19:20:47.601051           52.18.181.18         1
5   2019-09-17 19:20:47.863428           52.18.181.17         1
6   2019-09-17 19:20:48.418591           52.18.181.18         2
7   2019-09-17 19:20:48.596764           52.18.181.17         2
8   2019-09-17 19:20:49.057553          192.168.21.15         1
9   2019-09-17 19:20:49.153256          192.168.21.15         2
10  2019-09-17 19:20:49.712312              127.0.0.1         1
11  2019-09-17 19:20:50.000119           52.18.181.17         2
12  2019-09-17 19:20:50.248562           52.18.181.18         2
13  2019-09-17 19:20:50.603783           52.18.181.18         2
14  2019-09-17 19:20:50.921631           52.18.181.17         2
15  2019-09-17 19:20:51.366193           52.18.181.18         3
16  2019-09-17 19:20:51.528611           52.18.181.18         4
17  2019-09-17 19:20:51.773429           131.53.97.59         1
18  2019-09-17 19:20:52.618215          192.168.21.15         1
19  2019-09-17 19:20:52.936181           52.18.181.18         3
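To make the counting rule concrete, here is a small hand check of row 16, written as a sketch rather than as part of the original answer. The timestamps are copied from the table for 52.18.181.18; the variable names (stamps, current, in_window, count) are only illustrative. A row counts itself plus every earlier row with the same IP whose timestamp is strictly less than 2 seconds older.

```python
import pandas as pd

# Timestamps of 52.18.181.18 up to and including row 16, copied from the table.
stamps = pd.to_datetime([
    "2019-09-17 19:20:47.601051",  # row 4
    "2019-09-17 19:20:48.418591",  # row 6
    "2019-09-17 19:20:50.248562",  # row 12
    "2019-09-17 19:20:50.603783",  # row 13
    "2019-09-17 19:20:51.366193",  # row 15
    "2019-09-17 19:20:51.528611",  # row 16 (the row being counted)
])

current = stamps[-1]
# True where the gap to the current row is strictly below 2 seconds.
in_window = (current - stamps) < pd.Timedelta(seconds=2)
count = int(in_window.sum())
print(count)  # 4: rows 12, 13, 15 and row 16 itself
```

Rows 4 and 6 are more than 3 seconds older than row 16, so they fall outside the window, which matches the ip_count of 4 shown in the table.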
The result above was generated from this data:
import io
import pandas as pd
raw=\
"""Time_stamp destination_address_IP
2019-09-17T19:20:45.093209 157.111.73.31
2019-09-17T19:20:45.297932 127.0.0.1
2019-09-17T19:20:45.750725 157.111.73.31
2019-09-17T19:20:46.787009 192.168.21.15
2019-09-17T19:20:47.601051 52.18.181.18
2019-09-17T19:20:47.863428 52.18.181.17
2019-09-17T19:20:48.418591 52.18.181.18
2019-09-17T19:20:48.596764 52.18.181.17
2019-09-17T19:20:49.057553 192.168.21.15
2019-09-17T19:20:49.153256 192.168.21.15
2019-09-17T19:20:49.712312 127.0.0.1
2019-09-17T19:20:50.000119 52.18.181.17
2019-09-17T19:20:50.248562 52.18.181.18
2019-09-17T19:20:50.603783 52.18.181.18
2019-09-17T19:20:50.921631 52.18.181.17
2019-09-17T19:20:51.366193 52.18.181.18
2019-09-17T19:20:51.528611 52.18.181.18
2019-09-17T19:20:51.773429 131.53.97.59
2019-09-17T19:20:52.618215 192.168.21.15
2019-09-17T19:20:52.936181 52.18.181.18
"""
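df = pd.read_csv(
    io.StringIO(raw),
    # split on any run of whitespace between the two columns
    sep=r'\s+',
    dtype={'destination_address_IP': 'str'},
    # read_csv rejects a datetime64 dtype; parse the column as dates instead
    parse_dates=['Time_stamp'],
    engine='python')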
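For larger frames, the repeated shifting may get slow when many rows fall inside one window. As a possible alternative (a sketch, not the method used in the answer above), the same count can be computed per IP with a binary search over the sorted timestamps. The helper name count_recent and the small demo frame are hypothetical and exist only for illustration; like the answer, the sketch assumes the frame is sorted by the timestamp column.

```python
import numpy as np
import pandas as pd


def count_recent(df, key, ts, window):
    """For each row, count rows with the same `key` (the row itself included)
    whose `ts` is strictly less than `window` older. Assumes df is sorted by `ts`."""
    width = pd.Timedelta(window).to_timedelta64()
    times = df[ts].to_numpy()
    out = np.zeros(len(df), dtype=int)
    # GroupBy.indices maps each key to the integer positions of its rows
    for pos in df.groupby(key).indices.values():
        t = times[pos]
        # position of the first row in this group that is newer than t - width
        first = np.searchsorted(t, t - width, side="right")
        out[pos] = np.arange(len(t)) - first + 1
    return pd.Series(out, index=df.index, name="ip_count")


# hypothetical demo data, not taken from the answer
demo = pd.DataFrame({
    "Time_stamp": pd.to_datetime([
        "2019-09-17 19:20:45.0",
        "2019-09-17 19:20:45.5",
        "2019-09-17 19:20:46.0",
        "2019-09-17 19:20:47.5",
    ]),
    "destination_address_IP": ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.1"],
})
demo["ip_count"] = count_recent(demo, "destination_address_IP", "Time_stamp", "2s")
print(demo)
```

In the demo, the last row counts itself and the 19:20:46.0 row, but not the 19:20:45.0 row, which is 2.5 seconds older. Calling count_recent(df, 'destination_address_IP', 'Time_stamp', '2s') on the test data should apply the same strict 2-second rule as the loop in the answer.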