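Not sure whether this is unnecessarily complicated, but it gets the job done. There may well be a better, more Pythonic way, though...
I first add a few new columns to df: the number of intervals suggested by status_duration (len), the number of seconds that fit into the first interval (first), and the remaining duration (rest):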
df['len'] = 1 + (df['status_duration']-1)//900
df['first'] = ((df['interval_start']+timedelta(seconds=1)).dt.ceil('15min') - df['interval_start']).dt.total_seconds().astype(int)
df['rest'] = df['status_duration'] - df['first']
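To make the first and rest columns concrete, here is a minimal sketch on two made-up rows (the sample values are assumptions for illustration, not the original input data):

```python
import pandas as pd
from datetime import timedelta

# Hypothetical sample rows, not the original input data.
df = pd.DataFrame({
    'interval_start': pd.to_datetime(['2021-09-23 14:28:00', '2021-09-23 14:45:00']),
    'status_duration': [1800, 30],
})

# Seconds from interval_start to the next 15-minute boundary.
# The 1-second shift makes a start lying exactly on a boundary yield 900 instead of 0.
df['first'] = ((df['interval_start'] + timedelta(seconds=1)).dt.ceil('15min')
               - df['interval_start']).dt.total_seconds().astype(int)

# Seconds that spill past the first interval (negative if everything fits into it).
df['rest'] = df['status_duration'] - df['first']

print(df[['interval_start', 'first', 'rest']])
```

The first row starts 120 seconds before 14:30, so 1680 seconds spill over; the second row starts on a boundary and fits entirely into its own interval.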
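Then we correct the number of intervals. A row needs more than one interval only if rest is positive; in that case it gets the first slice, rest//900 full slices, and a final slice of rest % 900 seconds (which is 0 when rest is an exact multiple of 900):
df['len'] = np.where(df['rest'] > 0, 2 + df['rest']//900, 1)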
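Now I replicate the rows by building a new DataFrame with np.repeat(), so that each row appears once per interval, and use list comprehensions over df.iterrows() to build the interval_start and status_duration columns: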
new_df = pd.DataFrame({'specialist': np.repeat(df['specialist'], df['len']),
'language': np.repeat(df['language'], df['len']),
'interval_start': [el for sublist in [[x['interval_start'] + timedelta(minutes=15*y) for y in range(0, x['len'])] if (x['len'] > 1) else [x['interval_start']] for i, x in df.iterrows()] for el in sublist],
'status_duration': [el for sublist in [([x['first']]+[900]*(x['len']-2)+[x['rest']%900]) if x['len'] > 1 else [x['status_duration']] for i, x in df.iterrows()] for el in sublist]
})
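The two building blocks used here, repeating values a per-row number of times and flattening a nested list, can be sketched in isolation as follows (the names and numbers are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical values: two rows that need 3 slices and 1 slice respectively.
names = np.array(['A', 'B'])
counts = np.array([3, 1])

# np.repeat repeats each element as often as the matching count says.
repeated = np.repeat(names, counts)

# One sublist of slice durations per original row ...
nested = [[120, 900, 780], [30]]
# ... flattened into a single list with a nested list comprehension.
flat = [el for sublist in nested for el in sublist]

out = pd.DataFrame({'name': repeated, 'status_duration': flat})
print(out)
```

Both columns end up with the same length because the counts match the lengths of the sublists.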
Then we round the interval start times down to the 15-minute boundary:
new_df['interval_start'] = new_df['interval_start'].dt.floor('15min')
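As a small illustration of the flooring step, assuming the three slice start times that a 30-minute status starting at 14:28 would produce:

```python
import pandas as pd

# Slice start times spaced 15 minutes apart (illustrative values).
starts = pd.Series(pd.to_datetime([
    '2021-09-23 14:28:00',
    '2021-09-23 14:43:00',
    '2021-09-23 14:58:00',
]))

# dt.floor('15min') maps each timestamp to the start of its 15-minute interval.
floored = starts.dt.floor('15min')
print(floored)
```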
All that is left to do now is group and reset the index:
new_df = new_df.groupby(['specialist', 'language', 'interval_start']).sum().reset_index()
Result:
        specialist  language       interval_start  status_duration
0     Donald Trump    German  2021-09-23 14:15:00              120
1     Donald Trump    German  2021-09-23 14:30:00              900
2     Donald Trump    German  2021-09-23 14:45:00              899
3     Donald Trump    German  2021-09-23 15:00:00                5
4     Donald Trump    German  2021-09-24 10:00:00              600
5     Donald Trump    German  2021-09-24 10:15:00               30
6  Monica Lewinsky    German  2021-09-24 10:00:00               30
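The grouping step is what merges slices from different input rows that land in the same interval. A minimal sketch, assuming two slices of 780 and 119 seconds in the 14:45 interval (values chosen to reproduce the 899 above):

```python
import pandas as pd

# Two slices from different input rows that fall into the same 15-minute interval
# (illustrative values).
slices = pd.DataFrame({
    'specialist': ['Donald Trump', 'Donald Trump'],
    'language': ['German', 'German'],
    'interval_start': pd.to_datetime(['2021-09-23 14:45:00', '2021-09-23 14:45:00']),
    'status_duration': [780, 119],
})

# Summing per (specialist, language, interval_start) collapses them into one row.
merged = slices.groupby(['specialist', 'language', 'interval_start']).sum().reset_index()
print(merged)
```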
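One problem remains: the final grouping step can add up slices so that a 15-minute interval ends up with status_duration > 900.
Suppose the second row of your input data has an interval_start that is 2 seconds earlier: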
        specialist  language       interval_start         interval_end  status_duration
0     Donald Trump    German  2021-09-23 14:28:00  2021-09-23 14:58:00             1800
1     Donald Trump    German  2021-09-23 14:57:59  2021-09-23 15:00:03              124
2     Donald Trump    German  2021-09-24 10:05:00  2021-09-24 10:15:30              630
3  Monica Lewinsky    German  2021-09-24 10:05:00  2021-09-24 10:05:30               30
Then you get a status_duration of 901 after grouping:
        specialist  language       interval_start  status_duration
0     Donald Trump    German  2021-09-23 14:15:00              120
1     Donald Trump    German  2021-09-23 14:30:00              900
2     Donald Trump    German  2021-09-23 14:45:00              901
3     Donald Trump    German  2021-09-23 15:00:00                3
4     Donald Trump    German  2021-09-24 10:00:00              600
5     Donald Trump    German  2021-09-24 10:15:00               30
6  Monica Lewinsky    German  2021-09-24 10:00:00               30
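What complicates matters is that this "overflow" can occur several times. One approach is to repeat the steps above until no row of new_df with status_duration > 900 remains. This carries the overflow forward.
Full example: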
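The idea of carrying the overflow forward can be sketched on a plain list of per-interval totals. This is a simplified single-pass variant, not the repeated split-and-group approach itself; carry_overflow is a hypothetical helper, and it assumes the list holds consecutive 15-minute intervals of one specialist:

```python
def carry_overflow(totals, cap=900):
    """Push everything above `cap` seconds into the following interval.

    Hypothetical helper; assumes `totals` are consecutive intervals.
    """
    result = []
    carry = 0
    for value in totals:
        value += carry                  # add what spilled over from the previous interval
        carry = max(value - cap, 0)     # whatever exceeds the cap moves on
        result.append(min(value, cap))
    while carry > 0:                    # leftover overflow opens additional intervals
        result.append(min(carry, cap))
        carry -= cap
    return result


print(carry_overflow([120, 900, 901, 3]))
```

With the totals from the table above, the extra second of the 901 interval moves into the 15:00 interval.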
import pandas as pd
import numpy as np
from datetime import timedelta
input_df = pd.DataFrame(
data=[['Donald Trump', 'German', '2021-9-23 14:28:00','2021-9-23 14:58:00', 1800 ],
['Donald Trump', 'German', '2021-9-23 14:57:59','2021-9-23 15:00:03', 124 ],
['Donald Trump', 'German', '2021-9-24 10:05:00','2021-9-24 10:15:30', 630 ],
['Monica Lewinsky', 'German', '2021-9-24 10:05:00','2021-9-24 10:05:30', 30 ]],
columns=['specialist', 'language', 'interval_start', 'interval_end', 'status_duration']
)
input_df['interval_start'] = pd.to_datetime(input_df['interval_start'])
input_df['interval_end'] = pd.to_datetime(input_df['interval_end'])
def build_df(df):
    df = df.copy()  # leave the caller's DataFrame untouched
    while True:  # always split at least once, even if no duration exceeds 900
        # seconds from interval_start to the next 15-minute boundary (900 if already on one)
        df['first'] = ((df['interval_start']+timedelta(seconds=1)).dt.ceil('15min') - df['interval_start']).dt.total_seconds().astype(int)
        # seconds that spill past the first interval
        df['rest'] = df['status_duration'] - df['first']
        # slices per row: the first slice, rest//900 full slices and a final slice of rest % 900
        df['len'] = np.where(df['rest'] > 0, 2 + df['rest']//900, 1)
        new_df = pd.DataFrame({'specialist': np.repeat(df['specialist'], df['len']),
                               'language': np.repeat(df['language'], df['len']),
                               'interval_start': [el for sublist in [[x['interval_start'] + timedelta(minutes=15*y) for y in range(0, x['len'])] if (x['len'] > 1) else [x['interval_start']] for i, x in df.iterrows()] for el in sublist],
                               'status_duration': [el for sublist in [([x['first']]+[900]*(x['len']-2)+[x['rest']%900]) if x['len'] > 1 else [x['status_duration']] for i, x in df.iterrows()] for el in sublist]
                               })
        new_df['interval_start'] = new_df['interval_start'].dt.floor('15min')
        new_df = new_df[new_df.status_duration != 0]  # drop empty trailing slices
        new_df = new_df.groupby(['specialist', 'language', 'interval_start']).sum().reset_index()
        df = new_df.copy()
        # repeat while grouping still leaves an interval with more than 900 seconds
        if not df['status_duration'].gt(900).any():
            break
    return df
output_df = build_df(input_df)
Result:
        specialist  language       interval_start  status_duration
0     Donald Trump    German  2021-09-23 14:15:00              120
1     Donald Trump    German  2021-09-23 14:30:00              900
2     Donald Trump    German  2021-09-23 14:45:00              900
3     Donald Trump    German  2021-09-23 15:00:00                4
4     Donald Trump    German  2021-09-24 10:00:00              600
5     Donald Trump    German  2021-09-24 10:15:00               30
6  Monica Lewinsky    German  2021-09-24 10:00:00               30
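A quick sanity check on a result like this is that no seconds were lost and no interval exceeds 900 seconds. A minimal sketch, where check_split is a hypothetical helper and the two frames only carry the status_duration values shown above:

```python
import pandas as pd


def check_split(before, after, cap=900):
    """Return True if the total duration is preserved and no interval exceeds `cap`.

    Hypothetical helper for illustration.
    """
    same_total = before['status_duration'].sum() == after['status_duration'].sum()
    within_cap = (after['status_duration'] <= cap).all()
    return bool(same_total and within_cap)


# status_duration values of the input and of the final result above
before = pd.DataFrame({'status_duration': [1800, 124, 630, 30]})
after = pd.DataFrame({'status_duration': [120, 900, 900, 4, 600, 30, 30]})

# the intermediate result, which still contains the 901-second interval
overflowing = pd.DataFrame({'status_duration': [120, 900, 901, 3, 600, 30, 30]})

print(check_split(before, after), check_split(before, overflowing))
```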
Looking at it now, I suspect there should be a simpler way, but this is all I came up with...