【Question title】: Arranging call data from salesforce in 15 minute intervals
【Posted】: 2021-11-20 21:44:51
【Question】:

I'm new to Python and pandas as well as to Stack Overflow, so I apologize in advance for any mistakes I make.

I have this dataframe:

import pandas as pd

df = pd.DataFrame(
    data=[['Donald Trump', 'German', '2021-9-23 14:28:00','2021-9-23 14:58:00', 1800 ],
          ['Donald Trump', 'German', '2021-9-23 14:58:01','2021-9-23 15:00:05', 124 ],
          ['Donald Trump', 'German', '2021-9-24 10:05:00','2021-9-24 10:15:30', 630 ],
          ['Monica Lewinsky', 'German', '2021-9-24 10:05:00','2021-9-24 10:05:30', 30 ]],
    columns=['specialist', 'language', 'interval_start', 'interval_end', 'status_duration']
)
df['interval_start'] = pd.to_datetime(df['interval_start'])
df['interval_end'] = pd.to_datetime(df['interval_end'])

The output is:

        specialist language      interval_start        interval_end  status_duration
0     Donald Trump   German 2021-09-23 14:28:00 2021-09-23 14:58:00             1800
1     Donald Trump   German 2021-09-23 14:58:01 2021-09-23 15:00:05              124
2     Donald Trump   German 2021-09-24 10:05:00 2021-09-24 10:15:30              630
3  Monica Lewinsky   German 2021-09-24 10:05:00 2021-09-24 10:05:30               30

The result I want looks like this:

specialist  language    interval    status_duration
0   Donald Trump    German  2021-9-23 14:15:00  120
1   Donald Trump    German  2021-9-23 14:30:00  900
2   Donald Trump    German  2021-9-23 14:45:00  899
3   Donald Trump    German  2021-9-23 15:00:00  5
4   Donald Trump    German  2021-9-24 10:00:00  600
5   Donald Trump    German  2021-9-24 10:15:00  30
6   Monica Lewinsky German  2021-9-24 10:15:00  30

I have this code from another thread:

ref = (df.groupby(["specialist", "Language", pd.Grouper(key="Interval Start", freq="D")], as_index=False)
         .agg(status_duration=("status_duration", lambda d: [*([900]*(d.iat[0]//900)), d.iat[0]%900]),
              Interval=("Interval Start", "first"))
         .explode("status_duration"))

ref["Interval"] = ref["Interval"].dt.floor("15min")+pd.to_timedelta(ref.groupby(ref.index).cumcount()*900, unit="sec")

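But it does not take interval_start into account: I first need to check whether the status_duration stays within the same 15-minute interval. I hope someone can help, because this is a very advanced problem for me and I have been working on it for more than 10 days.

【Comments】:

  • Why is 900 hard-coded? How would we know, when writing the code, that 900 needs to be hard-coded?
  • Because each interval has 900 seconds (15 minutes), and the status duration within one interval cannot exceed that amount.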
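
For what it's worth, the 900 does not have to be a magic number. A minimal sketch, assuming the bucket size is given as a pandas frequency string (the names FREQ and SECONDS_PER_INTERVAL are illustrative and do not appear in the thread):

```python
import pandas as pd

# Bucket size used throughout the thread; change it here to re-bucket differently.
FREQ = "15min"

# Whole seconds in one bucket, derived from the frequency string (15 * 60).
SECONDS_PER_INTERVAL = int(pd.Timedelta(FREQ).total_seconds())

print(SECONDS_PER_INTERVAL)  # 900
```

The literal 900 in the snippets of this thread could then be replaced by that constant.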

Tags: python pandas datetime intervals explode


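【Solution 1】:

After learning a bit more, I came up with another (better) solution using groupby() and explode(). I am adding it as a second answer because the first one, while perhaps a bit complicated, still works, and I also refer to part of it in this answer.


I first add a few new columns that split status_duration into the first slice and the rest, and replace the original value of status_duration with the corresponding 2-element list (if the rest is not positive, the whole duration goes into the first slice):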

df['first'] = ((df['interval_start']+ pd.Timedelta('1sec')).dt.ceil('15min') - df['interval_start']).dt.total_seconds().astype(int)
df['rest'] = df['status_duration'] - df['first']
df['status_duration'] = df[['first','rest']].values.tolist()
df['status_duration'] = df['status_duration'].apply(lambda x: x if x[1] > 0 else [sum(x),0])

This gives you the following prepared dataframe:

        specialist language      interval_start  ... status_duration first  rest
0     Donald Trump   German 2021-09-23 14:28:00  ...     [120, 1680]   120  1680
1     Donald Trump   German 2021-09-23 14:58:01  ...        [119, 5]   119     5
2     Donald Trump   German 2021-09-24 10:05:00  ...       [600, 30]   600    30
3  Monica Lewinsky   German 2021-09-24 10:05:00  ...         [30, 0]   600  -570
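
As a side note on the `+ 1sec` in the computation of `first`: it only makes a difference for calls that start exactly on a 15-minute boundary. A small sketch with one sample timestamp and one hypothetical boundary timestamp (the second value is made up for illustration):

```python
import pandas as pd

starts = pd.Series(pd.to_datetime([
    "2021-09-23 14:28:00",  # starts mid-interval (from the sample data)
    "2021-09-23 14:30:00",  # starts exactly on a boundary (hypothetical)
]))

# Without the shift, a boundary timestamp is its own ceiling, so 0 seconds remain.
naive = (starts.dt.ceil("15min") - starts).dt.total_seconds().astype(int)

# With the shift, the ceiling is the next boundary, so a full 900 seconds remain.
shifted = ((starts + pd.Timedelta("1sec")).dt.ceil("15min")
           - starts).dt.total_seconds().astype(int)

print(naive.tolist())    # [120, 0]
print(shifted.tolist())  # [120, 900]
```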

On this, you can now perform a groupby() and explode() similar to the code in your question. Afterwards, because of the explode(), you round the intervals and group again to merge intervals that have multiple entries. To clean up, I drop the rows with a duration of 0 and reset the index:

ref = (df.groupby(['specialist', 'language', pd.Grouper(key='interval_start', freq='min')], as_index=False)
         .agg(status_duration=('status_duration', lambda d: [d.iat[0][0], *([900]*(d.iat[0][1]//900)), d.iat[0][1]%900]),
              interval_start=('interval_start', 'first'))
         .explode('status_duration'))

ref['interval_start'] = ref['interval_start'].dt.floor('15min')+pd.to_timedelta(ref.groupby(ref.index).cumcount()*900, unit='sec')

ref = ref.groupby(['specialist', 'language', 'interval_start']).sum()
ref = ref[ref.status_duration != 0].reset_index()

This gives you the desired output:

        specialist language      interval_start  status_duration
0     Donald Trump   German 2021-09-23 14:15:00              120
1     Donald Trump   German 2021-09-23 14:30:00              900
2     Donald Trump   German 2021-09-23 14:45:00              899
3     Donald Trump   German 2021-09-23 15:00:00                5
4     Donald Trump   German 2021-09-24 10:00:00              600
5     Donald Trump   German 2021-09-24 10:15:00               30
6  Monica Lewinsky   German 2021-09-24 10:00:00               30
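
To make the explode-and-shift step easier to follow in isolation, here is a small sketch on a single call whose duration has already been split into slices. The frame `demo` and the name `slice_no` are illustrative only:

```python
import pandas as pd

# One call starting at 14:28:00, already split into per-bucket slices.
demo = pd.DataFrame({
    "interval_start": pd.to_datetime(["2021-09-23 14:28:00"]),
    "status_duration": [[120, 900, 780]],
})

# explode() gives each slice its own row but repeats the original index label.
exploded = demo.explode("status_duration")

# cumcount() per index label numbers the slices of one call 0, 1, 2, ...
slice_no = exploded.groupby(exploded.index).cumcount()

# Floor the start to its bucket, then move 900 seconds forward per slice.
exploded["interval_start"] = (
    exploded["interval_start"].dt.floor("15min")
    + pd.to_timedelta(slice_no * 900, unit="s")
)

print(exploded)
```

The three rows end up in the 14:15, 14:30 and 14:45 buckets, which is why a second groupby is needed to merge slices from different calls that land in the same bucket.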

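Note: the problem I describe in the other answer, namely that the final grouping step can produce a status_duration > 900, should not occur with real data, because a specialist should not be able to start a second interval before the first one has ended. So this is a case you should not need to handle at all.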
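
If you would rather verify that assumption than rely on it, a quick check is possible. This is only a sketch, assuming interval_start and interval_end are reliable; the frame `calls` reuses the overlapping variant of the sample data from the other answer, and the column name `overlaps_previous` is made up:

```python
import pandas as pd

calls = pd.DataFrame({
    "specialist": ["Donald Trump", "Donald Trump", "Monica Lewinsky"],
    "interval_start": pd.to_datetime(
        ["2021-09-23 14:28:00", "2021-09-23 14:57:59", "2021-09-24 10:05:00"]),
    "interval_end": pd.to_datetime(
        ["2021-09-23 14:58:00", "2021-09-23 15:00:03", "2021-09-24 10:05:30"]),
})

calls = calls.sort_values(["specialist", "interval_start"])

# End time of the previous call of the same specialist (NaT for the first call).
prev_end = calls.groupby("specialist")["interval_end"].shift()

# A call overlaps if it starts before the previous one has ended.
calls["overlaps_previous"] = calls["interval_start"] < prev_end

print(calls[calls["overlaps_previous"]])
```

Only the second Donald Trump row is flagged here, since it starts one second before the first call ends.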

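【Comments】:

  • Thank you very much buddlemat, I can say for sure that this was an advanced problem, but you solved it like a king.
  • Glad I could help! I learned a lot along the way... ;)
  • Hello buddemat, would you like another, simpler challenge? I tried to change your code for something similar, but I can't get it to explode: stackoverflow.com/questions/69586053/…
【Solution 2】:

Not sure whether this is unnecessarily complicated, but it does get the job done. There may well be a better, more Pythonic way, though...

I first add a few new columns to df: the number of intervals suggested by status_duration, the number of seconds that fit into the first interval, and the remaining duration: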

df['len'] = 1 + (df['status_duration']-1)//900

df['first'] = ((df['interval_start']+timedelta(seconds=1)).dt.ceil('15min') - df['interval_start']).dt.total_seconds().astype(int)

df['rest'] = df['status_duration'] - df['first']

Then we add one extra interval for every row that has a positive rest and a partial first slice:

df['len'] = np.where((df['rest'] > 0) & (df['first'] < 900), df['len'] + 1, df['len'])
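
Traced on two of the sample rows, these columns come out as follows. This is just a worked example of the lines above on a throwaway frame named `demo`; it says nothing about inputs other than these two rows:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "interval_start": pd.to_datetime(["2021-09-23 14:28:00", "2021-09-24 10:05:00"]),
    "status_duration": [1800, 30],
})

# Buckets needed if the call started exactly on a boundary: 2 and 1.
demo["len"] = 1 + (demo["status_duration"] - 1) // 900

# Seconds until the next 15-minute boundary: 120 and 600.
demo["first"] = ((demo["interval_start"] + pd.Timedelta("1sec")).dt.ceil("15min")
                 - demo["interval_start"]).dt.total_seconds().astype(int)

# What is left after the first slice: 1680 and -570.
demo["rest"] = demo["status_duration"] - demo["first"]

# Positive rest plus a partial first slice: one more bucket, giving 3 and 1.
demo["len"] = np.where((demo["rest"] > 0) & (demo["first"] < 900),
                       demo["len"] + 1, demo["len"])

print(demo[["status_duration", "first", "rest", "len"]])
```

The 1800-second call touches the 14:15, 14:30 and 14:45 buckets, hence 3; the 30-second call stays within one bucket.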

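Now I replicate the rows by building a new dataframe, using np.repeat() so that each row appears once per interval, and list comprehensions over df.iterrows() to build the interval_start and status_duration columns: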

new_df = pd.DataFrame({'specialist': np.repeat(df['specialist'], df['len']),
                 'language': np.repeat(df['language'], df['len']),
                 'interval_start': [el for sublist in [[x['interval_start'] + timedelta(minutes=15*y) for y in range(0, x['len'])] if (x['len'] > 1) else [x['interval_start']] for i, x in df.iterrows()] for el in sublist],
                 'status_duration': [el for sublist in [([x['first']]+[900]*(x['len']-2)+[x['rest']%900]) if x['len'] > 1 else [x['status_duration']] for i, x in df.iterrows()] for el in sublist]
})

Then we round the interval start times down to the 15-minute boundary:

new_df['interval_start'] = new_df['interval_start'].dt.floor('15min')

All that is left to do now is to group and reset the index:

new_df = new_df.groupby(['specialist', 'language', 'interval_start']).sum().reset_index()

Result:

        specialist language      interval_start  status_duration
0     Donald Trump   German 2021-09-23 14:15:00              120
1     Donald Trump   German 2021-09-23 14:30:00              900
2     Donald Trump   German 2021-09-23 14:45:00              899
3     Donald Trump   German 2021-09-23 15:00:00                5
4     Donald Trump   German 2021-09-24 10:00:00              600
5     Donald Trump   German 2021-09-24 10:15:00               30
6  Monica Lewinsky   German 2021-09-24 10:00:00               30

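One problem remains: the final grouping step can again produce 15-minute intervals with status_duration > 900.

Suppose the second row of your input data had an interval_start that is 2 seconds earlier: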

        specialist language      interval_start        interval_end  status_duration
0     Donald Trump   German 2021-09-23 14:28:00 2021-09-23 14:58:00             1800
1     Donald Trump   German 2021-09-23 14:57:59 2021-09-23 15:00:03              124
2     Donald Trump   German 2021-09-24 10:05:00 2021-09-24 10:15:30              630 
3  Monica Lewinsky   German 2021-09-24 10:05:00 2021-09-24 10:05:30               30 

Then you would get a status_duration of 901 after grouping:

        specialist language      interval_start  status_duration
0     Donald Trump   German 2021-09-23 14:15:00              120
1     Donald Trump   German 2021-09-23 14:30:00              900
2     Donald Trump   German 2021-09-23 14:45:00              901
3     Donald Trump   German 2021-09-23 15:00:00                3
4     Donald Trump   German 2021-09-24 10:00:00              600
5     Donald Trump   German 2021-09-24 10:15:00               30
6  Monica Lewinsky   German 2021-09-24 10:00:00               30

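This is complicated by the fact that such an "overflow" can happen more than once. One approach is to repeat the steps above until no rows with status_duration > 900 remain in new_df. This carries the overflow over into the following interval.

Full example: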

import pandas as pd
import numpy as np
from datetime import timedelta

input_df = pd.DataFrame(
    data=[['Donald Trump', 'German', '2021-9-23 14:28:00','2021-9-23 14:58:00', 1800 ],
          ['Donald Trump', 'German', '2021-9-23 14:57:59','2021-9-23 15:00:03', 124 ],
          ['Donald Trump', 'German', '2021-9-24 10:05:00','2021-9-24 10:15:30', 630 ],
          ['Monica Lewinsky', 'German', '2021-9-24 10:05:00','2021-9-24 10:05:30', 30 ]],
    columns=['specialist', 'language', 'interval_start', 'interval_end', 'status_duration']
)
input_df['interval_start'] = pd.to_datetime(input_df['interval_start'])
input_df['interval_end'] = pd.to_datetime(input_df['interval_end'])

def build_df(df):
    while df['status_duration'].gt(900).any():
        df['len'] = 1 + (df['status_duration']-1)//900
        df['first'] = ((df['interval_start']+timedelta(seconds=1)).dt.ceil('15min') - df['interval_start']).dt.total_seconds().astype(int)
        df['rest'] = df['status_duration'] - df['first']
        df['len'] = np.where((df['rest'] > 0) & (df['first'] < 900), df['len'] + 1, df['len'])
        new_df = pd.DataFrame({'specialist': np.repeat(df['specialist'], df['len']),
                 'language': np.repeat(df['language'], df['len']),
                 'interval_start': [el for sublist in [[x['interval_start'] + timedelta(minutes=15*y) for y in range(0, x['len'])] if (x['len'] > 1) else [x['interval_start']] for i, x in df.iterrows()] for el in sublist],
                 'status_duration': [el for sublist in [([x['first']]+[900]*(x['len']-2)+[x['rest']%900]) if x['len'] > 1 else [x['status_duration']] for i, x in df.iterrows()] for el in sublist]
        })
        new_df['interval_start'] = new_df['interval_start'].dt.floor('15min')
        new_df = new_df[new_df.status_duration != 0]
        new_df = new_df.groupby(['specialist', 'language', 'interval_start']).sum().reset_index()
        df = new_df.copy()
    return df

output_df = build_df(input_df)

Result:

        specialist language      interval_start  status_duration
0     Donald Trump   German 2021-09-23 14:15:00              120
1     Donald Trump   German 2021-09-23 14:30:00              900
2     Donald Trump   German 2021-09-23 14:45:00              900
3     Donald Trump   German 2021-09-23 15:00:00                4
4     Donald Trump   German 2021-09-24 10:00:00              600
5     Donald Trump   German 2021-09-24 10:15:00               30
6  Monica Lewinsky   German 2021-09-24 10:00:00               30
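
A cheap sanity check on a bucketed result is to confirm that no seconds were lost or invented and that no bucket exceeds the limit. The helper `check_buckets` below is a hypothetical sketch, not part of the code above; the two lists are simply the status_duration values of the modified input and of the result table:

```python
def check_buckets(input_durations, output_durations, max_seconds=900):
    """Return (totals_match, none_over_limit) for a bucketed result."""
    # The bucketed seconds should add up to the seconds that went in.
    totals_match = int(sum(input_durations)) == int(sum(output_durations))
    # No single bucket may hold more than max_seconds.
    none_over_limit = all(d <= max_seconds for d in output_durations)
    return totals_match, none_over_limit


# status_duration of the modified input and of the result shown above
input_durations = [1800, 124, 630, 30]
output_durations = [120, 900, 900, 4, 600, 30, 30]

print(check_buckets(input_durations, output_durations))  # (True, True)
```

With dataframes, the same call works with `input_df['status_duration']` and `output_df['status_duration']`. It seems worth running on real data, since the slicing arithmetic has only been exercised on the sample rows here.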

Looking at it now, I guess there should be a simpler way, but this is all I've got...
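
For comparison, one possibly simpler route is to ignore status_duration and compute each call's overlap with every 15-minute bucket directly from interval_start and interval_end. This is a different technique from both answers, sketched under the loud assumption that interval_end - interval_start is the true duration (which holds for the sample data); the function name `split_into_buckets` is made up:

```python
import pandas as pd

def split_into_buckets(df, freq="15min"):
    """Split each [interval_start, interval_end) span into freq-sized buckets."""
    step = pd.Timedelta(freq)
    rows = []
    for row in df.itertuples(index=False):
        # Bucket starts the call can touch, beginning with the bucket of its start.
        buckets = pd.date_range(row.interval_start.floor(freq), row.interval_end, freq=freq)
        for bucket_start in buckets:
            # Overlap of the call with [bucket_start, bucket_start + step).
            seconds = (min(row.interval_end, bucket_start + step)
                       - max(row.interval_start, bucket_start)).total_seconds()
            if seconds > 0:
                rows.append((row.specialist, row.language, bucket_start, int(seconds)))
    out = pd.DataFrame(rows, columns=["specialist", "language",
                                      "interval_start", "status_duration"])
    # Merge slices of different calls that fall into the same bucket.
    return out.groupby(["specialist", "language", "interval_start"],
                       as_index=False)["status_duration"].sum()


df = pd.DataFrame(
    data=[["Donald Trump", "German", "2021-9-23 14:28:00", "2021-9-23 14:58:00", 1800],
          ["Donald Trump", "German", "2021-9-23 14:58:01", "2021-9-23 15:00:05", 124],
          ["Donald Trump", "German", "2021-9-24 10:05:00", "2021-9-24 10:15:30", 630],
          ["Monica Lewinsky", "German", "2021-9-24 10:05:00", "2021-9-24 10:05:30", 30]],
    columns=["specialist", "language", "interval_start", "interval_end", "status_duration"],
)
df["interval_start"] = pd.to_datetime(df["interval_start"])
df["interval_end"] = pd.to_datetime(df["interval_end"])

result = split_into_buckets(df)
print(result)
```

On the question's sample data this yields the same seven rows as the answers. Note that overlapping calls of one specialist would still add up to more than 900 seconds per bucket here, so the sketch does not replace the carry-over loop for that case.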

【Comments】: