【Question Title】: Pandas dataframe conditional cumulative sum based on date range
【Posted】: 2020-10-13 01:57:43
【Question】:

I have a pandas DataFrame:

         Date            Party    Status
-------------------------------------------
0        01-01-2018      John     Sent
1        13-01-2018      Lisa     Received
2        15-01-2018      Will     Received
3        19-01-2018      Mark     Sent
4        02-02-2018      Will     Sent
5        28-02-2018      John     Received

I would like to add new columns that perform a .cumsum(), but conditional on the date. The result would look like this:

                                                Num of Sent         Num of Received
         Date            Party    Status        in Past 30 Days     in Past 30 Days
-----------------------------------------------------------------------------------
0        01-01-2018      John     Sent          1                   0
1        13-01-2018      Lisa     Received      1                   1
2        15-01-2018      Will     Received      1                   2
3        19-01-2018      Mark     Sent          2                   2
4        02-02-2018      Will     Sent          2                   2
5        28-02-2018      John     Received      1                   1

I managed to get what I need with the following code:

def inner_func(date_var, status_var, date_array, status_array):
    sent_increment = 0
    received_increment = 0

    for k in range(0, len(date_array)):
        if((date_var - date_array[k]).days <= 30):
            if(status_array[k] == "Sent"):
                sent_increment += 1
            elif(status_array[k] == "Received"):
                received_increment += 1

    return sent_increment, received_increment
import pandas as pd

# The dates are written day-first (DD-MM-YYYY), so say so explicitly when parsing.
df = pd.DataFrame({"Date": pd.to_datetime(["01-01-2018", "13-01-2018", "15-01-2018", "19-01-2018", "02-02-2018", "28-02-2018"], dayfirst=True),
                   "Party": ["John", "Lisa", "Will", "Mark", "Will", "John"],
                   "Status": ["Sent", "Received", "Received", "Sent", "Sent", "Received"]})

df = df.sort_values("Date")
date_array = []
status_array = []

for i in range(0, len(df)):
        date_var = df.loc[i,"Date"]
        date_array.append(date_var)
        status_var = df.loc[i,"Status"]
        status_array.append(status_var)
        sent_count, received_count = inner_func(date_var, status_var, date_array, status_array)
        df.loc[i, "Num of Sent in Past 30 days"] = sent_count
        df.loc[i, "Num of Received in Past 30 days"] = received_count

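However, when df is large this is computationally expensive and slow, because the nested loop walks the DataFrame twice. Is there a more Pythonic way to achieve this without iterating over the DataFrame the way I am doing?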

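Update 2

Michael provided the solution I was looking for (see Solution 1 below). Suppose I want to apply that solution to a groupby object, for example using the rolling solution to compute the cumulative sums for each party: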

                                                Sent past 30       Received past 30
         Date            Party    Status        days by party      days by party
-----------------------------------------------------------------------------------
0        01-01-2018      John     Sent          1                   0
1        13-01-2018      Lisa     Received      0                   1
2        15-01-2018      Will     Received      0                   1
3        19-01-2018      Mark     Sent          1                   0
4        02-02-2018      Will     Sent          1                   1
5        28-02-2018      John     Received      0                   1

I tried to reproduce the solution with the groupby approach below:

l = []
grp_obj = df.groupby("Party")
grp_obj.rolling('30D',  min_periods=1)["dummy"].apply(lambda x: l.append(x.value_counts()) or 0)
df.reset_index(inplace=True)

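But I end up with incorrect values. I know this is because concat combines the DataFrames without regard to their indexes, and groupby orders the data differently. Is there a way to modify the list appending so that it includes the original index, so that I can merge/join the value_counts DataFrame back onto the original index?

【Question Discussion】:

    Tags: python pandas dataframe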


    【Solution 1】:

    If you set Date as the index and temporarily convert Status to a categorical, you can use DataFrame.rolling with a small trick:

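    df = df.set_index('Date')
    # Category codes follow alphabetical order: Received -> 0, Sent -> 1
    df['dummy'] = df['Status'].astype('category').cat.codes
    l = []
    # apply() is used only for its side effect: collect each window's value_counts
    df.rolling('30D', min_periods=1)['dummy'].apply(lambda x: l.append(x.value_counts()) or 0)
    df.reset_index(inplace=True)
    pd.concat(
        [df,
        (pd.DataFrame(l)
            .reset_index(drop=True)  # make sure row labels are 0..n-1 so they line up with df
            .rename(columns={1.0: "Sent past 30 Days", 0.0: "Received past 30 Days"})
            .fillna(0)
            .astype('int'))
        ], axis=1).drop(columns='dummy')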
    

    Output:

            Date Party    Status  Received past 30 Days  Sent past 30 Days
    0 2018-01-01  John      Sent                      0                  1
    1 2018-01-13  Lisa  Received                      1                  1
    2 2018-01-15  Will  Received                      2                  1
    3 2018-01-19  Mark      Sent                      2                  2
    4 2018-02-02  Will      Sent                      2                  2
    5 2018-02-28  John  Received                      1                  1
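
The same counts can also be obtained without collecting value_counts in a side-effect list. The following is a minimal sketch, not the answer's method: it assumes one 0/1 indicator column per status and a time-based rolling sum over them. The name `flags` is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime(
        ["01-01-2018", "13-01-2018", "15-01-2018",
         "19-01-2018", "02-02-2018", "28-02-2018"],
        dayfirst=True,  # the strings are DD-MM-YYYY
    ),
    "Party": ["John", "Lisa", "Will", "Mark", "Will", "John"],
    "Status": ["Sent", "Received", "Received", "Sent", "Sent", "Received"],
}).sort_values("Date")

# One 0/1 indicator column per status, indexed by Date so that a
# time-based window ("30D") can be used. `flags` is a hypothetical name.
flags = pd.DataFrame(
    {
        "Sent past 30 Days": (df["Status"] == "Sent").to_numpy(dtype="int64"),
        "Received past 30 Days": (df["Status"] == "Received").to_numpy(dtype="int64"),
    },
    index=pd.DatetimeIndex(df["Date"]),
)

# Summing the indicators inside each window counts the matching rows.
counts = flags.rolling("30D").sum().astype("int64")

# Attach positionally: `flags` was built in the same row order as `df`.
for col in counts.columns:
    df[col] = counts[col].to_numpy()

print(df)
```

One difference from the loop in the question: by default rolling("30D") leaves the window open on the left, so a row that is exactly 30 days older is excluded, whereas the loop's `<= 30` test includes it. Passing closed="both" to rolling should make the two agree on that boundary; the sample data has no such row, so the results match either way.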
    

    Preserving the original index to allow a later merge

    Adjust the data slightly so that the Date index is in a different order:

    # ISO dates, matching the output shown below (rows are deliberately not in date order)
    df = pd.DataFrame({"Date": pd.to_datetime(["2018-01-01", "2018-01-13", "2018-03-01", "2018-01-19", "2018-08-02", "2018-02-22"]),
                       "Party": ["John", "Lisa", "Will", "Mark", "Will", "John"],
                       "Status": ["Sent", "Received", "Received", "Sent", "Sent", "Received"]})
    df
    

    Output:

            Date Party    Status
    0 2018-01-01  John      Sent
    1 2018-01-13  Lisa  Received
    2 2018-03-01  Will  Received
    3 2018-01-19  Mark      Sent
    4 2018-08-02  Will      Sent
    5 2018-02-22  John  Received
    

    After sorting by Date, store the original index, operate on the Date-sorted DataFrame, then restore the index:

    df = df.sort_values('Date')
    df = df.reset_index()  # keeps the original index as a column named 'index'
    df = df.set_index('Date')
    df['dummy'] = df['Status'].astype('category').cat.codes
    l = []
    df.rolling('30D', min_periods=1)['dummy'].apply(lambda x: l.append(x.value_counts()) or 0)
    df.reset_index(inplace=True)
    df = pd.concat(
          [df,
          (pd.DataFrame(l)
              .reset_index(drop=True)  # make sure row labels are 0..n-1 so they line up with df
              .rename(columns={1.0: "Sent past 30 Days", 0.0: "Received past 30 Days"})
              .fillna(0)
              .astype('int'))
          ], axis=1).drop(columns='dummy')
    df.set_index('index')
    

    Output:

                Date Party    Status  Received past 30 Days  Sent past 30 Days
    index                                                                     
    0     2018-01-01  John      Sent                      0                  1
    1     2018-01-13  Lisa  Received                      1                  1
    3     2018-01-19  Mark      Sent                      1                  2
    5     2018-02-22  John  Received                      1                  0
    2     2018-03-01  Will  Received                      2                  0
    4     2018-08-02  Will      Sent                      0                  1
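
The store-and-restore idea can also be written so that the original row order is never lost. This is a sketch under assumptions, not the answer's code: it uses indicator columns instead of value_counts, and the names `by_date` and `flags` are invented for illustration. The dates are the ones shown in the output above.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime(["2018-01-01", "2018-01-13", "2018-03-01",
                            "2018-01-19", "2018-08-02", "2018-02-22"]),
    "Party": ["John", "Lisa", "Will", "Mark", "Will", "John"],
    "Status": ["Sent", "Received", "Received", "Sent", "Sent", "Received"],
})

# Sorting keeps the original labels (0..5) attached to each row.
by_date = df.sort_values("Date")

# 0/1 indicators indexed by Date, needed for the time-based window.
flags = pd.DataFrame(
    {
        "Sent past 30 Days": (by_date["Status"] == "Sent").to_numpy(dtype="int64"),
        "Received past 30 Days": (by_date["Status"] == "Received").to_numpy(dtype="int64"),
    },
    index=pd.DatetimeIndex(by_date["Date"]),
)
counts = flags.rolling("30D").sum().astype("int64")

# Swap the dates back for the original labels, then join on the index.
counts.index = by_date.index
result = df.join(counts)  # rows stay in the original order 0..5

print(result)
```

Because join aligns on index labels, the counts land on the correct rows regardless of the order in which they were computed.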
    

    Counting values per group

    First sort by Party and Date so that the appended group counts come out in the right order:

    # The dates are written day-first (DD-MM-YYYY), so say so explicitly when parsing.
    df = pd.DataFrame({"Date": pd.to_datetime(["01-01-2018", "13-01-2018", "15-01-2018", "19-01-2018", "02-02-2018", "28-02-2018"], dayfirst=True),
                       "Party": ["John", "Lisa", "Will", "Mark", "Will", "John"],
                       "Status": ["Sent", "Received", "Received", "Sent", "Sent", "Received"]})
    df = df.sort_values(['Party','Date'])
    

    Reset the index before concat so that the counts attach to the correct rows:

    df = df.set_index('Date')
    df['dummy'] = df['Status'].astype('category').cat.codes
    l = []
    # Windows are visited group by group, in the same (Party, Date) order as the sorted df
    df.groupby('Party').rolling('30D', min_periods=1)['dummy'].apply(lambda x: l.append(x.value_counts()) or 0)
    df.reset_index(inplace=True)
    
    pd.concat(
          [df,
          (pd.DataFrame(l)
              .reset_index(drop=True)  # make sure row labels are 0..n-1 so they line up with df
              .rename(columns={1.0: "Sent past 30 Days", 0.0: "Received past 30 Days"})
              .fillna(0)
              .astype('int'))
          ], axis=1).drop(columns='dummy').sort_values('Date')
    

    Output:

            Date Party    Status  Received past 30 Days  Sent past 30 Days
    0 2018-01-01  John      Sent                      0                  1
    2 2018-01-13  Lisa  Received                      1                  0
    4 2018-01-15  Will  Received                      1                  0
    3 2018-01-19  Mark      Sent                      0                  1
    5 2018-02-02  Will      Sent                      1                  1
    1 2018-02-28  John  Received                      1                  0
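
For the per-party case, the counts can be tied to the original index explicitly instead of relying on sort order. The sketch below is an assumption-laden alternative rather than the answer's approach: it rolls over each group separately, restores the group's original labels, and joins on them. The names `flags`, `cols` and `parts` are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime(
        ["01-01-2018", "13-01-2018", "15-01-2018",
         "19-01-2018", "02-02-2018", "28-02-2018"],
        dayfirst=True,  # the strings are DD-MM-YYYY
    ),
    "Party": ["John", "Lisa", "Will", "Mark", "Will", "John"],
    "Status": ["Sent", "Received", "Received", "Sent", "Sent", "Received"],
})

cols = ["Sent past 30 Days", "Received past 30 Days"]
flags = pd.DataFrame({
    "Date": df["Date"],
    "Party": df["Party"],
    cols[0]: (df["Status"] == "Sent").astype("int64"),
    cols[1]: (df["Status"] == "Received").astype("int64"),
}).sort_values("Date")  # each group must be in date order for the time window

parts = []
for _, grp in flags.groupby("Party"):
    # Time-based rolling sum within a single party.
    rolled = grp.set_index("Date")[cols].rolling("30D").sum()
    # Put the group's original row labels back so the result can be joined.
    rolled.index = grp.index
    parts.append(rolled)

counts = pd.concat(parts).astype("int64")
result = df.join(counts)  # aligned on the original index, not on position

print(result)
```

Since the join is label-based, the order in which groupby yields the groups does not affect which row each count ends up on.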
    

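    Microbenchmark

    Since this solution also iterates over the dataset, I compared the run times of the two approaches. Only very small datasets were used, because the run time of the original solution grows quickly.

    Results (runtime plot of forloop vs. rolling; image not preserved)

    Code to reproduce the benchmark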

    import pandas as pd
    import perfplot
    
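    def makedata(n=1):
      # Dates are day-first (DD-MM-YYYY); the frame has 6 * n rows
      df = pd.DataFrame({"Date": pd.to_datetime(["01-01-2018", "13-01-2018", "15-01-2018", "19-01-2018", "02-02-2018", "28-02-2018"]*n, dayfirst=True),
                       "Party": ["John", "Lisa", "Will", "Mark", "Will", "John"]*n,
                       "Status": ["Sent", "Received", "Received", "Sent", "Sent", "Received"]*n})
    
      # Reset the index so that forloop's df.loc[i, ...] visits rows in date order
      return df.sort_values("Date").reset_index(drop=True)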
    
    def rolling(df):
      df = df.set_index('Date')
      df['dummy'] = df['Status'].astype('category').cat.codes
      l = []
      df.rolling('30D', min_periods=1)['dummy'].apply(lambda x: l.append(x.value_counts()) or 0)
      df.reset_index(inplace=True)
      return pd.concat(
          [df,
          (pd.DataFrame(l)
              .reset_index(drop=True)  # make sure row labels are 0..n-1 so they line up with df
              .rename(columns={1.0: "Sent past 30 Days", 0.0: "Received past 30 Days"})
              .fillna(0)
              .astype('int'))
          ], axis=1).drop(columns='dummy')
    
    def forloop(df):
      date_array = []
      status_array = []
      def inner_func(date_var, status_var, date_array, status_array):
          sent_increment = 0
          received_increment = 0
    
          for k in range(0, len(date_array)):
              if((date_var - date_array[k]).days <= 30):
                  if(status_array[k] == "Sent"):
                      sent_increment += 1
                  elif(status_array[k] == "Received"):
                      received_increment += 1
    
          return sent_increment, received_increment
    
      for i in range(0, len(df)):
              date_var = df.loc[i,"Date"]
              date_array.append(date_var)
              status_var = df.loc[i,"Status"]
              status_array.append(status_var)
              sent_count, received_count = inner_func(date_var, status_var, date_array, status_array)
              df.loc[i, "Num of Sent in Past 30 days"] = sent_count
              df.loc[i, "Num of Received in Past 30 days"] = received_count
      return df
    
    perfplot.show(
        setup=makedata,
        kernels=[forloop, rolling],
        n_range=[x for x in range(5, 105, 5)],
        equality_check=None,
        xlabel='n (len(df) = 6 * n)'
    )
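
If perfplot is not available, a rough comparison can be made with the standard library's timeit. This is only a sketch: `make_data`, `loop_counts` and `rolling_counts` are simplified stand-ins written for illustration (the rolling variant uses indicator columns rather than the value_counts trick), and the timings will vary by machine.

```python
import timeit

import pandas as pd

BASE_DATES = ["01-01-2018", "13-01-2018", "15-01-2018",
              "19-01-2018", "02-02-2018", "28-02-2018"]
BASE_STATUS = ["Sent", "Received", "Received", "Sent", "Sent", "Received"]


def make_data(n=1):
    """Build a date-sorted frame with 6 * n rows and a clean 0..N-1 index."""
    df = pd.DataFrame({
        "Date": pd.to_datetime(BASE_DATES * n, dayfirst=True),
        "Status": BASE_STATUS * n,
    })
    return df.sort_values("Date").reset_index(drop=True)


def loop_counts(df):
    """Nested-loop counting, equivalent in spirit to the question's code."""
    dates = df["Date"].tolist()
    statuses = df["Status"].tolist()
    sent, received = [], []
    for i, current in enumerate(dates):
        s = r = 0
        for k in range(i + 1):  # only rows seen so far
            if (current - dates[k]).days <= 30:
                if statuses[k] == "Sent":
                    s += 1
                elif statuses[k] == "Received":
                    r += 1
        sent.append(s)
        received.append(r)
    return sent, received


def rolling_counts(df):
    """Time-based rolling sum over 0/1 indicator columns."""
    flags = pd.DataFrame(
        {
            "sent": (df["Status"] == "Sent").to_numpy(dtype="int64"),
            "received": (df["Status"] == "Received").to_numpy(dtype="int64"),
        },
        index=pd.DatetimeIndex(df["Date"]),
    )
    counts = flags.rolling("30D").sum().astype("int64")
    return counts["sent"].tolist(), counts["received"].tolist()


# Sanity check on the 6-row sample, where both variants should agree.
small = make_data(1)
same_on_small = loop_counts(small) == rolling_counts(small)

data = make_data(20)  # 120 rows
t_loop = timeit.timeit(lambda: loop_counts(data), number=3)
t_roll = timeit.timeit(lambda: rolling_counts(data), number=3)
print(f"agree on sample: {same_on_small}")
print(f"loop: {t_loop:.4f}s  rolling: {t_roll:.4f}s  (rows={len(data)})")
```

The agreement check is done only on the 6-row sample, since repeated timestamps and rows exactly 30 days apart may be treated differently by the two variants.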
    

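    【Discussion】:

    • This is a very good answer. Is there a way to append items to the "l" list according to the original index rather than date order, or to append the original index to a separate list? I am applying rolling to a groupby to derive new parameters from each party's history, so I am trying to keep the original index for a join/merge.
    • I think a simple way to keep the original index is to store it and reindex after the operation. I added an example for that requirement.
    • Sorry, I worded that poorly. Keeping the index itself is simple enough, but keeping the index of a grouped-by object is not. I updated my question to include the extension I am trying to implement.
    • Added this requirement as well. Please consider asking further requirements in a separate question. This answer is becoming very cluttered and hard to follow for other readers.
    • Thanks a lot! Not sure how I didn't think of sorting by two columns.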