【Question title】: Split date range rows into years (ungroup) - Python Pandas
【Posted】: 2019-11-08 08:29:19
【Question description】:

I have a DataFrame like this:

    Start date  end date        A    B
    01.01.2020  30.06.2020      2    3
    01.01.2020  31.12.2020      3    1
    01.04.2020  30.04.2020      6    2
    01.01.2021  31.12.2021      2    3
    01.07.2020  31.12.2020      8    2
    01.01.2020  31.12.2023      1    2
    .......

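I want to split the rows where end - start > 1 year (see the last row, with end = 2023 and start = 2020), keeping the value in column A the same while splitting the value in column B proportionally: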

    Start date  end date        A    B
    01.01.2020  30.06.2020      2    3
    01.01.2020  31.12.2020      3    1
    01.04.2020  30.04.2020      6    2
    01.01.2021  31.12.2021      2    3
    01.07.2020  31.12.2020      8    2
    01.01.2020  31.12.2020      1    2/4
    01.01.2021  31.12.2021      1    2/4
    01.01.2022  31.12.2022      1    2/4
    01.01.2023  31.12.2023      1    2/4
    .......

Any ideas?

【Question discussion】:

    Tags: python pandas date dataframe


    【Solution 1】:

    Here is my solution. See the comments below:

    import io
    import pandas as pd
    
    # TEST DATA:
    text="""     start         end      A      B 
            01.01.2020  30.06.2020      2      3 
            01.01.2020  31.12.2020      3      1 
            01.04.2020  30.04.2020      6      2 
            01.01.2021  31.12.2021      2      3 
            01.07.2020  31.12.2020      8      2
            31.12.2020  20.01.2021     12     12
            31.12.2020  01.01.2021     22     22
            30.12.2020  01.01.2021     32     32
            10.05.2020  28.09.2023     44     44
            27.11.2020  31.12.2023     88     88
            31.12.2020  31.12.2023    100    100
            01.01.2020  31.12.2021    200    200
          """
    
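    # The dates are day-first (dd.mm.yyyy), so pass dayfirst=True; without it pandas may read
    # 01.04.2020 as January 4, which is how the output listed below was produced.
    df= pd.read_csv(io.StringIO(text), sep=r"\s+", engine="python", parse_dates=[0,1], dayfirst=True)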
    #print("\n----\n df:",df)
    
    #----------------------------------------
    # SOLUTION:
    
    def split_years(r):
        """
            Split row 'r' when the year of "end" is greater than the year of "start".
            The new rows repeat the value of 'A'; 'B' is divided equally by the number of new rows.
            Return: a DataFrame with one row per calendar year, or None if both dates fall in the same year.
        """
        t1,t2 = r["start"], r["end"]
        ys= t2.year - t1.year
        kk= 0 if t1.is_year_end else 1
        if ys>0:
            l1=[t1] + [ t1+pd.offsets.YearBegin(i) for i in range(1,ys+1) ]
            l2=[ t1+pd.offsets.YearEnd(i) for i in range(kk,ys+kk) ] + [t2]
            return pd.DataFrame({"start":l1, "end":l2, "A":r.A,"B": r.B/len(l1)})
        print("year difference <= 0!")
        return None
    
    
    # Create two groups, one for rows where the 'start' and 'end' is in the same year, and one for the others:
    grps= df.groupby(lambda idx: (df.loc[idx,"start"].year-df.loc[idx,"end"].year)!=0 ).groups 
    print("\n---- grps:\n",grps)
    
    # Extract the "one year" rows in a data frame:
    df1= df.loc[grps[False]]
    #print("\n---- df1:\n",df1)
    
    # Extract the rows to be split:
    df2= df.loc[grps[True]]
    print("\n---- df2:\n",df2)
    
    # Split the rows and put the resulting data frames into a list:
    ldfs=[ split_years(df2.loc[row]) for row in df2.index ]
    print("\n---- ldfs:")
    for fr in ldfs:
        print(fr,"\n")
    
    # Insert the "one year" data frame to the list, and concatenate them:    
    ldfs.insert(0,df1)
    df_rslt= pd.concat(ldfs,sort=False)
    #print("\n---- df_rslt:\n",df_rslt)
    
    # Housekeeping:
    df_rslt= df_rslt.sort_values("start").reset_index(drop=True)
    print("\n---- df_rslt:\n",df_rslt)
    

    Output:

    ---- grps:
     {False: Int64Index([0, 1, 2, 3, 4], dtype='int64'), True: Int64Index([5, 6, 7, 8, 9, 10, 11], dtype='int64')}
    
    ---- df2:
             start        end    A    B
    5  2020-12-31 2021-01-20   12   12
    6  2020-12-31 2021-01-01   22   22
    7  2020-12-30 2021-01-01   32   32
    8  2020-10-05 2023-09-28   44   44
    9  2020-11-27 2023-12-31   88   88
    10 2020-12-31 2023-12-31  100  100
    11 2020-01-01 2021-12-31  200  200
    
    ---- ldfs:
           start        end   A    B
    0 2020-12-31 2020-12-31  12  6.0
    1 2021-01-01 2021-01-20  12  6.0 
    
           start        end   A     B
    0 2020-12-31 2020-12-31  22  11.0
    1 2021-01-01 2021-01-01  22  11.0 
    
           start        end   A     B
    0 2020-12-30 2020-12-31  32  16.0
    1 2021-01-01 2021-01-01  32  16.0 
    
           start        end   A     B
    0 2020-10-05 2020-12-31  44  11.0
    1 2021-01-01 2021-12-31  44  11.0
    2 2022-01-01 2022-12-31  44  11.0
    3 2023-01-01 2023-09-28  44  11.0 
    
           start        end   A     B
    0 2020-11-27 2020-12-31  88  22.0
    1 2021-01-01 2021-12-31  88  22.0
    2 2022-01-01 2022-12-31  88  22.0
    3 2023-01-01 2023-12-31  88  22.0 
    
           start        end    A     B
    0 2020-12-31 2020-12-31  100  25.0
    1 2021-01-01 2021-12-31  100  25.0
    2 2022-01-01 2022-12-31  100  25.0
    3 2023-01-01 2023-12-31  100  25.0 
    
           start        end    A      B
    0 2020-01-01 2020-12-31  200  100.0
    1 2021-01-01 2021-12-31  200  100.0 
    
    
    ---- df_rslt:
             start        end    A      B
    0  2020-01-01 2020-06-30    2    3.0
    1  2020-01-01 2020-12-31    3    1.0
    2  2020-01-01 2020-12-31  200  100.0
    3  2020-01-04 2020-04-30    6    2.0
    4  2020-01-07 2020-12-31    8    2.0
    5  2020-10-05 2020-12-31   44   11.0
    6  2020-11-27 2020-12-31   88   22.0
    7  2020-12-30 2020-12-31   32   16.0
    8  2020-12-31 2020-12-31   12    6.0
    9  2020-12-31 2020-12-31  100   25.0
    10 2020-12-31 2020-12-31   22   11.0
    11 2021-01-01 2021-12-31  100   25.0
    12 2021-01-01 2021-12-31   88   22.0
    13 2021-01-01 2021-12-31   44   11.0
    14 2021-01-01 2021-01-01   32   16.0
    15 2021-01-01 2021-01-01   22   11.0
    16 2021-01-01 2021-01-20   12    6.0
    17 2021-01-01 2021-12-31    2    3.0
    18 2021-01-01 2021-12-31  200  100.0
    19 2022-01-01 2022-12-31   88   22.0
    20 2022-01-01 2022-12-31  100   25.0
    21 2022-01-01 2022-12-31   44   11.0
    22 2023-01-01 2023-09-28   44   11.0
    23 2023-01-01 2023-12-31   88   22.0
    24 2023-01-01 2023-12-31  100   25.0
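
The row-by-row split above can also be sketched in a vectorized form. The following is a minimal sketch, not the answer author's code: it assumes the columns are named `start`, `end`, `A` and `B`, that the dates are already parsed, and that `B` is divided equally among the calendar years a row touches. The function name `split_by_calendar_year` and the `demo` frame are made up for illustration.

```python
import pandas as pd

def split_by_calendar_year(df):
    """Return one row per calendar year touched by each [start, end] range.

    'A' is repeated on every piece and 'B' is divided equally among the pieces.
    """
    out = df.copy()
    # One list of years per row, e.g. 2020..2023 -> [2020, 2021, 2022, 2023].
    out["year"] = [list(range(s.year, e.year + 1))
                   for s, e in zip(out["start"], out["end"])]
    # Equal share of B for each piece of the row.
    out["B"] = out["B"] / out["year"].map(len)
    # One output row per (original row, year) pair.
    out = out.explode("year").reset_index(drop=True)
    out["year"] = out["year"].astype(int)
    # Clip each piece to the boundaries of its calendar year.
    year_start = pd.to_datetime(out["year"].astype(str) + "-01-01")
    year_end = pd.to_datetime(out["year"].astype(str) + "-12-31")
    out["start"] = out["start"].where(out["start"] > year_start, year_start)
    out["end"] = out["end"].where(out["end"] < year_end, year_end)
    return out.drop(columns="year")

# Hypothetical two-row example: one short range, one four-year range.
demo = pd.DataFrame({
    "start": pd.to_datetime(["2020-01-01", "2020-01-01"]),
    "end": pd.to_datetime(["2020-06-30", "2023-12-31"]),
    "A": [2, 1],
    "B": [3, 2],
})
result = split_by_calendar_year(demo)
print(result)
```

Rows whose dates fall in a single year pass through unchanged, so the separate grouping step is not needed in this variant.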
    

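    【Discussion】:

      【Solution 2】:

      A slightly different approach that adds new columns instead of new rows. But I think it accomplishes what you want to do.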

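      import pandas as pd

      # Whole 365-day blocks between the two dates (an approximation that ignores leap years)
      df["years_apart"] = (
          (df["end_date"] - df["start_date"]).dt.days / 365
      ).astype(int)

      # One "<n>_end_date" column per possible year count; the +1 includes the maximum itself
      for years in range(1, int(df["years_apart"].max()) + 1):
          df[f"{years}_end_date"] = pd.NaT
          df.loc[
              df["years_apart"] == years, f"{years}_end_date"
          ] = df.loc[
              df["years_apart"] == years, "start_date"
          ]  + pd.Timedelta(days=365*years)

      # Split B across the years; rows shorter than one year keep B unchanged (avoids dividing by zero)
      df["B_bis"] = df["B"] / df["years_apart"].clip(lower=1)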
      

      Output:

      start_date     end_date    years_apart     1_end_date   2_end_date   ... 
      2018-01-01    2018-01-02      0            NaT          NaT
      2018-01-02    2019-01-02      1            2019-01-02   NaT
      2018-01-03    2020-01-03      2            NaT          2020-01-03
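
Counting 365-day blocks is not the same as counting calendar years, which matters when the goal is one row per year. A small sketch of the difference, using made-up dates and the column names from the code above (`calendar_years` is a name introduced here for illustration):

```python
import pandas as pd

ranges = pd.DataFrame({
    "start_date": pd.to_datetime(["2020-12-31", "2021-01-01", "2020-01-01"]),
    "end_date": pd.to_datetime(["2021-01-01", "2021-12-31", "2023-12-31"]),
})

# Whole 365-day blocks between the two dates (the measure used above).
ranges["years_apart"] = (ranges["end_date"] - ranges["start_date"]).dt.days // 365

# Number of distinct calendar years the range touches.
ranges["calendar_years"] = (
    ranges["end_date"].dt.year - ranges["start_date"].dt.year + 1
)

print(ranges)
```

The first range lasts one day but touches two calendar years, and the second covers all of 2021 yet counts as zero 365-day blocks, because the difference between 1 January and 31 December is only 364 days.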
      

      【Discussion】:

      • Could you post your output?
      【Solution 3】:

      I solved this by creating a date difference and a counter that adds years to the replicated rows:

      from datetime import timedelta
      import pandas as pd

      # number of years each row spans: whole 365-day blocks between start and end, plus one
      table['diff'] = (table['end'] - table['start'])//timedelta(days=365)
      table['diff'] = table['diff']+1

      # replicate rows depending on number of years
      table = table.reindex(table.index.repeat(table['diff']))

      # counter 1..diff within each original row (replicas share the original index label,
      # which is assumed to be unique)
      table['count'] = table.groupby(level=0).cumcount()+1
      # shift 'start' forward by count-1 years; 'end' is set to the same date,
      # so each replicated row is labeled by a single date
      table['start'] = [s + pd.DateOffset(years=int(c)-1) for s, c in zip(table['start'], table['count'])]
      table['end'] = table['start']

      # split B equally among years (true division, so fractions are kept)
      table['B'] = table['B']/table['diff']
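
The question asks for B to be split proportionally, while the answers here divide it equally per year. If the share should instead follow the number of days that fall in each year, a minimal sketch could look like the following. It assumes columns named `start`, `end`, `A` and `B` with parsed dates and counts both endpoints as part of the range; `split_proportional` and the `demo` frame are hypothetical names used only for this example.

```python
import pandas as pd

def split_proportional(df):
    """Split each row per calendar year, weighting 'B' by days in each year."""
    pieces = []
    for row in df.itertuples(index=False):
        # Total length of the range, counting both endpoints.
        total_days = (row.end - row.start).days + 1
        for year in range(row.start.year, row.end.year + 1):
            # Clip the piece to the boundaries of this calendar year.
            piece_start = max(row.start, pd.Timestamp(year, 1, 1))
            piece_end = min(row.end, pd.Timestamp(year, 12, 31))
            piece_days = (piece_end - piece_start).days + 1
            pieces.append({
                "start": piece_start,
                "end": piece_end,
                "A": row.A,
                "B": row.B * piece_days / total_days,
            })
    return pd.DataFrame(pieces)

# Hypothetical row spanning 184 days of 2020 and 181 days of 2021.
demo = pd.DataFrame({
    "start": pd.to_datetime(["2020-07-01"]),
    "end": pd.to_datetime(["2021-06-30"]),
    "A": [1],
    "B": [365],
})
result = split_proportional(demo)
print(result)
```

The loop is written for readability rather than speed; for large frames the same weights could be computed after a vectorized split.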
      

      【Discussion】:
