【Question title】: Merging values from a dataframe after filtering the values by the last date
【Posted】: 2021-06-16 09:52:43
【Question description】:

I have 2 dataframes:

df_1

product_id      qty_received    date_received
a_1             62              2021-06-11
a_2             30              2021-06-11
a_3             30              2021-06-11
a_4             1               2021-05-24
a_5             1               2021-05-24
a_1             20              2021-05-23  # repeating product_id

df_2

product_id
a_1
b_2
c_4
a_3
a_5
e_5

I am trying to join qty_received and the last date_received from df_1 onto df_2, so that the result looks like this:

product_id      last_receive    qty_received
a_1             2021-06-11      62
b_2             No information  0
c_4             No information  0
a_3             2021-06-11      30
a_5             2021-05-24      1
e_5             No information  0

What I have tried:

df_2.merge(df_1, on='product_id', how='left')

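But this increases the total number of rows. I understand that it probably creates new rows because the same product_id appears more than once in df_1 but only once in df_2.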
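The row growth is easy to reproduce on the sample data. A minimal sketch, assuming the frames shown in the question (the names `merged` and `dup_keys` are only illustrative):

```python
import pandas as pd

df_1 = pd.DataFrame({
    "product_id": ["a_1", "a_2", "a_3", "a_4", "a_5", "a_1"],
    "qty_received": [62, 30, 30, 1, 1, 20],
    "date_received": ["2021-06-11", "2021-06-11", "2021-06-11",
                      "2021-05-24", "2021-05-24", "2021-05-23"],
})
df_2 = pd.DataFrame({"product_id": ["a_1", "b_2", "c_4", "a_3", "a_5", "e_5"]})

merged = df_2.merge(df_1, on="product_id", how="left")
# a_1 matches two rows in df_1, so the left merge emits it twice
print(len(df_2), len(merged))

# keys that occur more than once on the right-hand side
dup_keys = df_1.loc[df_1["product_id"].duplicated(keep=False), "product_id"].unique()
print(dup_keys)
```

If the right-hand frame is expected to have unique keys, passing validate="many_to_one" to merge should make pandas raise instead of silently adding rows.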

Then I tried grouping and taking the max date_received:

df_1.groupby(['product_id'])[['date_received', 'qty_received']].max().reset_index()

But this returns the maximum of date_received and the maximum of qty_received independently, not the qty_received that belongs to the max date_received.
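In the sample data the two happen to coincide for a_1, so here is a minimal sketch on hypothetical data (not from the question) where the older delivery has the larger quantity. It contrasts the column-wise max with picking the whole row of the latest date:

```python
import pandas as pd

# hypothetical data: the older delivery of a_1 has the larger quantity
df = pd.DataFrame({
    "product_id": ["a_1", "a_1"],
    "qty_received": [20, 62],
    "date_received": pd.to_datetime(["2021-06-11", "2021-05-23"]),
})

# column-wise max: each column is maximised on its own
col_max = df.groupby("product_id")[["date_received", "qty_received"]].max()

# row-wise: pick the whole row that holds the latest date
latest = df.loc[df.groupby("product_id")["date_received"].idxmax()]

print(col_max)
print(latest)
```

Here col_max pairs the 2021-06-11 date with a quantity of 62, a combination that never occurred, while latest keeps the quantity of 20 that was actually received on that date.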

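How can I filter for the largest date_received and get the qty_received of each product_id on that date? And what if I want the last 2 dates, adding 2 more columns, second_last_receive and second_qty_received, for the second-highest date_received of each product? The result would then be: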

product_id      last_receive    qty_received        second_last_receive    second_qty_received
a_1             2021-06-11      62                  2021-05-23             20
b_2             No information  0                   No information         No information 
c_4             No information  0                   No information         No information 
a_3             2021-06-11      30                  No information         No information 
a_5             2021-05-24      1                   No information         No information 
e_5             No information  0                   No information         No information 

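【Question discussion】:

  • You need only the last and the previous-to-last rows?
  • I need the 2 last dates with the corresponding qty_received
  • Sure, so the others should be removed? If there is a 3rd, a 4th...

Tags: python pandas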


【Solution 1】:

Use:

import pandas as pd

#convert values to datetimes
df_1['date_received'] = pd.to_datetime(df_1['date_received'])
#sort by date_received, newest first
df = df_1.sort_values(by="date_received", ascending=False)
#create counter column per product_id (already sorted, so 0 is the latest date)
df['g'] = df.groupby(['product_id'])['date_received'].cumcount()

#keep only the last and the previous-to-last row per product
df = df[df['g'] < 2]

#dict for renaming MultiIndex levels from the counter
d = {0:'last', 1:'second_last'}
#reshape by unstack, sorting by second level
df = df.set_index(['product_id','g']).unstack().sort_index(axis=1,level=1).rename(columns=d)
#flatten MultiIndex
df.columns = df.columns.map(lambda x: f'{x[1]}_{x[0]}')
#join to df_2 and replace NaNs
df = df_2.join(df, on='product_id').fillna({'last_qty_received':0}).fillna("No information")

print (df)
  product_id   last_date_received  last_qty_received  \
0        a_1  2021-06-11 00:00:00               62.0   
1        b_2       No information                0.0   
2        c_4       No information                0.0   
3        a_3  2021-06-11 00:00:00               30.0   
4        a_5  2021-05-24 00:00:00                1.0   
5        e_5       No information                0.0   

  second_last_date_received second_last_qty_received  
0       2021-05-23 00:00:00                     20.0  
1            No information           No information  
2            No information           No information  
3            No information           No information  
4            No information           No information  
5            No information           No information  

【Discussion】:

  • I get ValueError: Length mismatch: Expected 637 rows, received array of length 764 at df = df.set_index(['product_id', g.replace(d)]).unstack(). Any idea why this happens?
  • @JonasPalačionis - I changed the answer to create the g column, maybe that helps. Honestly no idea.
  • Could you take a look at this question

【Solution 2】:

You can use groupby with idxmax:

Sample data:

import pandas as pd
df1 = pd.DataFrame({'product_id': {0: 'a_1', 1: 'a_2', 2: 'a_3', 3: 'a_4', 4: 'a_5', 5: 'a_1'},
 'qty_received': {0: 62, 1: 30, 2: 30, 3: 1, 4: 1, 5: 20},
 'date_received': {0: '2021-06-11',
  1: '2021-06-11',
  2: '2021-06-11',
  3: '2021-05-24',
  4: '2021-05-24',
  5: '2021-05-23'}})
df1['date_received'] = pd.to_datetime(df1['date_received'])
df2 = pd.DataFrame({'product_id': {0: 'a_1', 1: 'b_2', 2: 'c_4', 3: 'a_3', 4: 'a_5', 5: 'e_5'}})

Code:

# keep the full df1 intact; df1_last holds only the latest row per product
df1_last = df1.loc[df1.groupby(['product_id'])['date_received'].idxmax()].set_index('product_id')
df2.set_index('product_id').join(df1_last)

Output:

            qty_received date_received
product_id                            
a_1                 62.0    2021-06-11
b_2                  NaN           NaT
c_4                  NaN           NaT
a_3                 30.0    2021-06-11
a_5                  1.0    2021-05-24
e_5                  NaN           NaT
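To get from this output to the layout asked for in the question, the NaN and NaT values still need to be filled and the date column renamed. A minimal sketch, assuming the same sample data; the name `out` is only illustrative, and formatting the dates as text before filling is my own choice so that the placeholder string does not clash with the datetime dtype:

```python
import pandas as pd

df1 = pd.DataFrame({
    "product_id": ["a_1", "a_2", "a_3", "a_4", "a_5", "a_1"],
    "qty_received": [62, 30, 30, 1, 1, 20],
    "date_received": pd.to_datetime(["2021-06-11", "2021-06-11", "2021-06-11",
                                     "2021-05-24", "2021-05-24", "2021-05-23"]),
})
df2 = pd.DataFrame({"product_id": ["a_1", "b_2", "c_4", "a_3", "a_5", "e_5"]})

# latest row per product, then a left merge keeps the order of df2
last = df1.loc[df1.groupby("product_id")["date_received"].idxmax()]
out = (df2.merge(last, on="product_id", how="left")
          .rename(columns={"date_received": "last_receive"}))

# missing quantities become 0, missing dates become a text placeholder
out["qty_received"] = out["qty_received"].fillna(0).astype(int)
out["last_receive"] = out["last_receive"].dt.strftime("%Y-%m-%d").fillna("No information")
out = out[["product_id", "last_receive", "qty_received"]]
print(out)
```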

As for the second question, what to do if more dates need to be considered:

Then you can use .rank() like this:

# df1 is the full sample data here, not the frame filtered with idxmax
df1['rank'] = df1.groupby(['product_id'])['date_received'].rank(method='max', ascending=False)
df1_wide = df1.pivot(index='product_id', columns='rank').swaplevel(axis=1).sort_index(axis=1,level=[0,1],ascending=[True,False])
df2.set_index('product_id').join(df1_wide)
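One caveat with the rank approach: if a product is received twice on the same date, method='max' gives both rows the same rank, and pivot cannot reshape duplicate (product_id, rank) pairs. A minimal sketch on hypothetical data (not from the question) that uses method="first" to keep the ranks unique and flattens the column names; the `name_rank` naming scheme is my own assumption:

```python
import pandas as pd

# hypothetical data: a_1 was received twice on the same day
df = pd.DataFrame({
    "product_id": ["a_1", "a_1", "a_3"],
    "qty_received": [62, 20, 30],
    "date_received": pd.to_datetime(["2021-06-11", "2021-06-11", "2021-06-11"]),
})

# method="first" gives tied dates distinct ranks, so ranks stay unique per product
df["rank"] = (df.groupby("product_id")["date_received"]
                .rank(method="first", ascending=False)
                .astype(int))

wide = df.pivot(index="product_id", columns="rank")
# flatten ('qty_received', 1) into 'qty_received_1'
wide.columns = [f"{name}_{rank}" for name, rank in wide.columns]
print(wide)
```

With flat column names the result can be joined to df2 on product_id like any ordinary frame.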

【Discussion】:

    【Solution 3】:

    I would sort the values in df1 by date received, then drop the duplicated product_ids and keep the last value:

    df1_temp = df1.sort_values(by="date_received").drop_duplicates("product_id", keep="last")
    
      product_id  qty_received date_received
    3        a_4             1    2021-05-24
    4        a_5             1    2021-05-24
    0        a_1            62    2021-06-11
    1        a_2            30    2021-06-11
    2        a_3            30    2021-06-11
    

    Then you can use your merge code without any problem:

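    df2_merged = df_2.merge(df1_temp, on='product_id', how='left')
    # assign the result back instead of filling a column selection in place
    df2_merged["qty_received"] = df2_merged["qty_received"].fillna(0)
    df2_merged["date_received"] = df2_merged["date_received"].fillna("No information")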
    
    output:
      product_id  qty_received   date_received
    0        a_1          62.0      2021-06-11
    1        b_2           0.0  No information
    2        c_4           0.0  No information
    3        a_3          30.0      2021-06-11
    4        a_5           1.0      2021-05-24
    5        e_5           0.0  No information
    

    To get the rows from the first dataframe that were not used as the last receipt:

    not_last = df1[~df1.isin(df1_temp)].dropna()
    

    Then follow a similar procedure of sorting by date and dropping duplicates:

    second_last = not_last.sort_values(by="date_received").drop_duplicates("product_id", keep="last")
    

    Merge the dataframes again:

    df2_second_merge = df_2.merge(second_last, on='product_id', how='left')
    

    Then join the two dataframes:

    df2_merged.join(df2_second_merge.set_index("product_id"), on="product_id", rsuffix="_second")
    

    This can be wrapped in a function to handle any number of levels. I will leave that to you, if you need it.
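A minimal sketch of what such a function could look like. It swaps the repeated sort/drop_duplicates/merge rounds for a per-product counter, so it is a variation on the steps above rather than a literal wrapping of them. The function name `last_n_receipts`, its parameters, and the `_1`, `_2` column suffixes are all hypothetical:

```python
import pandas as pd

def last_n_receipts(receipts, products, n=2):
    """Hypothetical helper: attach the n most recent receipts per product as wide columns."""
    ordered = receipts.sort_values("date_received", ascending=False)
    # 0 = most recent receipt of the product, 1 = the one before it, ...
    ordered = ordered.assign(level=ordered.groupby("product_id").cumcount())
    ordered = ordered[ordered["level"] < n]
    # one block of columns per level, keeping the original dtypes
    wide = (ordered.set_index(["product_id", "level"])[["date_received", "qty_received"]]
                   .unstack()
                   .sort_index(axis=1, level=1))
    wide.columns = [f"{name}_{level + 1}" for name, level in wide.columns]
    return products.join(wide, on="product_id")

df_1 = pd.DataFrame({
    "product_id": ["a_1", "a_2", "a_3", "a_4", "a_5", "a_1"],
    "qty_received": [62, 30, 30, 1, 1, 20],
    "date_received": pd.to_datetime(["2021-06-11", "2021-06-11", "2021-06-11",
                                     "2021-05-24", "2021-05-24", "2021-05-23"]),
})
df_2 = pd.DataFrame({"product_id": ["a_1", "b_2", "c_4", "a_3", "a_5", "e_5"]})

result = last_n_receipts(df_1, df_2, n=2)
print(result)
```

Products without a receipt at a given level come back as NaN/NaT and can be filled afterwards. Receipts of one product that share a date are ordered arbitrarily in this sketch.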

    【Discussion】:

    • But this removes the possibility of having the second_last_date_received that I want.