Merging values from a dataframe after filtering the values by the last date: answers

【Question title】: Merging values from a dataframe after filtering the values by the last date
【Posted】: 2021-06-16 09:52:43
【Question】:

I have 2 dataframes:

df_1

product_id      qty_received    date_received
a_1             62              2021-06-11
a_2             30              2021-06-11
a_3             30              2021-06-11
a_4             1               2021-05-24
a_5             1               2021-05-24
a_1             20              2021-05-23  # repeating product_id

df_2

product_id
a_1
b_2
c_4
a_3
a_5
e_5

I am trying to join qty_received and the last date_received from df_1 onto df_2, so that the result looks like this:

product_id      last_receive    qty_received
a_1             2021-06-11      62
b_2             No information  0
c_4             No information  0
a_3             2021-06-11      30
a_5             2021-05-24      1
e_5             No information  0

What I have tried:

df_2.merge(df_1, on='product_id', how='left')

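But this increases the total number of rows. I understand that it probably creates new rows because df_1 contains more than one row for the same product_id, while df_2 does not.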
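The row growth described above can be reproduced with the sample data. The following is a minimal sketch, not part of the original post; the variable names `rows_before`, `rows_after` and `duplicate_keys_detected` are made up for illustration. It also shows the `validate` argument of `merge`, which makes pandas raise an error instead of silently duplicating rows.

```python
import pandas as pd

df_1 = pd.DataFrame({
    "product_id": ["a_1", "a_2", "a_3", "a_4", "a_5", "a_1"],
    "qty_received": [62, 30, 30, 1, 1, 20],
    "date_received": ["2021-06-11", "2021-06-11", "2021-06-11",
                      "2021-05-24", "2021-05-24", "2021-05-23"],
})
df_2 = pd.DataFrame({"product_id": ["a_1", "b_2", "c_4", "a_3", "a_5", "e_5"]})

merged = df_2.merge(df_1, on="product_id", how="left")
rows_before = len(df_2)   # one row per product in df_2
rows_after = len(merged)  # a_1 matches two rows in df_1, so it appears twice

# validate="many_to_one" requires the keys of the right frame to be unique
try:
    df_2.merge(df_1, on="product_id", how="left", validate="many_to_one")
    duplicate_keys_detected = False
except pd.errors.MergeError:
    duplicate_keys_detected = True
```

Reducing df_1 to one row per product_id before merging, as the answers below do, avoids the extra rows.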

Then I tried grouping and taking the max date_received:

df_1.groupby(['product_id'])[['date_received', 'qty_received']].max().reset_index()

But this returns the maximum of date_received and the maximum of qty_received independently, not the qty_received that belongs to the max date_received.
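To make the problem concrete, here is a small sketch with hypothetical data (not the data from the question) in which the older delivery is the larger one. The per-column maxima then form a date and quantity pair that does not occur in any row.

```python
import pandas as pd

# Hypothetical data: the older delivery of a_1 is the larger one
df = pd.DataFrame({
    "product_id": ["a_1", "a_1"],
    "qty_received": [5, 20],
    "date_received": pd.to_datetime(["2021-06-11", "2021-05-23"]),
})

# max() is applied to each column separately within each group
per_column = df.groupby("product_id")[["date_received", "qty_received"]].max()

# Check whether the (date, qty) pair returned for a_1 exists as a row in df
pair_exists = (
    (df["date_received"] == per_column.loc["a_1", "date_received"])
    & (df["qty_received"] == per_column.loc["a_1", "qty_received"])
).any()
```

Here `per_column` pairs the date 2021-06-11 with the quantity 20, although 20 was received on 2021-05-23.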

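How can I filter for the max date_received and get the qty_received of that date for each product_id? And what if I want the last 2 dates, so that 2 more columns, second_last_received and second_qty_received, hold the values for the second-highest date_received of each product? The result would then be: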

product_id      last_receive    qty_received        second_last_receive    second_qty_received
a_1             2021-06-11      62                  2021-05-23             20
b_2             No information  0                   No information         No information 
c_4             No information  0                   No information         No information 
a_3             2021-06-11      30                  No information         No information 
a_5             2021-05-24      1                   No information         No information 
e_5             No information  0                   No information         No information

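【Question comments】:

You need only the last and the second-to-last rows?
I need the 2 last dates, with the corresponding qty_received.
Sure, so the others should be removed? If there is a 3rd, 4th...

Tags: python pandas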

【Solution 1】:

Use:

import pandas as pd

# convert values to datetimes
df_1['date_received'] = pd.to_datetime(df_1['date_received'])
# sort by date_received, most recent first
df = df_1.sort_values(by="date_received", ascending=False)
# counter column per product_id (rows are already sorted by descending date)
df['g'] = df.groupby(['product_id'])['date_received'].cumcount()

# keep only the last and second-to-last rows per product
df = df[df['g'] < 2]

# dict for renaming the MultiIndex level created from the counter
d = {0:'last', 1:'second_last'}
# reshape with unstack, sort by the second level
df = df.set_index(['product_id','g']).unstack().sort_index(axis=1,level=1).rename(columns=d)
# flatten the MultiIndex
df.columns = df.columns.map(lambda x: f'{x[1]}_{x[0]}')
# join to df_2 and replace NaNs
df = df_2.join(df, on='product_id').fillna({'last_qty_received':0}).fillna("No information")

print (df)
  product_id   last_date_received  last_qty_received  \
0        a_1  2021-06-11 00:00:00               62.0   
1        b_2       No information                0.0   
2        c_4       No information                0.0   
3        a_3  2021-06-11 00:00:00               30.0   
4        a_5  2021-05-24 00:00:00                1.0   
5        e_5       No information                0.0   

  second_last_date_received second_last_qty_received  
0       2021-05-23 00:00:00                     20.0  
1            No information           No information  
2            No information           No information  
3            No information           No information  
4            No information           No information  
5            No information           No information
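The counter column `g` is what decides which row counts as "last" and which as "second last". As a rough illustration, assuming the sample df_1 from the question with its default integer index, the sketch below shows the values `cumcount` assigns after sorting by descending date.

```python
import pandas as pd

df_1 = pd.DataFrame({
    "product_id": ["a_1", "a_2", "a_3", "a_4", "a_5", "a_1"],
    "qty_received": [62, 30, 30, 1, 1, 20],
    "date_received": pd.to_datetime(["2021-06-11", "2021-06-11", "2021-06-11",
                                     "2021-05-24", "2021-05-24", "2021-05-23"]),
})

df = df_1.sort_values(by="date_received", ascending=False)
# cumcount numbers the rows of each product in their current order:
# 0 for the most recent receipt, 1 for the one before it, and so on
df["g"] = df.groupby("product_id").cumcount()

counter_by_row = df["g"].to_dict()
```

Only the second a_1 row (original index 5, dated 2021-05-23) gets `g == 1`; every other row is the most recent receipt of its product and gets `g == 0`.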

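【Comments】:

At df = df.set_index(['product_id', g.replace(d)]).unstack() I get ValueError: Length mismatch: Expected 637 rows, received array of length 764. Any idea why this happens?
@JonasPalačionis - I changed how the g column is created in the answer, maybe that helps. Honestly, I don't know.
Can you take a look at this question?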

【Solution 2】:

You can use groupby together with idxmax:

Sample data:

import pandas as pd
df1 = pd.DataFrame({'product_id': {0: 'a_1', 1: 'a_2', 2: 'a_3', 3: 'a_4', 4: 'a_5', 5: 'a_1'},
 'qty_received': {0: 62, 1: 30, 2: 30, 3: 1, 4: 1, 5: 20},
 'date_received': {0: '2021-06-11',
  1: '2021-06-11',
  2: '2021-06-11',
  3: '2021-05-24',
  4: '2021-05-24',
  5: '2021-05-23'}})
df1['date_received'] = pd.to_datetime(df1['date_received'])
df2 = pd.DataFrame({'product_id': {0: 'a_1', 1: 'b_2', 2: 'c_4', 3: 'a_3', 4: 'a_5', 5: 'e_5'}})

Code:

df1 = df1.loc[df1.groupby(['product_id'])['date_received'].idxmax()].set_index('product_id')
df2.set_index('product_id').join(df1)

Output:

            qty_received date_received
product_id                            
a_1                 62.0    2021-06-11
b_2                  NaN           NaT
c_4                  NaN           NaT
a_3                 30.0    2021-06-11
a_5                  1.0    2021-05-24
e_5                  NaN           NaT

For the second question, what to do if more dates need to be considered:

You can then use .rank() like this:

# start again from the sample df1 above, before it was reduced to one row per product
df1['rank'] = df1.groupby(['product_id'])['date_received'].rank(method='max', ascending=False)
df1 = df1.pivot(index='product_id', columns='rank').swaplevel(axis=1).sort_index(axis=1,level=[0,1],ascending=[True,False])
df2.set_index('product_id').join(df1)
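The rank-based approach is shown without output, so here is a hedged variant of it rather than the answer's exact code. It assumes the sample data from this answer, uses `method="first"` so that ranks stay unique even if two receipts share a date, keeps only ranks 1 and 2, and flattens the column MultiIndex into plain names such as `qty_received_1`. The names `top2`, `wide` and `result` are made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({
    "product_id": ["a_1", "a_2", "a_3", "a_4", "a_5", "a_1"],
    "qty_received": [62, 30, 30, 1, 1, 20],
    "date_received": pd.to_datetime(["2021-06-11", "2021-06-11", "2021-06-11",
                                     "2021-05-24", "2021-05-24", "2021-05-23"]),
})
df2 = pd.DataFrame({"product_id": ["a_1", "b_2", "c_4", "a_3", "a_5", "e_5"]})

# rank 1 = most recent receipt per product; "first" breaks ties by row order
df1["rank"] = (
    df1.groupby("product_id")["date_received"]
    .rank(method="first", ascending=False)
    .astype(int)
)

# keep the two most recent receipts and spread them into columns
top2 = df1[df1["rank"] <= 2]
wide = top2.pivot(index="product_id", columns="rank")

# flatten ('qty_received', 1) into 'qty_received_1', and so on
wide.columns = [f"{name}_{rank}" for name, rank in wide.columns]

result = df2.set_index("product_id").join(wide)
```

Products that are missing from df1, or that have only one receipt, end up with NaN or NaT in the corresponding columns, which can then be filled as in the other answers.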

【Comments】:

【Solution 3】:

I would sort the values in df1 by date received, then drop the duplicated product_ids and keep the last value:

df1_temp = df1.sort_values(by="date_received").drop_duplicates("product_id", keep="last")

  product_id  qty_received date_received
3        a_4             1    2021-05-24
4        a_5             1    2021-05-24
0        a_1            62    2021-06-11
1        a_2            30    2021-06-11
2        a_3            30    2021-06-11

You can then use your merge code without problems:

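df2_merged = df_2.merge(df1_temp, on='product_id', how='left')
# fill missing values; assigning the result back avoids chained inplace calls
df2_merged["qty_received"] = df2_merged["qty_received"].fillna(0)
df2_merged["date_received"] = df2_merged["date_received"].fillna("No information")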

Output:
  product_id  qty_received   date_received
0        a_1          62.0      2021-06-11
1        b_2           0.0  No information
2        c_4           0.0  No information
3        a_3          30.0      2021-06-11
4        a_5           1.0      2021-05-24
5        e_5           0.0  No information

To get the rows of the first dataframe that are not the most recent receipt:

not_last = df1[~df1.isin(df1_temp)].dropna()

Then follow a similar procedure, sorting by date and dropping duplicates:

second_last = not_last.sort_values(by="date_received").drop_duplicates("product_id", keep="last")

Merge the dataframes again:

df2_second_merge = df_2.merge(second_last, on='product_id', how='left')

Then join the two dataframes:

df2_merged.join(df2_second_merge.set_index("product_id"), on="product_id", rsuffix="_second")

This could be wrapped in a function that handles any number of levels. I will leave that to you, if you need it.
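One possible shape for such a function is sketched below. It is not the answerer's code: the helper name `last_n_received`, its parameters and the numbered column suffixes are assumptions made for illustration, and it assumes the receipts frame has a unique index, since the rows used in each pass are dropped by index label.

```python
import pandas as pd

def last_n_received(received, products, n=2):
    """Hypothetical helper: join the n most recent receipts per product onto products."""
    remaining = received.sort_values("date_received")
    result = products.copy()
    for level in range(1, n + 1):
        # most recent remaining row per product
        latest = remaining.drop_duplicates("product_id", keep="last")
        # remove those rows so the next pass finds the next most recent ones
        remaining = remaining.drop(latest.index)
        # qty_received -> qty_received_1, date_received -> date_received_1, ...
        suffixed = latest.set_index("product_id").add_suffix(f"_{level}")
        result = result.join(suffixed, on="product_id")
    return result

df1 = pd.DataFrame({
    "product_id": ["a_1", "a_2", "a_3", "a_4", "a_5", "a_1"],
    "qty_received": [62, 30, 30, 1, 1, 20],
    "date_received": pd.to_datetime(["2021-06-11", "2021-06-11", "2021-06-11",
                                     "2021-05-24", "2021-05-24", "2021-05-23"]),
})
df_2 = pd.DataFrame({"product_id": ["a_1", "b_2", "c_4", "a_3", "a_5", "e_5"]})

out = last_n_received(df1, df_2, n=2)
```

Missing values in `out` stay as NaN or NaT; they can be replaced with 0 or "No information" afterwards, as shown earlier in this answer.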

【Comments】:

But this removes the possibility of having the second_last_date_received that I want.