Binning a dataframe when a condition matches in a second dataframe: answers

【Question title】: Binning a dataframe when a condition matches in a second dataframe
【Posted】: 2021-12-16 15:47:44
【Question description】:

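Good morning, everyone. I would like to create a binned column in my main dataframe using data from a second dataframe. Dataframe #1 has the columns "Runner ID" and "Cumulative Distance". Dataframe #2 has the columns "Runner ID", "Section Start", and "Section Name". I am trying to create a third column on Dataframe #1 called "Section Name Binning": rows are matched on "Runner ID" across the two dataframes, and then "Cumulative Distance" from Dataframe #1 is binned using the "Section Start" values from Dataframe #2 as bin edges and the "Section Name" values as bin labels. "Cumulative Distance" in Dataframe #1 and "Section Start" in Dataframe #2 are always in increasing order and restart whenever "Runner ID" changes. A picture and example dataframes are attached. As always, thank you for your support.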

Dataframes for binning

import pandas as pd

df1=pd.DataFrame({'Runner_ID':['John','John','John','John','John','John','John','John','John','John','John','Jen','Jen','Jen','Jen','Jen','Jen','Jen','Jen','Jen','Jen','Jen'],'Cumulative_Distance':[1,1.5,1.8,3,3.2,3.7,4,4.3,5,6.6,8,2,2.3,2.8,3.2,3.5,3.9,4.8,5,5.3,5.8,6]})

df2=pd.DataFrame({'Runner_ID':['John','John','John','Jen','Jen','Jen','Jen'],'Section_Start':[0,3,5,0,2.5,3.5,5], 'Section_Name':['Flats', 'Uphill', 'Downhill', 'Flats', 'Uphill','Curve', 'Downhill']})

【Question discussion】:

Tags: python pandas binning

【Solution 1】:

This is a job for pd.merge_asof:

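# merge_asof requires both frames to be sorted by their "on" keys.
# For each df1 row it picks the last df2 row of the same Runner_ID whose
# Section_Start is strictly below Cumulative_Distance (allow_exact_matches=False).
(pd.merge_asof(df1.sort_values('Cumulative_Distance'),df2.sort_values('Section_Start'),
               left_on='Cumulative_Distance', right_on='Section_Start',
               by='Runner_ID', allow_exact_matches=False)
   .sort_values(['Runner_ID','Cumulative_Distance'])
)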

Output:

   Runner_ID  Cumulative_Distance  Section_Start Section_Name
3        Jen                  2.0            0.0        Flats
4        Jen                  2.3            0.0        Flats
5        Jen                  2.8            2.5       Uphill
8        Jen                  3.2            2.5       Uphill
9        Jen                  3.5            2.5       Uphill
11       Jen                  3.9            3.5        Curve
14       Jen                  4.8            3.5        Curve
15       Jen                  5.0            3.5        Curve
17       Jen                  5.3            5.0     Downhill
18       Jen                  5.8            5.0     Downhill
19       Jen                  6.0            5.0     Downhill
0       John                  1.0            0.0        Flats
1       John                  1.5            0.0        Flats
2       John                  1.8            0.0        Flats
6       John                  3.0            0.0        Flats
7       John                  3.2            3.0       Uphill
10      John                  3.7            3.0       Uphill
12      John                  4.0            3.0       Uphill
13      John                  4.3            3.0       Uphill
16      John                  5.0            3.0       Uphill
20      John                  6.6            5.0     Downhill
21      John                  8.0            5.0     Downhill
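
The result above is ordered by runner and distance and carries the labels in a column called Section_Name. If the goal is the "Section Name Binning" column from the question, attached to df1 in its original row order, the same merge can be wrapped as in the rough sketch below. The function name add_section_bins, the helper column _row, and the exact flag are made up for illustration and are not part of pandas or of the answer above. The sketch also assumes that both key columns are floats, as they are in the example data.

```python
import pandas as pd

df1 = pd.DataFrame({
    'Runner_ID': ['John'] * 11 + ['Jen'] * 11,
    'Cumulative_Distance': [1, 1.5, 1.8, 3, 3.2, 3.7, 4, 4.3, 5, 6.6, 8,
                            2, 2.3, 2.8, 3.2, 3.5, 3.9, 4.8, 5, 5.3, 5.8, 6],
})
df2 = pd.DataFrame({
    'Runner_ID': ['John'] * 3 + ['Jen'] * 4,
    'Section_Start': [0, 3, 5, 0, 2.5, 3.5, 5],
    'Section_Name': ['Flats', 'Uphill', 'Downhill',
                     'Flats', 'Uphill', 'Curve', 'Downhill'],
})


def add_section_bins(runs, sections, exact=False):
    """Return a copy of runs with a Section_Name_Binning column (hypothetical helper)."""
    # Remember the original row position so it can be restored after sorting.
    left = (runs.assign(_row=range(len(runs)))
                .sort_values('Cumulative_Distance'))
    right = sections.sort_values('Section_Start')
    merged = pd.merge_asof(
        left, right,
        left_on='Cumulative_Distance', right_on='Section_Start',
        by='Runner_ID',
        # False: a distance equal to a Section_Start stays in the previous section.
        # True: it moves into the section that starts there.
        allow_exact_matches=exact,
    ).sort_values('_row')
    out = runs.copy()
    # Assign by position, so the index of runs does not matter.
    out['Section_Name_Binning'] = merged['Section_Name'].to_numpy()
    return out


binned = add_section_bins(df1, df2)                  # same labels as the output above
inclusive = add_section_bins(df1, df2, exact=True)   # boundary rows move up one section
print(binned)
```

With exact=False, John's row at distance 3.0 is labeled Flats, as in the output above. With exact=True the same row is labeled Uphill. Which behavior is wanted depends on whether a section is meant to include its own start point, which the question does not state.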

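【Comments】:

Thank you very much. I wasn't familiar with this function. Is it efficient on very large datasets?
@GusRo Pandas is not efficient on very large datasets. That said, this function is about as efficient as it gets, AFAICT.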