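【Posted】:2017-10-17 06:36:21
【Question】:
I have a ranking function that I apply to a large number of columns across several million rows, and it takes minutes to run. By removing all the logic that prepares the data for the .rank( method, i.e. by doing the following: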
ranked = df[['period_id', 'sector_name'] + to_rank].groupby(['period_id', 'sector_name']).transform(lambda x: (x.rank(ascending = True) - 1)*100/len(x))
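I managed to get this down to seconds. However, I need to keep my logic, and I am struggling to restructure my code: ultimately, the largest bottleneck is my double use of lambda x:, but clearly other aspects are slowing things down (see below). I have provided a sample data frame together with my ranking functions below, i.e. an MCVE. Broadly, I think my questions boil down to:
(i) How can I replace the .apply(lambda x usage in my code with a fast, vectorized equivalent?
(ii) How can I loop over multi-indexed, grouped data frames and apply a function? In my case, to each unique combination of the date_id and category columns.
(iii) What else can I do to speed up my ranking logic? The main overhead seems to be in .value_counts(). This overlaps with (i) above; perhaps most of this logic can be done on df before it is sent for ranking, by building temporary columns. Similarly, can the sub-dataframes be ranked in a single call?
(iv) Why use pd.qcut() rather than df.rank()? The latter is cythonized and seems to handle ties more flexibly, but I cannot find a comparison between the two, and pd.qcut() seems to be the most widely used.
Sample input data is as follows: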
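On point (iv), the practical difference shows up with ties. The sketch below uses a small made-up Series (not the asker's data) to illustrate it: rank gives tied values a shared rank, while pd.qcut needs unique bin edges and raises when a large cluster of ties makes the quantile edges collide, unless duplicates="drop" is passed. That collision appears to be what the max_bins check in f(x) is guarding against.

```python
import pandas as pd

# Illustrative data: three tied values at the bottom of ten observations.
s = pd.Series([1, 1, 1, 2, 3, 4, 5, 6, 7, 8])

# rank: tied values share the average of the ranks they span.
pct_rank = s.rank(method="average", pct=True)

# qcut: the 0%, 10% and 20% quantiles are all 1, so the bin edges are not unique.
try:
    pd.qcut(s, 10, labels=False)
    qcut_failed = False
except ValueError:
    qcut_failed = True

# Dropping duplicate edges lets qcut proceed, with fewer than 10 bins.
binned = pd.qcut(s, 10, labels=False, duplicates="drop")
```

With duplicates="drop" the number of bins is no longer the number requested, so any rescaling by 100 / bins would need the actual bin count.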
import pandas as pd
import numpy as np
import random
to_rank = ['var_1', 'var_2', 'var_3']
df = pd.DataFrame({'var_1' : np.random.randn(1000), 'var_2' : np.random.randn(1000), 'var_3' : np.random.randn(1000)})
df['date_id'] = np.random.choice(range(2001, 2012), df.shape[0])
df['category'] = ','.join(chr(random.randrange(97, 97 + 4 + 1)).upper() for x in range(1,df.shape[0]+1)).split(',')
The two ranking functions are:
def rank_fun(df, to_rank): # calls ranking function f(x) to rank each category at each date
    #extra data tidying logic here beyond scope of question - can remove
    ranked = df[to_rank].apply(lambda x: f(x))
    return ranked

def f(x):
    nans = x[np.isnan(x)] # Remove nans as these will be ranked with 50
    sub_df = x.dropna() #
    nans_ranked = nans.replace(np.nan, 50) # give nans rank of 50
    if len(sub_df.index) == 0: #check not all nan. If no non-nan data, then return with rank 50
        return nans_ranked
    if len(sub_df.unique()) == 1: # if all data has same value, return rank 50
        sub_df[:] = 50
        return sub_df
    #Check that we don't have too many clustered values, such that we can't bin due to overlap of ties, and reduce bin size provided we can at least quintile rank.
    max_cluster = sub_df.value_counts().iloc[0] #value_counts sorts by counts, so first element will contain the max
    max_bins = len(sub_df) / max_cluster
    if max_bins > 100: #if largest cluster <1% of available data, then we can percentile_rank
        max_bins = 100
    if max_bins < 5: #if we don't have the resolution to quintile rank then assume no data.
        sub_df[:] = 50
        return sub_df
    bins = int(max_bins) # bin using highest resolution that the data supports, subject to constraints above (max 100 bins, min 5 bins)
    sub_df_ranked = pd.qcut(sub_df, bins, labels=False) #currently using pd.qcut. pd.rank( seems to have extra functionality, but overheads similar in practice
    sub_df_ranked *= (100 / bins) #Since we bin using the resolution specified in bins, to convert back to decile rank, we have to multiply by 100/bins. E.g. with quintiles, we'll have scores 1 - 5, so have to multiply by 100 / 5 = 20 to convert to percentile ranking
    ranked_df = pd.concat([sub_df_ranked, nans_ranked])
    return ranked_df
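The value_counts() call in f(x) runs once per group and per column. As a rough sketch of how the same quantity might be computed for all groups at once, the block below counts each (group, value) cluster with a single groupby and then takes the largest count per group. The small integer-valued frame and the names cluster_sizes, n_valid and max_bins_per_group are illustrative assumptions, not part of the original code.

```python
import numpy as np
import pandas as pd

# Illustrative frame with heavy ties; only the column names follow the question.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "date_id": rng.choice([2001, 2002], 200),
    "category": rng.choice(list("AB"), 200),
    "var_1": rng.integers(0, 5, 200).astype(float),
})
keys = ["date_id", "category"]

# Size of every (group, value) cluster; NaN values are dropped by groupby.
cluster_sizes = df.groupby(keys + ["var_1"]).size()

# Largest tie cluster within each (date_id, category) group.
max_cluster = cluster_sizes.groupby(level=keys).max()

# Non-NaN observations per group, then the bin cap of 100 used in f(x).
n_valid = df.groupby(keys)["var_1"].count()
max_bins_per_group = (n_valid / max_cluster).clip(upper=100)
```

The result is one value per group, so it can be inspected or mapped back onto the rows before any ranking is done.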
The code that calls my ranking function and recombines the result with df is:
# ensure don't get duplicate columns if ranking already executed
ranked_cols = [col + '_ranked' for col in to_rank]
ranked = df[['date_id', 'category'] + to_rank].groupby(['date_id', 'category'], as_index = False).apply(lambda x: rank_fun(x, to_rank))
ranked.columns = ranked_cols
ranked.reset_index(inplace = True)
ranked.set_index('level_1', inplace = True)
df = df.join(ranked[ranked_cols])
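For comparison, here is a minimal sketch of a lambda-free version of this step, under simplifying assumptions: it uses groupby(...).rank() for percentile scores, gives NaNs and constant groups a score of 50, divides by the count of non-NaN values, and leaves out the adaptive qcut binning from f(x) entirely. The data are regenerated with illustrative NaNs, so it is a starting point rather than a drop-in replacement.

```python
import numpy as np
import pandas as pd

# Sample data in the shape of the question, with some NaNs added for illustration.
rng = np.random.default_rng(1)
to_rank = ["var_1", "var_2"]
df = pd.DataFrame(rng.standard_normal((1000, 2)), columns=to_rank)
df.loc[::50, "var_1"] = np.nan
df["date_id"] = rng.choice(np.arange(2001, 2012), len(df))
df["category"] = rng.choice(list("ABCDE"), len(df))

keys = ["date_id", "category"]
grouped = df.groupby(keys)[to_rank]

ranks = grouped.rank(method="average")   # NaN inputs stay NaN
counts = grouped.transform("count")      # non-NaN count per group, per row
constant = grouped.transform("max") == grouped.transform("min")

scores = (ranks - 1) * 100 / counts      # 0-based percentile within each group
scores = scores.where(~constant, 50.0)   # all values equal: score 50
scores = scores.fillna(50.0)             # NaN inputs: score 50
scores.columns = [col + "_ranked" for col in to_rank]

df = df.join(scores)
```

Because rank and transform return frames aligned to the original index, no reset_index or set_index step is needed before the join.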
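I am trying to make this ranking logic as fast as possible by removing both lambda x calls; I can remove the logic in rank_fun so that only the logic of f(x) applies, but I also do not know how to process a multi-indexed data frame in a vectorized way. A further question concerns the differences between pd.qcut( and df.rank(: the two seem to handle ties differently, yet their overheads seem similar even though .rank( is cythonized; this may be misleading, given that the main overhead is due to my use of lambda x.
Running %lprun on f(x) gave the following results, although the main overhead comes from using .apply(lambda x rather than a vectorized approach: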
Line #      Hits         Time  Per Hit   % Time  Line Contents
     2                                           def tst_fun(df, field):
     3         1          685    685.0      0.2      x = df[field]
     4         1        20726  20726.0      5.8      nans = x[np.isnan(x)]
     5         1        28448  28448.0      8.0      sub_df = x.dropna()
     6         1          387    387.0      0.1      nans_ranked = nans.replace(np.nan, 50)
     7         1            5      5.0      0.0      if len(sub_df.index) == 0:
     8                                                   pass #check not empty. May be empty due to nans for first 5 years e.g. no revenue/operating margin data pre 1990
     9                                                   return nans_ranked
    10
    11         1        65559  65559.0     18.4      if len(sub_df.unique()) == 1:
    12                                                   sub_df[:] = 50 #e.g. for subranks where all factors had nan so ranked as 50 e.g. in 1990
    13                                                   return sub_df
    14
    15                                               #Finally, check that we don't have too many clustered values, such that we can't bin, and reduce bin size provided we can at least quintile rank.
    16         1        74610  74610.0     20.9      max_cluster = sub_df.value_counts().iloc[0] #value_counts sorts by counts, so first element will contain the max
    17                                               # print(counts)
    18         1            9      9.0      0.0      max_bins = len(sub_df) / max_cluster #
    19
    20         1            3      3.0      0.0      if max_bins > 100:
    21         1            0      0.0      0.0          max_bins = 100 #if largest cluster <1% of available data, then we can percentile_rank
    22
    23
    24         1            0      0.0      0.0      if max_bins < 5:
    25                                                   sub_df[:] = 50 #if we don't have the resolution to quintile rank then assume no data.
    26
    27                                                   # return sub_df
    28
    29         1            1      1.0      0.0      bins = int(max_bins) # bin using highest resolution that the data supports, subject to constraints above (max 100 bins, min 5 bins)
    30
    31                                               #should track bin resolution for all data. To add.
    32
    33                                               #if get here, then neither nans_ranked, nor sub_df are empty
    34                                               # sub_df_ranked = pd.qcut(sub_df, bins, labels=False)
    35         1       160530 160530.0     45.0      sub_df_ranked = (sub_df.rank(ascending = True) - 1)*100/len(x)
    36
    37         1         5777   5777.0      1.6      ranked_df = pd.concat([sub_df_ranked, nans_ranked])
    38
    39         1            1      1.0      0.0      return ranked_df
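In the profile above, unique() and value_counts() together take roughly 39% of the time, and both scan the same values. One possible saving, sketched here on a made-up Series, is to call value_counts() once and derive the non-NaN count, the number of distinct values and the largest cluster from that single result. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Illustrative column: five valid values, one NaN, largest cluster of size 3.
x = pd.Series([1.0, 2.0, 2.0, np.nan, 3.0, 2.0])

# value_counts drops NaN by default and sorts by count, largest first.
counts = x.value_counts()

n_valid = int(counts.sum())    # replaces len(x.dropna())
n_unique = len(counts)         # replaces len(sub_df.unique())
max_cluster = int(counts.iloc[0]) if n_unique else 0
```

Whether this helps in practice depends on how many distinct values each group holds, so it is worth re-profiling before relying on it.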
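【Comments】:
-
Have you considered using multiprocessing to run the lambda statements faster? I don't know how well pandas handles multiprocessing/multithreading, but I think you should give it a try.
-
Thanks, that's an interesting idea. Still, it must be possible to vectorize my "loop"!
-
Numba might be able to vectorize your ranking function.
-
I haven't spent enough time on this to give a good answer, but have you tried arranging the data into columns that can be processed in parallel and then passing those values to a vectorized function such as bn.nanrankdata? That way you don't call Python n times and can stay in C code. It does depend on having a function that can operate atomically on each column, though. Can you do that?
-
I'm not sure, but if it works like map, it might run faster without wrapping the function in a lambda: ranked = df[to_rank].apply(f)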
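The last suggestion can be checked on a toy frame. In the sketch below, demean is a hypothetical stand-in for f; passing it directly to apply gives the same result as wrapping it in a lambda and only removes one extra Python call per column, so any speed-up is likely to be small next to the cost of calling apply once per group.

```python
import numpy as np
import pandas as pd

def demean(col):
    """Hypothetical stand-in for f: subtract the column mean."""
    return col - col.mean()

df = pd.DataFrame(np.arange(6.0).reshape(3, 2), columns=["var_1", "var_2"])

with_lambda = df.apply(lambda x: demean(x))   # extra wrapper call per column
without_lambda = df.apply(demean)             # function passed directly
```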
Tags: python pandas lambda vectorization ranking