【Posted】:2021-06-22 23:30:34
【Question】:
I previously asked a sorting question, and it was solved by first calling DataFrame.sort_values on two columns and then applying GroupBy.head.
Dataframe classification and sorting optimization problem
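For context, the sort_values plus GroupBy.head pattern from that earlier question looks roughly like the sketch below. The column names match the example later in this post; the fixed seed, the name `top4`, and the group size of 4 are assumptions made only for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed: an assumption, used only for reproducibility
n = 200
df = pd.DataFrame({
    "category": rng.choice(["A", "B"], n),
    "data1": rng.random(n) * 100,
    "data2": rng.random(n) * 100,
})

# Sort by category, then by data1 descending inside each category,
# and keep the first 4 rows of every category.
top4 = (
    df.sort_values(by=["category", "data1"], ascending=[True, False])
      .groupby("category")
      .head(4)
      .reset_index(drop=True)
)
print(top4)
```

Because the frame is already sorted before grouping, head(4) returns the 4 largest data1 values of each category without any per-group Python function.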
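Now I have a more complicated sort. I need to split the DataFrame by category. Within each category, I find the rows where data2 reaches its maximum for that category and take the largest data1 among them as a threshold. I then keep only the rows whose data1 does not exceed that threshold, sort them by data1 in descending order, and keep the top 4.
The code is below. How can I optimize it?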
import numpy as np
import pandas as pd
df = pd.DataFrame()
n = 200
df['category'] = np.random.choice(('A', 'B'), n)
df['data1'] = np.random.rand(len(df))*100
df['data2'] = np.random.rand(len(df))*100
a = df[df['category'] == 'A']
c = a[a['data2'] == a.data2.max()].data1.max()
a = a[a['data1'] <= c]
a = a.sort_values(by='data1', ascending=False).head(4)
b = df[df['category'] == 'B']
c = b[b['data2'] == b.data2.max()].data1.max()
b = b[b['data1'] <= c]
b = b.sort_values(by='data1', ascending=False).head(4)
df = pd.concat([a, b]).sort_values(by=['category', 'data1'], ascending=[True, False]).reset_index(drop=True)
print(df)
  category      data1      data2
0        A  28.194042  98.813271
1        A  26.635099  82.768130
2        A  24.345177  80.558532
3        A  24.222105  89.596726
4        B  60.883981  98.444699
5        B  49.934815  90.319787
6        B  10.751913  86.124271
7        B   4.029914  89.802120
I then tried groupby, but the resulting code feels too complicated. Can it be simplified?
import numpy as np
import pandas as pd
df = pd.DataFrame()
n = 200
df['category'] = np.random.choice(('A', 'B'), n)
df['data1'] = np.random.rand(len(df))*100
df['data2'] = np.random.rand(len(df))*100
a = df[df['category'] == 'A']
c = a[a['data2'] == a.data2.max()].data1.max()
a = a[a['data1'] <= c]
a = a.sort_values(by='data1', ascending=False).head(4)
b = df[df['category'] == 'B']
c = b[b['data2'] == b.data2.max()].data1.max()
b = b[b['data1'] <= c]
b = b.sort_values(by='data1', ascending=False).head(4)
df2 = pd.concat([a, b]).sort_values(by=['category', 'data1'], ascending=[True, False]).reset_index(drop=True)
df3 = df.groupby('category').apply(lambda x: x[x['data1'].isin(x[x['data1'] <= x[x['data2'] == x['data2'].max()].data1.max()]['data1'].nlargest(4))]).reset_index(drop=True)
df3 = df3.sort_values(by=['category', 'data1'], ascending=[True, False]).reset_index(drop=True)
print((df2.data1-df3.data1).max())
print((df2.data2-df3.data2).max())
0.0
0.0
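One possible way to express the same per-category rule without apply is to compute the threshold with GroupBy.transform and then reuse the sort_values plus GroupBy.head pattern. This is only a sketch: the helper name `top_below_threshold`, the parameter `k`, and the fixed seed are hypothetical and not part of the original post, and the sketch assumes there are no ties in data1.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed: an assumption, used only for reproducibility
n = 200
df = pd.DataFrame({
    "category": rng.choice(["A", "B"], n),
    "data1": rng.random(n) * 100,
    "data2": rng.random(n) * 100,
})

def top_below_threshold(frame, k=4):
    """Hypothetical helper: per category, keep the k largest data1 values
    that do not exceed the data1 found at that category's maximum data2."""
    # Mark the rows where data2 equals the maximum of its category.
    at_max = frame["data2"].eq(frame.groupby("category")["data2"].transform("max"))
    # Largest data1 among the marked rows, broadcast back to every row of the category.
    threshold = (
        frame["data1"].where(at_max)
        .groupby(frame["category"])
        .transform("max")
    )
    kept = frame[frame["data1"] <= threshold]
    return (
        kept.sort_values(by=["category", "data1"], ascending=[True, False])
            .groupby("category")
            .head(k)
            .reset_index(drop=True)
    )

result = top_below_threshold(df)
print(result)
```

The threshold is computed once for the whole frame, so the filter becomes a single boolean mask and the code no longer needs one copy per category.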
【Discussion】:
Tags: python pandas dataframe numpy pandas-groupby