【Question Title】:Dataframe classification and sorting optimization problem 2
【Posted】:2021-06-22 23:30:34
【Question】:

I asked a sorting question earlier, and it was solved by calling DataFrame.sort_values on two columns and then GroupBy.head:

Dataframe classification and sorting optimization problem
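
For reference, that earlier pattern (sort on two columns, then take the first rows of each group) can be sketched as below. The small frame and the name `top2` are made up for illustration and are not from the original question.

```python
import pandas as pd

# Illustrative data only: three rows per category
df = pd.DataFrame({'category': list('ABABAB'),
                   'data1': [1, 6, 3, 4, 5, 2]})

# Sort by category, then by data1 descending, and keep the first 2 rows per category
top2 = (df.sort_values(by=['category', 'data1'], ascending=[True, False])
          .groupby('category')
          .head(2)
          .reset_index(drop=True))
print(top2)
```

Because the frame is already sorted when `head` runs, each group's first rows are its largest `data1` values.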

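Now I have a more complicated sort. I need to group the DataFrame by category. For each category, take the data1 value of the row where data2 is at its maximum, keep only the rows whose data1 is less than or equal to that value, then sort by data1 in descending order and keep the top 4 rows.

My code is below. How can it be optimized?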

import numpy as np
import pandas as pd

df = pd.DataFrame()
n = 200
df['category'] = np.random.choice(('A', 'B'), n)
df['data1'] = np.random.rand(len(df))*100
df['data2'] = np.random.rand(len(df))*100

a = df[df['category'] == 'A']
c = a[a['data2'] == a.data2.max()].data1.max()
a = a[a['data1'] <= c]
a = a.sort_values(by='data1', ascending=False).head(4)

b = df[df['category'] == 'B']
c = b[b['data2'] == b.data2.max()].data1.max()
b = b[b['data1'] <= c]
b = b.sort_values(by='data1', ascending=False).head(4)

df = pd.concat([a, b]).sort_values(by=['category', 'data1'], ascending=[True, False]).reset_index(drop=True)
print(df)

  category      data1      data2
0        A  28.194042  98.813271
1        A  26.635099  82.768130
2        A  24.345177  80.558532
3        A  24.222105  89.596726
4        B  60.883981  98.444699
5        B  49.934815  90.319787
6        B  10.751913  86.124271
7        B   4.029914  89.802120

I tried groupby, but the code feels too complicated. Can it be simplified?

import numpy as np
import pandas as pd

df = pd.DataFrame()
n = 200
df['category'] = np.random.choice(('A', 'B'), n)
df['data1'] = np.random.rand(len(df))*100
df['data2'] = np.random.rand(len(df))*100

a = df[df['category'] == 'A']
c = a[a['data2'] == a.data2.max()].data1.max()
a = a[a['data1'] <= c]
a = a.sort_values(by='data1', ascending=False).head(4)

b = df[df['category'] == 'B']
c = b[b['data2'] == b.data2.max()].data1.max()
b = b[b['data1'] <= c]
b = b.sort_values(by='data1', ascending=False).head(4)

df2 = pd.concat([a, b]).sort_values(by=['category', 'data1'], ascending=[True, False]).reset_index(drop=True)
df3 = df.groupby('category').apply(lambda x: x[x['data1'].isin(x[x['data1'] <= x[x['data2'] == x['data2'].max()].data1.max()]['data1'].nlargest(4))]).reset_index(drop=True)
df3 = df3.sort_values(by=['category', 'data1'], ascending=[True, False]).reset_index(drop=True)

print((df2.data1-df3.data1).max())
print((df2.data2-df3.data2).max())

0.0
0.0
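
One caveat about this check: `.max()` of a signed difference only reports the largest positive gap, so a negative difference can still print `0.0`. A stricter comparison, sketched here with two made-up frames (`left` and `right` are illustrative names, not from the question), is to look at the absolute difference, or to use `DataFrame.equals` or `pandas.testing.assert_frame_equal`.

```python
import pandas as pd

# Two frames that differ in the second data1 value
left = pd.DataFrame({'data1': [3.0, 2.0], 'data2': [9.0, 8.0]})
right = pd.DataFrame({'data1': [3.0, 5.0], 'data2': [9.0, 8.0]})

signed_max = (left.data1 - right.data1).max()      # differences are [0.0, -3.0]
abs_max = (left.data1 - right.data1).abs().max()   # magnitude of the largest gap
same = left.equals(right)                          # element-wise equality of the frames

print(signed_max)  # 0.0, even though the frames differ
print(abs_max)     # 3.0
print(same)        # False

# assert_frame_equal raises AssertionError when the frames do not match
try:
    pd.testing.assert_frame_equal(left, right)
except AssertionError:
    print('frames differ')
```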

【Comments】:

    Tags: python pandas dataframe numpy pandas-groupby


    【Solution 1】:

    Using the sample data and code from the question:

    df = pd.DataFrame()
    n = 200
    df['category'] = np.random.choice(('A', 'B'), n)
    df['data1'] = np.random.rand(len(df))*100
    df['data2'] = np.random.rand(len(df))*100
    
    a = df[df['category'] == 'A']
    
    c = a[a['data2'] == a.data2.max()].data1.max()
    a = a[a['data1'] <= c]
    a = a.sort_values(by='data1', ascending=False).head(4)
    
    b = df[df['category'] == 'B']
    c = b[b['data2'] == b.data2.max()].data1.max()
    b = b[b['data1'] <= c]
    b = b.sort_values(by='data1', ascending=False).head(4)
    
    df1 = pd.concat([a, b]).sort_values(by=['category', 'data1'], ascending=[True, False]).reset_index(drop=True)
    print(df1)
      category      data1      data2
    0        A  87.560430  99.262452
    1        A  85.798945  99.200321
    2        A  68.614311  97.796274
    3        A  41.641961  95.544980
    4        B  69.937691  99.711156
    5        B  56.932784  99.227111
    6        B  19.903620  94.389186
    7        B  12.701288  98.455274
    

    First, get the data1 value from the row with the maximum data2 in each group, then filter with <=, and finally use GroupBy.head:

    s = (df.sort_values('data2')
           .drop_duplicates('category', keep='last')
           .set_index('category')['data1'])
    df = df[df['data1'] <= df['category'].map(s)]
    df1 = (df.sort_values(by=['category', 'data1'], ascending=[True, False])
             .groupby('category')
             .head(4)
             .reset_index(drop=True))
    print(df1)
      category      data1      data2
    0        A  87.560430  99.262452
    1        A  85.798945  99.200321
    2        A  68.614311  97.796274
    3        A  41.641961  95.544980
    4        B  69.937691  99.711156
    5        B  56.932784  99.227111
    6        B  19.903620  94.389186
    7        B  12.701288  98.455274
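
The same steps can also be wrapped in a reusable function. The following is only a sketch: the function name `top_per_category`, the parameter `n`, and the sample frame are hypothetical, and it swaps the `sort_values` plus `drop_duplicates` step for `GroupBy.idxmax`. When several rows tie for the maximum data2, `idxmax` takes the first of them, whereas the question's code takes the largest data1 among the tied rows; with random floats such ties are unlikely.

```python
import pandas as pd

def top_per_category(df, n=4):
    """Keep, per category, the n largest data1 values that do not exceed
    the data1 of the row holding that category's maximum data2."""
    # Row label of the largest data2 within each category
    idx = df.groupby('category')['data2'].idxmax()
    # Series mapping category -> data1 cutoff
    cutoff = df.loc[idx].set_index('category')['data1']
    # Keep rows at or below their category's cutoff
    kept = df[df['data1'] <= df['category'].map(cutoff)]
    return (kept.sort_values(by=['category', 'data1'], ascending=[True, False])
                .groupby('category')
                .head(n)
                .reset_index(drop=True))

# Small deterministic example (made-up values)
sample = pd.DataFrame({
    'category': ['A', 'A', 'A', 'B', 'B', 'B'],
    'data1': [10, 50, 30, 5, 70, 20],
    'data2': [1, 9, 5, 8, 2, 3],
})
result = top_per_category(sample, n=2)
print(result)
```

In this sample, category A has its maximum data2 on the row with data1 = 50, so A keeps 50 and 30; category B has its maximum data2 on the row with data1 = 5, so B keeps only that row.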
    

    【Comments】:
