【Question title】: Optimization when searching for max value and percentage max using Python Pandas
【Posted】: 2021-04-10 05:25:06
【Question】:

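I have a df like the one below

Target output

I tried the code below, but it only produces the output for a single column, so I would have to add a for loop to get the whole result.

My data is large. Is there a faster solution?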

import pandas as pd

data = {'item':["y1","y2","y3","y4","y5","y6","y7","y8","y9","y10"],
        'X1':  [1,1,1,1,1,7,7,7,5,4],
        'X2':  [8,9,10,10,10,8,8,10,8,9],
        'X3':  [11,12,13,11,11,11,11,11,1,2],
        }
df = pd.DataFrame(data, columns = ['item', 'X1','X2','X3'])
# get count of unique values 
df['X1'].nunique()
# get max Value
df['X1'].value_counts().idxmax()
# get percentage of max value 
df['X1'].value_counts().max()/df['X1'].size
# get Second value of Max Value
(df.nlargest(2, ['X1'])['X1']).value_counts().idxmax()
# Get Second Value of % 
df['X1'][df['X1']==(df.nlargest(2, ['X1'])['X1']).value_counts().idxmax()].size/df['X1'].size

【Comments】:

    Tags: python pandas dataframe bigdata


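    【Solution 1】:

    You can build a dictionary for each column you want to test. Because Series.value_counts sorts by count in descending order by default, the most frequent and second most frequent values can be read by position (index 0 and index 1):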

    L = []
    cols = ['X1','X2','X3'] 
    
    for c in cols:
        u = df[c].nunique()
        a = df[c].value_counts()
        d = {'No of unique': u, 
             'Highest rep': a.index[0],
             '% of Highest rep': a.iat[0] / len(df),
             'Second Highest rep': a.index[1],
             'Second % of Highest rep': a.iat[1] / len(df)}
        L.append(d)
    
    
    # use a new name so the original df is not overwritten
    result = pd.DataFrame(L, index=cols)
    print (result)
        No of unique  Highest rep  % of Highest rep  Second Highest rep  \
    X1             4            1               0.5                   7   
    X2             3           10               0.4                   8   
    X3             5           11               0.6                  13   
    
        Second % of Highest rep  
    X1                      0.3  
    X2                      0.4  
    X3                      0.1 
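
As a side note, the percentages can also be read directly from `value_counts(normalize=True)`, which returns shares instead of raw counts while keeping the same descending order. The sketch below is only an illustration on the `X1` column from the question; the variable names are made up for this example. One assumption to keep in mind: with the default `dropna=True`, the shares are computed over non-null rows only, so they match `a.iat[0] / len(df)` only when the column has no missing values.

```python
import pandas as pd

# X1 column from the question
s = pd.Series([1, 1, 1, 1, 1, 7, 7, 7, 5, 4], name="X1")

# normalize=True returns each value's share of the (non-null) rows
# instead of a raw count; the result is still sorted most frequent first.
shares = s.value_counts(normalize=True)

top_value, top_share = shares.index[0], shares.iat[0]
second_value, second_share = shares.index[1], shares.iat[1]

print(top_value, top_share)        # 1 0.5
print(second_value, second_share)  # 7 0.3
```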
    

    A more general solution that first tests whether a second most frequent value exists:

    import numpy as np

    L = []
    cols = ['X1','X2','X3'] 
    
    for c in cols:
        u = df[c].nunique()
        a = df[c].value_counts()
        
        if len(a) > 1:
            secondmax = a.index[1]
            secondperc = a.iat[1] / len(df)
        else:
            secondmax = np.nan
            secondperc = np.nan
            
        d = {'No of unique': u, 
             'Highest rep': a.index[0],
             '% of Highest rep': a.iat[0] / len(df),
             'Second Highest rep': secondmax,
             'Second % of Highest rep': secondperc}
    
             
        L.append(d)
    
    # use a new name so the original df is not overwritten
    result = pd.DataFrame(L, index=cols)
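
If the summary is needed for many frames, the loop above could be wrapped in a small helper. The following is a minimal sketch, not part of the original answer: the function name `top_two_summary` and the constant column `C` are hypothetical and exist only to show the fallback to `np.nan`. It assumes every listed column has at least one non-null value.

```python
import numpy as np
import pandas as pd


def top_two_summary(frame, cols):
    """Summarise the two most frequent values of each column in `cols`.

    Hypothetical helper; assumes each column has at least one non-null value.
    """
    n = len(frame)
    rows = []
    for c in cols:
        counts = frame[c].value_counts()  # sorted, most frequent first
        has_second = len(counts) > 1      # False for a constant column
        rows.append({
            'No of unique': frame[c].nunique(),
            'Highest rep': counts.index[0],
            '% of Highest rep': counts.iat[0] / n,
            'Second Highest rep': counts.index[1] if has_second else np.nan,
            'Second % of Highest rep': counts.iat[1] / n if has_second else np.nan,
        })
    return pd.DataFrame(rows, index=cols)


# X1 is taken from the question; C is a made-up constant column
demo = pd.DataFrame({
    'X1': [1, 1, 1, 1, 1, 7, 7, 7, 5, 4],
    'C':  [3] * 10,
})
summary = top_two_summary(demo, ['X1', 'C'])
print(summary)
```

For the constant column `C` the two "second" fields come out as NaN, which is the case the `len(a) > 1` test in the answer guards against.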
    

    【Comments】:
