【Question title】: Python Pandas Group by c3 find max of column 2 and get column 1
【Posted】: 2015-08-20 02:52:39
【Question description】:

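I am trying to reproduce some complex SQL operations in Python. To start with, the requirement is to find, for each department, the EMP_ID of the employee with the highest salary. There are three steps:

  1. Group by department (DEPT)

  2. Max(Salary) for each department

  3. get(Emp_Id) for each department

Sample file (CSV):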

EMP_ID,NAME,AGE,ADDRESS,SAL,DEPT,LOC
1,ghk,3,PTBP,23,IME,bhmd
2,ghk,3,PTBP,23,IME,bhmd
3,ghk,3,PTBP,23,IME,bhmd
4,ghk,3,PTBP,23,IME-DATA,bhmd
5,ghk,3,PTBP,24,IME-DATA,bhmd
6,ghk,3,PTBP,23,IME,bhmd
7,ghk,3,PTBP,23,IME,bhmd
8,ghk,3,PTBP,29,IME-NA,bhmd
9,ghk,3,PTBP,23,IME,bhmd
10,ghk,3,PTBP,23,IME-NA,bhmd

The code I tried:

import pandas as pd

# Load the sample data
df = pd.read_csv("SAM_JOINS.csv", sep=",")

# Derived column: EMP_ID + AGE
df["SYSTEM_REVENUE"] = df["EMP_ID"] + df["AGE"]
print(df)

# Count the rows in each department
b = df.groupby("DEPT")
gb1 = b.size().reset_index(name="Count")
print(gb1)

But I could not get max(salary) and the matching emp_id for each department. Please help me with this, as I am new to Python pandas.

【Comments】:

  • Can you post your desired output? Your explanation does not make clear what you want. For example, doesn't b.max() give you what you want?
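On the b.max() suggestion: a groupby max is computed for each column independently, so the EMP_ID it reports is the largest EMP_ID in the department, not necessarily the ID of the top earner. The following is a minimal sketch on the sample data from the question; the names csv_text and per_column_max are illustrative and do not come from the original post.

```python
from io import StringIO

import pandas as pd

# Sample data copied from the question
csv_text = """EMP_ID,NAME,AGE,ADDRESS,SAL,DEPT,LOC
1,ghk,3,PTBP,23,IME,bhmd
2,ghk,3,PTBP,23,IME,bhmd
3,ghk,3,PTBP,23,IME,bhmd
4,ghk,3,PTBP,23,IME-DATA,bhmd
5,ghk,3,PTBP,24,IME-DATA,bhmd
6,ghk,3,PTBP,23,IME,bhmd
7,ghk,3,PTBP,23,IME,bhmd
8,ghk,3,PTBP,29,IME-NA,bhmd
9,ghk,3,PTBP,23,IME,bhmd
10,ghk,3,PTBP,23,IME-NA,bhmd"""
df = pd.read_csv(StringIO(csv_text))

# max() is taken per column, so EMP_ID and SAL can come from different rows
per_column_max = df.groupby("DEPT")[["EMP_ID", "SAL"]].max()
print(per_column_max)

# Salary actually earned by the employee whose ID is reported for IME-NA
reported_id = per_column_max.loc["IME-NA", "EMP_ID"]
actual_sal = df.loc[df["EMP_ID"] == reported_id, "SAL"].iloc[0]
print(reported_id, actual_sal)
```

For IME-NA this pairs EMP_ID 10 with SAL 29, although employee 10 earns 23 and the salary of 29 belongs to employee 8.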

Tags: python python-2.7 python-3.x numpy pandas


【Solution 1】:

You can use the groupby transform method.

Basically, this line:

df['DEPT_MAX_SAL'] = df.groupby('DEPT')['SAL'].transform(lambda x: x.max())

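puts the department's maximum salary on every row, and then all you have to do is subset from there. I have included an IPython session that runs this on your data. Note that because your sample data has little variation in the SAL field, the example does not look particularly clean.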

In [1]: from io import StringIO
   ...: import pandas as pd
   ...: 

In [2]: # Create data set for pandas to read from
   ...: data = """EMP_ID,NAME,AGE,ADDRESS,SAL,DEPT,LOC
   ...: 1,ghk,3,PTBP,23,IME,bhmd
   ...: 2,ghk,3,PTBP,23,IME,bhmd
   ...: 3,ghk,3,PTBP,23,IME,bhmd
   ...: 4,ghk,3,PTBP,23,IME-DATA,bhmd
   ...: 5,ghk,3,PTBP,24,IME-DATA,bhmd
   ...: 6,ghk,3,PTBP,23,IME,bhmd
   ...: 7,ghk,3,PTBP,23,IME,bhmd
   ...: 8,ghk,3,PTBP,29,IME-NA,bhmd
   ...: 9,ghk,3,PTBP,23,IME,bhmd
   ...: 10,ghk,3,PTBP,23,IME-NA,bhmd"""
   ...: data = StringIO(data)
   ...: 

In [3]: # Load dataset
   ...: df = pd.read_csv(data)
   ...: print(df)
   ...: 
   EMP_ID NAME  AGE ADDRESS  SAL      DEPT   LOC
0       1  ghk    3    PTBP   23       IME  bhmd
1       2  ghk    3    PTBP   23       IME  bhmd
2       3  ghk    3    PTBP   23       IME  bhmd
3       4  ghk    3    PTBP   23  IME-DATA  bhmd
4       5  ghk    3    PTBP   24  IME-DATA  bhmd
5       6  ghk    3    PTBP   23       IME  bhmd
6       7  ghk    3    PTBP   23       IME  bhmd
7       8  ghk    3    PTBP   29    IME-NA  bhmd
8       9  ghk    3    PTBP   23       IME  bhmd
9      10  ghk    3    PTBP   23    IME-NA  bhmd

In [4]: # Create new column of department max salary
   ...: df['DEPT_MAX_SAL'] = df.groupby('DEPT')['SAL'].transform(lambda x: x.max())
   ...: print(df)
   ...: 
   EMP_ID NAME  AGE ADDRESS  SAL      DEPT   LOC  DEPT_MAX_SAL
0       1  ghk    3    PTBP   23       IME  bhmd            23
1       2  ghk    3    PTBP   23       IME  bhmd            23
2       3  ghk    3    PTBP   23       IME  bhmd            23
3       4  ghk    3    PTBP   23  IME-DATA  bhmd            24
4       5  ghk    3    PTBP   24  IME-DATA  bhmd            24
5       6  ghk    3    PTBP   23       IME  bhmd            23
6       7  ghk    3    PTBP   23       IME  bhmd            23
7       8  ghk    3    PTBP   29    IME-NA  bhmd            29
8       9  ghk    3    PTBP   23       IME  bhmd            23
9      10  ghk    3    PTBP   23    IME-NA  bhmd            29

In [5]: # Subset to show only employees with max salary in department
   ...: print(df[df['SAL'] == df['DEPT_MAX_SAL']])
   EMP_ID NAME  AGE ADDRESS  SAL      DEPT   LOC  DEPT_MAX_SAL
0       1  ghk    3    PTBP   23       IME  bhmd            23
1       2  ghk    3    PTBP   23       IME  bhmd            23
2       3  ghk    3    PTBP   23       IME  bhmd            23
4       5  ghk    3    PTBP   24  IME-DATA  bhmd            24
5       6  ghk    3    PTBP   23       IME  bhmd            23
6       7  ghk    3    PTBP   23       IME  bhmd            23
7       8  ghk    3    PTBP   29    IME-NA  bhmd            29
8       9  ghk    3    PTBP   23       IME  bhmd            23
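If one EMP_ID per department is enough, groupby idxmax may be a shorter route than adding a helper column. This is a minimal sketch rather than part of the original answer. It assumes ties can be broken by keeping the first matching row in file order, and the names csv_text, top_idx and top_per_dept are illustrative.

```python
from io import StringIO

import pandas as pd

# Sample data copied from the question
csv_text = """EMP_ID,NAME,AGE,ADDRESS,SAL,DEPT,LOC
1,ghk,3,PTBP,23,IME,bhmd
2,ghk,3,PTBP,23,IME,bhmd
3,ghk,3,PTBP,23,IME,bhmd
4,ghk,3,PTBP,23,IME-DATA,bhmd
5,ghk,3,PTBP,24,IME-DATA,bhmd
6,ghk,3,PTBP,23,IME,bhmd
7,ghk,3,PTBP,23,IME,bhmd
8,ghk,3,PTBP,29,IME-NA,bhmd
9,ghk,3,PTBP,23,IME,bhmd
10,ghk,3,PTBP,23,IME-NA,bhmd"""
df = pd.read_csv(StringIO(csv_text))

# For each department, the index label of the first row holding the highest SAL
top_idx = df.groupby("DEPT")["SAL"].idxmax()

# Look those rows up and keep only the columns of interest
top_per_dept = df.loc[top_idx, ["DEPT", "EMP_ID", "SAL"]].reset_index(drop=True)
print(top_per_dept)
```

On the sample data this keeps EMP_ID 1 for IME, 5 for IME-DATA and 8 for IME-NA. The other IME employees who also earn 23 are dropped, so the transform approach above is the better fit when every tied employee must be listed.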

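【Comments】:

  • It does work. Suppose the formula is x = a/b, where b, the denominator, is the count for that particular dept_id, and a, the numerator, is the emp_id chosen by max(salary) for each department. This calculation should be applied to all records. I'm almost there and just need a push. Please help me.
  • Unfortunately, your data isn't clean enough to support this easily. It shows multiple employees with the highest salary in a given department, so you need to specify what you want the output to look like: do you want duplicate records, or some kind of field that concatenates the ids?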
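One possible reading of the "field that concatenates the ids" option is sketched below: one row per department, with the tied top earners joined into a comma-separated string and the department headcount alongside. The output layout and the names csv_text, is_top, summary, MAX_SAL, TOP_EMP_IDS and DEPT_COUNT are assumptions made for illustration. The asker never confirmed the desired format, and the x = a/b calculation is left out because a is ambiguous when several employees tie.

```python
from io import StringIO

import pandas as pd

# Sample data copied from the question
csv_text = """EMP_ID,NAME,AGE,ADDRESS,SAL,DEPT,LOC
1,ghk,3,PTBP,23,IME,bhmd
2,ghk,3,PTBP,23,IME,bhmd
3,ghk,3,PTBP,23,IME,bhmd
4,ghk,3,PTBP,23,IME-DATA,bhmd
5,ghk,3,PTBP,24,IME-DATA,bhmd
6,ghk,3,PTBP,23,IME,bhmd
7,ghk,3,PTBP,23,IME,bhmd
8,ghk,3,PTBP,29,IME-NA,bhmd
9,ghk,3,PTBP,23,IME,bhmd
10,ghk,3,PTBP,23,IME-NA,bhmd"""
df = pd.read_csv(StringIO(csv_text))

# Rows whose salary equals the maximum salary of their department
is_top = df["SAL"] == df.groupby("DEPT")["SAL"].transform("max")

# One row per department: the max salary and all tied EMP_IDs joined as text
summary = (
    df[is_top]
    .groupby("DEPT")
    .agg(
        MAX_SAL=("SAL", "first"),
        TOP_EMP_IDS=("EMP_ID", lambda s: ",".join(s.astype(str))),
    )
    .reset_index()
)

# Headcount of each department (the denominator b from the comment above)
summary["DEPT_COUNT"] = summary["DEPT"].map(df["DEPT"].value_counts())
print(summary)
```

On the sample data, IME lists "1,2,3,6,7,9" with a headcount of 6, while IME-DATA and IME-NA list "5" and "8" with a headcount of 2 each.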