【问题标题】:Replace Iterations by elegant Pandas code用优雅的 Pandas 代码替换迭代
【发布时间】:2019-05-10 03:43:44
【问题描述】:

我正在尝试将我的旧学校代码更改为优雅/快速的 Pandas 代码,例如之前的问题之一:

Rolling operation slow performance to create a new column

我有 4 种不同的代码,我想通过使用 Pandas 的优雅而快速的代码来提高性能。

1) 所有类型的平均日期:

原始数据帧 (df) 与此类似(尽管要大得多):

idx = [np.array(['Jan-18', 'Jan-18', 'Feb-18', 'Mar-18', 'Mar-18', 'Mar-18','Apr-18', 'Apr-18', 'May-18', 'Jun-18', 'Jun-18', 'Jun-18','Jul-18', 'Aug-18', 'Aug-18', 'Sep-18', 'Sep-18', 'Oct-18','Oct-18', 'Oct-18', 'Nov-18', 'Dec-18', 'Dec-18',]),np.array(['A', 'B', 'B', 'A', 'B', 'C', 'A', 'B', 'B', 'A', 'B', 'C','A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'A', 'B', 'C'])]
data = [{'x': 1}, {'x': 5}, {'x': 3}, {'x': 2}, {'x': 7}, {'x': 3},{'x': 1}, {'x': 6}, {'x': 3}, {'x': 5}, {'x': 2}, {'x': 3},{'x': 1}, {'x': 9}, {'x': 3}, {'x': 2}, {'x': 7}, {'x': 3}, {'x': 6}, {'x': 8}, {'x': 2}, {'x': 7}, {'x': 9}]
df = pd.DataFrame(data, index=idx, columns=['x'])
df.index.names=['date','type']
df=df.reset_index()
df['date'] = pd.to_datetime(df['date'],format = '%b-%y')
df=df.set_index(['date','type'])

它看起来像这样:

                 x
date       type
2018-01-01 A     1
           B     5
2018-02-01 B     3
2018-03-01 A     2
           B     7
           C     3
2018-04-01 A     1
           B     6
2018-05-01 B     3
2018-06-01 A     5
           B     2
           C     3
2018-07-01 A     1
2018-08-01 B     9
           C     3
2018-09-01 A     2
           B     7
2018-10-01 C     3
           A     6
           B     8
2018-11-01 A     2
2018-12-01 B     7
           C     9

我的目标是改进这个缓慢的 for 循环代码。在我的代码下面:

df=df.reset_index()
df['y']=0
for j in df['date'].unique():
    list_1=list(df['type'][df['date']==j].index)
    df['y'][list_1]=np.mean(df['x'][df['date']==j])

结果如下:

         date type  x         y
0  2018-01-01    A  1  3.000000
1  2018-01-01    B  5  3.000000
2  2018-02-01    B  3  3.000000
3  2018-03-01    A  2  4.000000
4  2018-03-01    B  7  4.000000
5  2018-03-01    C  3  4.000000
6  2018-04-01    A  1  3.500000
7  2018-04-01    B  6  3.500000
8  2018-05-01    B  3  3.000000
9  2018-06-01    A  5  3.333333
10 2018-06-01    B  2  3.333333
11 2018-06-01    C  3  3.333333
12 2018-07-01    A  1  1.000000
13 2018-08-01    B  9  6.000000
14 2018-08-01    C  3  6.000000
15 2018-09-01    A  2  4.500000
16 2018-09-01    B  7  4.500000
17 2018-10-01    C  3  5.666667
18 2018-10-01    A  6  5.666667
19 2018-10-01    B  8  5.666667
20 2018-11-01    A  2  2.000000
21 2018-12-01    B  7  8.000000
22 2018-12-01    C  9  8.000000

** 我尝试了以下 Pandas 代码,但它不起作用(我仍然需要查看更多示例来了解它是如何工作的):

df['y'] = df.groupby('date')['x'].mean().reset_index(level=2, drop=True).swaplevel(0,1)

2) 所有类型的按日期观察(使用相同的数据框 (df)):

我的目标是衡量每个日期的类型数量。

我的慢代码是:

df=df.reset_index()
df['y']=0
for j in df['date'].unique():
    list_1=list(df['type'][df['date']==j].index)
    df['y'][list_1]=len(df['type'][df['date']==j])

结果如下:

         date type  x  y
0  2018-01-01    A  1  2
1  2018-01-01    B  5  2
2  2018-02-01    B  3  1
3  2018-03-01    A  2  3
4  2018-03-01    B  7  3
5  2018-03-01    C  3  3
6  2018-04-01    A  1  2
7  2018-04-01    B  6  2
8  2018-05-01    B  3  1
9  2018-06-01    A  5  3
10 2018-06-01    B  2  3
11 2018-06-01    C  3  3
12 2018-07-01    A  1  1
13 2018-08-01    B  9  2
14 2018-08-01    C  3  2
15 2018-09-01    A  2  2
16 2018-09-01    B  7  2
17 2018-10-01    C  3  3
18 2018-10-01    A  6  3
19 2018-10-01    B  8  3
20 2018-11-01    A  2  1
21 2018-12-01    B  7  2
22 2018-12-01    C  9  2

3) 'A' 类型的按日期观察(使用相同的数据框 (df)):

我的目标是测量每个日期的类型 A 的数量。

我的慢代码如下:

df=df.reset_index()
df['z']=0
df['y']=0

for index,row in df.iterrows():
    if row['type']=='A':
        df['z'][index]=1
    else:
        df['z'][index]=0

for j in df['date'].unique():
    list_1=list(df['type'][df['date']==j].index)
    df['y'][list_1]=sum(df['z'][df['date']==j])

del df['z']

它看起来像这样:

         date type  x  y
0  2018-01-01    A  1  1
1  2018-01-01    B  5  1
2  2018-02-01    B  3  0
3  2018-03-01    A  2  1
4  2018-03-01    B  7  1
5  2018-03-01    C  3  1
6  2018-04-01    A  1  1
7  2018-04-01    B  6  1
8  2018-05-01    B  3  0
9  2018-06-01    A  5  1
10 2018-06-01    B  2  1
11 2018-06-01    C  3  1
12 2018-07-01    A  1  1
13 2018-08-01    B  9  0
14 2018-08-01    C  3  0
15 2018-09-01    A  2  1
16 2018-09-01    B  7  1
17 2018-10-01    C  3  1
18 2018-10-01    A  6  1
19 2018-10-01    B  8  1
20 2018-11-01    A  2  1
21 2018-12-01    B  7  0
22 2018-12-01    C  9  0

4) 'A' 类型的观察结果,其 z 值 = 1:

请使用以下数据框(df1):

idx = [np.array(['Jan-18', 'Jan-18', 'Feb-18', 'Mar-18', 'Mar-18', 'Mar-18','Apr-18', 'Apr-18', 'May-18', 'Jun-18', 'Jun-18', 'Jun-18','Jul-18', 'Aug-18', 'Aug-18', 'Sep-18', 'Sep-18', 'Oct-18','Oct-18', 'Oct-18', 'Nov-18', 'Dec-18', 'Dec-18',]),np.array(['A', 'B', 'B', 'A', 'B', 'C', 'A', 'B', 'B', 'A', 'B', 'C','A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'A', 'B', 'C'])]
data = [{'x': 10, 'z': 1}, {'x': 50, 'z': 0}, {'x': 30, 'z': 0}, {'x': 20, 'z': 0}, {'x': 70, 'z': 1}, {'x': 30, 'z': 1},{'x': 10, 'z': 1}, {'x': 60, 'z': 0}, {'x': 30, 'z': 0}, {'x': 50, 'z': 1}, {'x': 20, 'z': 0}, {'x': 30, 'z': 1},{'x': 10, 'z': 0}, {'x': 90, 'z': 1}, {'x': 30, 'z': 1}, {'x': 20, 'z': 1}, {'x': 70, 'z': 0}, {'x': 30, 'z': 0}, {'x': 60, 'z': 1}, {'x': 80, 'z': 1}, {'x': 20, 'z': 0}, {'x': 70, 'z': 0}, {'x': 90, 'z': 1}]
df1 = pd.DataFrame(data, index=idx, columns=['x','z'])
df1.index.names=['date','type']
df1=df1.reset_index()
df1['date'] = pd.to_datetime(df1['date'],format = '%b-%y')
df1=df1.set_index(['date','type'])

此数据框 (df1) 如下所示:

                  x  z
date       type
2018-01-01 A     10  1
           B     50  0
2018-02-01 B     30  0
2018-03-01 A     20  0
           B     70  1
           C     30  1
2018-04-01 A     10  1
           B     60  0
2018-05-01 B     30  0
2018-06-01 A     50  1
           B     20  0
           C     30  1
2018-07-01 A     10  0
2018-08-01 B     90  1
           C     30  1
2018-09-01 A     20  1
           B     70  0
2018-10-01 C     30  0
           A     60  1
           B     80  1
2018-11-01 A     20  0
2018-12-01 B     70  0
           C     90  1

我的慢代码是:

df1=df1.reset_index()
df1['h']=0
df1['k']=0
df1['y']=0

for index,row in df1.iterrows():
    if row['type']=='A':
        df1['h'][index]=1
    else:
        df1['h'][index]=0

for index,row in df1.iterrows():
    if row['z']==1 and row['h']==1:
        df1['k'][index]=1
    else:
        df1['k'][index]=0   

for j in df1['date'].unique():
    list_1=list(df1['type'][df1['date']==j].index)
    df1['y'][list_1]=sum(df1['k'][df1['date']==j])

del df1['h']
del df1['k']

它看起来像这样:

         date type   x  z  y
0  2018-01-01    A  10  1  1
1  2018-01-01    B  50  0  1
2  2018-02-01    B  30  0  0
3  2018-03-01    A  20  0  0
4  2018-03-01    B  70  1  0
5  2018-03-01    C  30  1  0
6  2018-04-01    A  10  1  1
7  2018-04-01    B  60  0  1
8  2018-05-01    B  30  0  0
9  2018-06-01    A  50  1  1
10 2018-06-01    B  20  0  1
11 2018-06-01    C  30  1  1
12 2018-07-01    A  10  0  0
13 2018-08-01    B  90  1  0
14 2018-08-01    C  30  1  0
15 2018-09-01    A  20  1  1
16 2018-09-01    B  70  0  1
17 2018-10-01    C  30  0  1
18 2018-10-01    A  60  1  1
19 2018-10-01    B  80  1  1
20 2018-11-01    A  20  0  0
21 2018-12-01    B  70  0  0
22 2018-12-01    C  90  1  0

虽然这个问题看起来很长,但我知道 Pandas 的答案可以用几行代码来解决。如果您能告诉我您的代码比我的代码快多少,那将非常有用。

【问题讨论】:

    标签: python pandas performance numpy dataframe


    【解决方案1】:

    您正在寻找groupby + transform。这里.transform 是关键,因为它将结果广播回原始DataFrame 中属于该组的所有行。

    首先,我们可以简单地转换x

    对于唯一类型,将其带到一列比处理索引要快,因此assign一列并计算组内唯一值的#。

    对于最后两个条件,您可以创建一个布尔列来查看该行是否满足您在组内求和的条件。

    #1 Get the mean of `x` by date
    df['x_avg'] = df.groupby('date').x.transform('mean')
    
    #2 Get the # of unique types. 
    df['N'] = (df.assign(TYPE = df.index.get_level_values('type'))
                 .groupby('date').TYPE.transform('nunique'))
    
    #3 Get the number of Type == A within a group
    df['num_A']  = (df.assign(eqA = (df.index.get_level_values('type') == 'A'))
                      .groupby('date').eqA.transform(sum).astype(int))
    
    #4 Really just a slight extension of 3
    df1['cond_4']  = (df1.assign(to_sum = ((df1.index.get_level_values('type') == 'A')
                                           &  (df1.z == 1)).astype(int))
                         .groupby('date').to_sum.transform(sum))
    

    输出df:

                     x  num_A     x_avg  N
    date       type                       
    2018-01-01 A     1      1  3.000000  2
               B     5      1  3.000000  2
    2018-02-01 B     3      0  3.000000  1
    2018-03-01 A     2      1  4.000000  3
               B     7      1  4.000000  3
               C     3      1  4.000000  3
    2018-04-01 A     1      1  3.500000  2
               B     6      1  3.500000  2
    2018-05-01 B     3      0  3.000000  1
    2018-06-01 A     5      1  3.333333  3
               B     2      1  3.333333  3
               C     3      1  3.333333  3
    2018-07-01 A     1      1  1.000000  1
    2018-08-01 B     9      0  6.000000  2
               C     3      0  6.000000  2
    2018-09-01 A     2      1  4.500000  2
               B     7      1  4.500000  2
    2018-10-01 C     3      1  5.666667  3
               A     6      1  5.666667  3
               B     8      1  5.666667  3
    2018-11-01 A     2      1  2.000000  1
    2018-12-01 B     7      0  8.000000  2
               C     9      0  8.000000  2
    

    【讨论】:

    • 谢谢阿洛兹。 Pandas 的编码能力给我留下了深刻的印象。我需要开始弄清楚它是如何工作的。这个例子和答案非常有用。谢谢!
    【解决方案2】:

    对于第一种情况,您可以尝试以下方法:

    df['y'] = df.groupby('date')['x'].transform(np.mean)

    最后一种情况:

    df1 = df1.reset_index()
    df1['h'] = np.where(df1.reset_index()['type'] == 'A', 1, 0)
    df1['k'] = np.where((df1.h == 1) & (df1.z == 1), 1, 0)
    df1.y = df1.groupby('date')['k'].transform(np.sum)
    del df1['h']
    del df1['k']
    

    【讨论】:

      猜你喜欢
      • 2017-05-25
      • 2020-07-14
      • 2016-09-20
      • 2019-09-07
      • 2011-05-23
      • 2018-06-06
      • 1970-01-01
      • 1970-01-01
      • 2013-05-23
      相关资源
      最近更新 更多