【Question title】: Faster method of standardizing DF
【Posted】: 2021-08-23 07:53:01
【Question】:

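I have a df with roughly 3,000 variables and 14,000 data points.

I need to standardize the df both within groups and across the whole df, which creates 6,000 variables in total.

My current implementation is as follows: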

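col_names = df.columns.to_list()
col_names.remove('id')
for col in col_names:
    # within-group z-score, stored in a new '<col>_id' column
    df[col + '_id'] = df.groupby('id')[col].transform(lambda x: (x - x.mean())/x.std())
    # z-score over the whole column, overwriting the original values
    df[col] = (df[col] - df[col].mean())/ df[col].std()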

The code above takes a very long time to run.

Timing the two operations separately shows that the groupby-transform is significantly slower.
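The timing comparison described above can be reproduced along these lines. This is a minimal sketch on synthetic data: the frame sizes, the column names `v0`, `v1`, ... and the random seed are assumptions chosen so it runs quickly, not the real 3,000 x 14,000 data.

```python
import time

import numpy as np
import pandas as pd

# Synthetic stand-in for the real data (sizes are assumptions, kept small on purpose).
rng = np.random.default_rng(0)
n_rows, n_cols, n_groups = 2000, 20, 100
df = pd.DataFrame(rng.normal(size=(n_rows, n_cols)),
                  columns=[f"v{i}" for i in range(n_cols)])
df["id"] = rng.integers(0, n_groups, size=n_rows)
col_names = [c for c in df.columns if c != "id"]

# Time the within-group standardization (groupby + transform with a lambda).
start = time.perf_counter()
within = {col: df.groupby("id")[col].transform(lambda x: (x - x.mean()) / x.std())
          for col in col_names}
t_group = time.perf_counter() - start

# Time the whole-column standardization.
start = time.perf_counter()
overall = {col: (df[col] - df[col].mean()) / df[col].std() for col in col_names}
t_overall = time.perf_counter() - start

print(f"groupby-transform: {t_group:.3f}s, whole-column: {t_overall:.3f}s")
```

The absolute numbers depend on the machine and the pandas version; only the ratio between the two timings is of interest.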

Here is a simple example df and the desired output.

import pandas as pd

dic = {'id': [1,1,1, 2,2,2, 3,3,3,3,3, 4,4,4,4,4 ,5,5,5,5], 'a': [3,4,2,5,6,7,5,4,3,5,7,5,2,4,8,6,2,3,4,6], 'b': [12,32,21,14,52,62,12,34,52,74,2,34,54,12,45,75,54,23,12,32]}

df = pd.DataFrame(dic)

col_names = df.columns.to_list()
col_names.remove('id')

for col in col_names:
    df[col+'_id'] = df.groupby('id')[col].transform(lambda x: (x-x.mean())/x.std())
    df[col] = (df[col] - df[col].mean())/ df[col].std()

    id         a         b      a_id      b_id
0    1 -0.879967 -1.060367  0.000000 -0.965060
1    1 -0.312247 -0.154070  1.000000  1.031615
2    1 -1.447688 -0.652533 -1.000000 -0.066556
3    2  0.255474 -0.969737 -1.000000 -1.131971
4    2  0.823195  0.752226  0.000000  0.368549
5    2  1.390916  1.205374  1.000000  0.763422
6    3  0.255474 -1.060367  0.134840 -0.778742
7    3 -0.312247 -0.063441 -0.539360 -0.027324
8    3 -0.879967  0.752226 -1.213560  0.587472
9    3  0.255474  1.749152  0.134840  1.338890
10   3  1.390916 -1.513515  1.483240 -1.120296
11   4  0.255474 -0.063441  0.000000 -0.427765
12   4 -1.447688  0.842856 -1.341641  0.427765
13   4 -0.312247 -1.060367 -0.447214 -1.368847
14   4  1.958637  0.435022  1.341641  0.042776
15   4  0.823195  1.794467  0.447214  1.326070
16   5 -1.447688  0.842856 -1.024695  1.332707
17   5 -0.879967 -0.561904 -0.439155 -0.406826
18   5 -0.312247 -1.060367  0.146385 -1.024080
19   5  0.823195 -0.154070  1.317465  0.098199

【Comments】:

  • The fix here is to not use a lambda.
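One way to read this comment: replace the lambda with pandas' built-in `"mean"` and `"std"` transforms, which broadcast each group's statistic back onto the rows. The sketch below assumes a shortened version of the example data and checks the result against the lambda version; the names `within` and `ref` are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2],
    "a": [3, 4, 2, 5, 6, 7],
    "b": [12, 32, 21, 14, 52, 62],
})
col_names = [c for c in df.columns if c != "id"]

g = df.groupby("id")[col_names]

# transform("mean") / transform("std") return frames shaped like df[col_names],
# holding each row's group mean and group standard deviation.
within = ((df[col_names] - g.transform("mean")) / g.transform("std")).add_suffix("_id")

# Reference result from the lambda-based version, for comparison.
ref = g.transform(lambda x: (x - x.mean()) / x.std()).add_suffix("_id")

print(np.allclose(within.to_numpy(), ref.to_numpy()))
```

Because the string-named transforms avoid calling a Python function once per group and column, this form is usually much faster on wide frames, although the actual gain should be measured on the real data.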

Tags: python pandas dataframe optimization standardization


【Solution 1】:

Try it without the for loop:

df[[x+'_id' for x in col_names]]=df.groupby('id')[col_names].transform(lambda x: (x - x.mean())/x.std())

df[col_names] = (df[col_names] - df[col_names].mean())/ df[col_names].std()

Output of df:

    id         a         b      a_id      b_id
0    1 -0.879967 -1.060367  0.000000 -0.965060
1    1 -0.312247 -0.154070  1.000000  1.031615
2    1 -1.447688 -0.652533 -1.000000 -0.066556
3    2  0.255474 -0.969737 -1.000000 -1.131971
4    2  0.823195  0.752226  0.000000  0.368549
5    2  1.390916  1.205374  1.000000  0.763422
6    3  0.255474 -1.060367  0.134840 -0.778742
7    3 -0.312247 -0.063441 -0.539360 -0.027324
8    3 -0.879967  0.752226 -1.213560  0.587472
9    3  0.255474  1.749152  0.134840  1.338890
10   3  1.390916 -1.513515  1.483240 -1.120296
11   4  0.255474 -0.063441  0.000000 -0.427765
12   4 -1.447688  0.842856 -1.341641  0.427765
13   4 -0.312247 -1.060367 -0.447214 -1.368847
14   4  1.958637  0.435022  1.341641  0.042776
15   4  0.823195  1.794467  0.447214  1.326070
16   5 -1.447688  0.842856 -1.024695  1.332707
17   5 -0.879967 -0.561904 -0.439155 -0.406826
18   5 -0.312247 -1.060367  0.146385 -1.024080
19   5  0.823195 -0.154070  1.317465  0.098199

【Comments】:

  • Switch the steps

【Solution 2】:

Try set_index plus plain arithmetic to standardize the whole frame, and groupby transform + add_suffix to standardize within groups, then combine the two with concat:

new_df = df.set_index('id')
new_df = pd.concat((
    (new_df - new_df.mean()) / new_df.std(),
    new_df.groupby(level=0).transform(lambda x: (x - x.mean()) / x.std())
        .add_suffix('_id')
), axis=1).reset_index()

new_df:

    id         a         b      a_id      b_id
0    1 -0.879967 -1.060367  0.000000 -0.965060
1    1 -0.312247 -0.154070  1.000000  1.031615
2    1 -1.447688 -0.652533 -1.000000 -0.066556
3    2  0.255474 -0.969737 -1.000000 -1.131971
4    2  0.823195  0.752226  0.000000  0.368549
5    2  1.390916  1.205374  1.000000  0.763422
6    3  0.255474 -1.060367  0.134840 -0.778742
7    3 -0.312247 -0.063441 -0.539360 -0.027324
8    3 -0.879967  0.752226 -1.213560  0.587472
9    3  0.255474  1.749152  0.134840  1.338890
10   3  1.390916 -1.513515  1.483240 -1.120296
11   4  0.255474 -0.063441  0.000000 -0.427765
12   4 -1.447688  0.842856 -1.341641  0.427765
13   4 -0.312247 -1.060367 -0.447214 -1.368847
14   4  1.958637  0.435022  1.341641  0.042776
15   4  0.823195  1.794467  0.447214  1.326070
16   5 -1.447688  0.842856 -1.024695  1.332707
17   5 -0.879967 -0.561904 -0.439155 -0.406826
18   5 -0.312247 -1.060367  0.146385 -1.024080
19   5  0.823195 -0.154070  1.317465  0.098199
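A result like the one above can be sanity-checked by confirming that the whole-frame columns have mean 0 and standard deviation 1 over all rows, and that the `_id` columns have mean 0 and standard deviation 1 inside every group. The sketch below rebuilds the output from the question's example data; the helper name `zscore` is an assumption introduced for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 5],
    "a": [3, 4, 2, 5, 6, 7, 5, 4, 3, 5, 7, 5, 2, 4, 8, 6, 2, 3, 4, 6],
    "b": [12, 32, 21, 14, 52, 62, 12, 34, 52, 74, 2, 34, 54, 12, 45, 75, 54, 23, 12, 32],
})
col_names = ["a", "b"]


def zscore(x):
    """Subtract the mean and divide by the sample standard deviation (ddof=1)."""
    return (x - x.mean()) / x.std()


# Same layout as the outputs shown in this thread: id, a, b, a_id, b_id.
out = pd.concat(
    [
        df[["id"]],
        zscore(df[col_names]),
        df.groupby("id")[col_names].transform(zscore).add_suffix("_id"),
    ],
    axis=1,
)

# Whole-frame columns: mean ~0 and std ~1 across all rows.
overall_stats = out[col_names].agg(["mean", "std"])
# Within-group columns: mean ~0 and std ~1 inside every id.
group_stats = out.groupby("id")[["a_id", "b_id"]].agg(["mean", "std"])

print(overall_stats.round(6))
print(group_stats.round(6))
```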

【Comments】:

【Solution 3】:

df.sub also accepts a level argument. Along the same lines, we can also try the following:

    g = df.groupby("id")[col_names]
    u = df.set_index("id")[col_names].sub(g.mean(),level=0).div(g.std())
    
    out = ((df[col_names]-df[col_names].mean()).div(df[col_names].std())
            .assign(**u.add_suffix("_id").reset_index()))
    

    print(out)
    
               a         b  id      a_id      b_id
    0  -0.879967 -1.060367   1  0.000000 -0.965060
    1  -0.312247 -0.154070   1  1.000000  1.031615
    2  -1.447688 -0.652533   1 -1.000000 -0.066556
    3   0.255474 -0.969737   2 -1.000000 -1.131971
    4   0.823195  0.752226   2  0.000000  0.368549
    5   1.390916  1.205374   2  1.000000  0.763422
    6   0.255474 -1.060367   3  0.134840 -0.778742
    7  -0.312247 -0.063441   3 -0.539360 -0.027324
    8  -0.879967  0.752226   3 -1.213560  0.587472
    9   0.255474  1.749152   3  0.134840  1.338890
    10  1.390916 -1.513515   3  1.483240 -1.120296
    11  0.255474 -0.063441   4  0.000000 -0.427765
    12 -1.447688  0.842856   4 -1.341641  0.427765
    13 -0.312247 -1.060367   4 -0.447214 -1.368847
    14  1.958637  0.435022   4  1.341641  0.042776
    15  0.823195  1.794467   4  0.447214  1.326070
    16 -1.447688  0.842856   5 -1.024695  1.332707
    17 -0.879967 -0.561904   5 -0.439155 -0.406826
    18 -0.312247 -1.060367   5  0.146385 -1.024080
    19  0.823195 -0.154070   5  1.317465  0.098199
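To see the index-aligned subtraction that this answer relies on in isolation, here is a small sketch on a four-row frame (the data and the names `wide`, `group_mean` and `centered` are assumptions for illustration). Each row is matched with its group's mean through the shared `id` index, so no lambda is needed.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 1, 2, 2], "a": [1.0, 3.0, 10.0, 20.0]})

# Rows indexed by id; the index repeats once per row in the group.
wide = df.set_index("id")[["a"]]
# One row per id holding that group's mean.
group_mean = df.groupby("id")[["a"]].mean()

# Subtract each group's mean from its rows by aligning on the id index.
centered = wide.sub(group_mean, level=0)
print(centered)
```

This sketch keeps the rows sorted by `id`, as in the example data of this thread; whether row order is preserved for unsorted ids is worth checking before relying on positional alignment afterwards.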
    

【Comments】:

  • I think you forgot to add the id column back.
  • @HenryEcker Thanks, quick fix applied.