Finding z-scores of data in a test dataframe in Pandas

【Question title】: Finding z-scores of data in a test dataframe in Pandas
【Posted】: 2018-02-07 12:05:24
【Question】:

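I have grouped data that is split into a training set and a test set, and I want to compute z-scores. On the training set this is easy, because I can use built-in functions to calculate the mean and standard deviation.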

Here is an example, where I am finding z-scores by place:

import pandas as pd
import numpy as np

# My example dataframe

train = pd.DataFrame({'place':     ['Winterfell','Winterfell','Winterfell','Winterfell','Dorne', 'Dorne','Dorne'],
                      'temp' : [ 23 , 10 , 0 , -32, 90, 110, 100 ]})
test  = pd.DataFrame({'place': ['Winterfell', 'Winterfell', 'Dorne'],
                      'temp' : [6, -8, 100]})

# get the z-scores by group for the training set
train.loc[: , 'z' ] = train.groupby('place')['temp'].transform(lambda x: (x - x.mean()) / x.std())

The training dataframe now takes the following form:

|    place   | temp |   z   |
|------------|------|-------|
| Winterfell |    23| 0.969 |
| Winterfell |    10| 0.415 |
| Winterfell |     0|-0.011 |
| Winterfell |   -32|-1.374 |
|      Dorne |    90|-1.000 |
|      Dorne |   110| 1.000 |
|      Dorne |   100| 0.000 |

This is what I want.

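The problem is that I now want to use the means and standard deviations from the training set to calculate the z-scores in the test set. I can get the mean and standard deviation easily enough: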

summary = train.groupby('place').agg({'temp' : [np.mean, np.std]} ).xs('temp',axis=1,drop_level=True)

print(summary)

          mean        std
place                        
Dorne       100.00  10.000000
Winterfell    0.25  23.471614
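
For reference, the same per-group table can be built a little more directly by selecting the column before aggregating and passing the aggregation names as strings. This is a minimal sketch using the example data from the question, not part of the original post. It relies on the pandas default of a sample standard deviation (`ddof=1`), which matches the numbers printed above.

```python
import pandas as pd

train = pd.DataFrame({
    'place': ['Winterfell', 'Winterfell', 'Winterfell', 'Winterfell',
              'Dorne', 'Dorne', 'Dorne'],
    'temp': [23, 10, 0, -32, 90, 110, 100],
})

# Select the column first, then aggregate by name. The result has flat
# 'mean' and 'std' columns indexed by place, so no .xs() step is needed.
summary = train.groupby('place')['temp'].agg(['mean', 'std'])
print(summary)
```

Because the result is indexed by `place`, it can be used directly as a lookup table for the test set.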

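I have some convoluted ways of doing what I want, but since this is a task I have to do often, I am looking for a tidy way of doing it. Here is what I have tried so far: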

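Making a dictionary `dict` out of the summary table, from which I can pull the mean and standard deviation as a tuple. Then on the test set, I can do an apply: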
```
test.loc[: , 'z'] = test.apply(lambda row: (row.temp - dict[row.place][0]) / dict[row.place][1] ,axis = 1)
```

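Why I don't like it:

- The dictionary makes it hard to read; you need to know what the structure of `dict` is.
- If a place shows up in the test set but not in the training set, the code throws an error instead of giving NaN.

1. Using the index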
```
test.set_index('place', inplace = True)
test.loc[:, 'z'] = (test['temp'] - summary['mean'])/summary['std']
```

Why I don't like it:

- It looks like it should work, but it only gives me NaNs.

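The final result should be the test set with a `z` column computed from the training-set means and standard deviations. Is there a standard Pythonic way of doing this kind of combination?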

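【Comments】:

This answer might help you: stackoverflow.com/questions/24761998/…
Thanks! I saw that one while writing up my solution, although it focuses on computing z-scores from the data within a dataframe rather than using means from a separate dataframe. The time series example there does come close to what I want, though.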

Tags: python pandas dataframe

【Solution 1】:

Option 1
pd.Series.map

test.assign(z=
    (test.temp - test.place.map(summary['mean'])) / test.place.map(summary['std'])
)

        place  temp         z
0  Winterfell     6  0.244977
1  Winterfell    -8 -0.351488
2       Dorne   100  0.000000
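
One concern in the question was a place that appears in the test set but not in the training set. With `Series.map`, a label that is missing from the mapping Series comes back as NaN rather than raising an error, so the z-score for such a row is NaN. A minimal sketch of that behaviour follows. The place `'Braavos'` and its temperature are made up for illustration and are not in the original data, and `summary` is rebuilt here only to keep the snippet self-contained.

```python
import pandas as pd

train = pd.DataFrame({
    'place': ['Winterfell', 'Winterfell', 'Winterfell', 'Winterfell',
              'Dorne', 'Dorne', 'Dorne'],
    'temp': [23, 10, 0, -32, 90, 110, 100],
})
summary = train.groupby('place')['temp'].agg(['mean', 'std'])

# 'Braavos' is a hypothetical place that never occurs in the training set.
test = pd.DataFrame({'place': ['Winterfell', 'Dorne', 'Braavos'],
                     'temp': [6, 100, 15]})

# map() looks each place up in the index of summary['mean'] / summary['std'];
# places that are not found map to NaN, which propagates into z.
result = test.assign(
    z=(test['temp'] - test['place'].map(summary['mean']))
      / test['place'].map(summary['std'])
)
print(result)
```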

Option 2
pd.DataFrame.eval

test.assign(z=
    test.join(summary, on='place').eval('(temp - mean) / std')
)

        place  temp         z
0  Winterfell     6  0.244977
1  Winterfell    -8 -0.351488
2       Dorne   100  0.000000
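
Since the question mentions that this has to be done often, the join-based idea can be wrapped in a small helper. This is only a sketch under stated assumptions. The function name `zscore_from_train` and its parameters are hypothetical, it computes the statistics itself rather than reusing the `summary` table, and it assumes the test frame has no existing `mean`, `std`, or `z` columns.

```python
import pandas as pd


def zscore_from_train(train, test, group_col, value_col):
    """Return a copy of `test` with a 'z' column based on `train` group stats.

    Hypothetical helper for illustration; not from the original answer.
    """
    stats = train.groupby(group_col)[value_col].agg(['mean', 'std'])
    # DataFrame.join is a left join by default, so the test rows keep their
    # order and groups unseen in training get NaN statistics (and a NaN z).
    joined = test.join(stats, on=group_col)
    return test.assign(z=(joined[value_col] - joined['mean']) / joined['std'])


train = pd.DataFrame({
    'place': ['Winterfell', 'Winterfell', 'Winterfell', 'Winterfell',
              'Dorne', 'Dorne', 'Dorne'],
    'temp': [23, 10, 0, -32, 90, 110, 100],
})
test = pd.DataFrame({'place': ['Winterfell', 'Winterfell', 'Dorne'],
                     'temp': [6, -8, 100]})

out = zscore_from_train(train, test, 'place', 'temp')
print(out)
```

The helper returns a new dataframe and leaves both inputs unmodified, which keeps it safe to call repeatedly on different test sets.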

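【Comments】:

Thanks! This is exactly what I was looking for.
Option 2 is an interesting application of assign and eval on a temporary join.
Yes, option 2 is really neat!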