Answers to: Replace grouped columns' outliers with mean of the group based on defined zscore

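【Question title】: Replace grouped columns' outliers with mean of the group based on defined zscore
【Posted】: 2021-02-15 06:10:10
【Question】:

I have a very large dataframe with many data points on a map, and the outliers in the dataset lie very close to each other (latitude and longitude). I want to group all rows by column A, as shown below, compute their zscore, and replace every value in a group whose zscore > 1.5 with the mean of that group.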

df =

[data][1]

I tried building a table of zscore values, without success:

zscore = lambda x: (x - x.mean()) / x.std()
grouped_df = df.groupby("A")
transformed_df = grouped_df.transform(zscore)

This gives me transformed_df, a table with the zscores.

【Comments on the question】:

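Hi, it is not clear which column (or columns) you want to compute the zscore on: is it on the distance between points that share the same "A" label? Is it on lat and lon separately?
Yes. It is on the distance between points that share the same label, after grouping by the values of column A (as in df.groupby("A")), but the zscore is computed from lat and lon.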

Tags: python dataframe data-science

【Solution 1】:

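You can use haversine_distances from scikit-learn to compute the distance between each point and the centroid of its group. Since your points should be very close to each other, you can approximate the latitude and longitude of the centroid with the mean latitude and longitude of the points in the group.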
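To make the distance step concrete, here is a minimal standard-library sketch of the haversine formula, which is the quantity haversine_distances returns before scaling by the Earth radius. The function name haversine_km and the constant EARTH_RADIUS_KM are illustrative, not part of scikit-learn. The three points are the Essex rows from the sample table shown further down, and the centroid is the plain mean of their coordinates, as suggested above.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# A small group of nearby points as (latitude, longitude) in degrees
points = [(51.83440, 0.91066), (51.78000, 0.28172), (51.77815, 0.27685)]

# Approximate the centroid by averaging latitudes and longitudes
centroid = (
    sum(lat for lat, _ in points) / len(points),
    sum(lon for _, lon in points) / len(points),
)

# Distance of every point from the approximate centroid, in km
distances = [haversine_km(lat, lon, *centroid) for lat, lon in points]
```

Averaging coordinates is only a reasonable centroid when the points are close together and far from the poles and the antimeridian, which matches the situation described in the question.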

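Here is an example based on data about UK towns (you can download a free sample from here). In particular, the data contains the coordinates and the county of each town (in your setting you can treat the county as the group):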

                          name          county  latitude  longitude
0                 Aaron's Hill          Surrey  51.18291   -0.63098
1                  Abbas Combe        Somerset  51.00283   -2.41825
2                     Abberley  Worcestershire  52.30522   -2.37574
3                     Abberton           Essex  51.83440    0.91066
4                     Abberton  Worcestershire  52.17955   -2.00817
5                    Abberwick  Northumberland  55.41325   -1.79720
6                   Abbess End           Essex  51.78000    0.28172
7                Abbess Roding           Essex  51.77815    0.27685
8                        Abbey           Devon  50.88896   -3.22276
9  Abbeycwmhir / Abaty Cwm-hir           Powys  52.33104   -3.38988

Here is code that you can adapt to your problem:

import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import haversine_distances

EARTH_RADIUS_KM = 6371.0

df = pd.read_csv('uk-towns-sample.csv', usecols=['name', 'county', 'latitude', 'longitude'])

# Compute the coordinates of the centroid of each county (group)
dist_county = df.groupby('county').agg({'latitude': 'mean', 'longitude': 'mean'})

# Convert latitude and longitude to radians (haversine_distances expects radians)
df[['latitude_radians', 'longitude_radians']] = np.radians(df[['latitude', 'longitude']])
dist_county[['latitude_radians', 'longitude_radians']] = np.radians(dist_county[['latitude', 'longitude']])

# Compute the distance (in km) of each town from the centroid of its county
df['dist'] = df[['county', 'latitude_radians', 'longitude_radians']].apply(
    lambda x: haversine_distances(
        [x[['latitude_radians', 'longitude_radians']].values],
        [dist_county.loc[x['county']][['latitude_radians', 'longitude_radians']].values]
    )[0][0] * EARTH_RADIUS_KM,  # multiply by the Earth radius to get kilometers
    axis=1
)

# Compute mean and std of the distances by county
county_stats = df.groupby('county').agg({'dist': ['mean', 'std']})

# Compute the z-score from the distance of each town to the centroid of its county,
# and the mean and std of the distances for that county
df['zscore'] = df.apply(
    lambda x: (x['dist'] - county_stats.loc[x['county']][('dist', 'mean')]) / county_stats.loc[x['county']][('dist', 'std')],
    axis=1
)

# Replace latitude and longitude of the outliers with those of the centroid of their county
outliers = df['zscore'] > 1.5
df.loc[outliers, ['latitude', 'longitude']] = dist_county.loc[
    df.loc[outliers, 'county'], ['latitude', 'longitude']
].values
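As a possible alternative to the row-wise apply calls, the per-group z-score and the replacement can be sketched with groupby(...).transform, which is close to what the question attempted. This is only a sketch on toy data: the column names A, value and value_clean and the numbers are hypothetical, and it works on a single numeric column rather than on distances.

```python
import pandas as pd

# Hypothetical toy data: group label "A" and one numeric column "value"
df = pd.DataFrame({
    "A": ["x"] * 6 + ["y"] * 3,
    "value": [1.0, 1.0, 1.0, 1.0, 1.0, 10.0, 5.0, 6.0, 7.0],
})

grouped = df.groupby("A")["value"]
group_mean = grouped.transform("mean")  # group mean, broadcast to every row
group_std = grouped.transform("std")    # sample standard deviation (ddof=1)

df["zscore"] = (df["value"] - group_mean) / group_std

# Replace values whose z-score exceeds the threshold with the group mean
threshold = 1.5
is_outlier = df["zscore"] > threshold
df["value_clean"] = df["value"].where(~is_outlier, group_mean)
```

The comparison is one-sided, as in the question; using df["zscore"].abs() instead would also flag unusually small values. Groups with a single member or zero spread get a NaN z-score and are left unchanged.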

The resulting DataFrame df looks like this:

              name           county  latitude  longitude  latitude_radians  longitude_radians       dist    zscore
0     Aaron's Hill           Surrey  51.18291   -0.63098          0.893310          -0.011013  12.479147 -0.293419
1      Abbas Combe         Somerset  51.00283   -2.41825          0.890167          -0.042206  35.205157  1.088695
2         Abberley   Worcestershire  52.30522   -2.37574          0.912898          -0.041464  17.014249  0.266168
3         Abberton            Essex  51.83440    0.91066          0.904681           0.015894  24.504285 -0.254400
4         Abberton   Worcestershire  52.17955   -2.00817          0.910705          -0.035049  11.906150 -0.663460
...            ...              ...       ...        ...               ...                ...        ...       ...
1795         Ayton     Berwickshire  55.84232   -2.12285          0.974632          -0.037051   5.899085  0.007876
1796         Ayton    Tyne and Wear  54.89416   -1.55643          0.958084          -0.027165   3.192591 -0.935937

If you look at the outliers for the county of Essex, their new coordinates are those of the centroid, i.e. (51.846594, 0.554532):

             name county   latitude  longitude
414   Aimes Green  Essex  51.846594   0.554532
1721       Aveley  Essex  51.846594   0.554532

【Comments】: