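【Question title】: Replace grouped columns' outliers with mean of the group based on defined zscore
【Posted】: 2021-02-15 06:10:10
【Question description】:

I have a very large DataFrame with many data points on a map, and the outliers in the dataset lie very close to one another (in latitude and longitude). I want to group all rows by column A, as shown below, compute their z-scores, and replace every value in a group whose z-score is > 1.5 with the mean of that group.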

df =

[data][1]

I tried computing a table of z-scores, without success:

zscore = lambda x: (x - x.mean()) / x.std()
grouped_df = df.groupby("A")
transformed_df = grouped_df.transform(zscore)
# transformed_df is a table of z-scores

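【Question discussion】:

  • Hi, it is not clear which column (or columns) you want to compute the z-score on: is it the distance between points that share the same "A" label? Or lat and lon separately?
  • Yes. It is the distance between points with the same label after grouping by the values of column A (like df.groupby("A")), but the z-score is computed from lat and lon.

Tags: python dataframe data-science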


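【Solution 1】:

You can use haversine_distances from scikit-learn to compute the distance between each point and the centroid of the group it belongs to. Since your points should lie very close together, you can approximate the centroid's latitude and longitude by the mean latitude and mean longitude of the points in the group.

Here is an example based on data for UK towns (you can download a free sample from here). In particular, the data contains the coordinates and the county of each town (in your setting, you can treat the county as the group):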
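To make the distance-to-centroid idea concrete, here is a minimal sketch that writes the haversine formula directly in NumPy rather than calling scikit-learn. The helper name haversine_km, the Earth radius of 6371 km, and the three sample points are assumptions chosen for illustration.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# Hypothetical group of three points (degrees); the third lies farther away
lats = np.array([51.78000, 51.77815, 51.83440])
lons = np.array([0.28172, 0.27685, 0.91066])

# Approximate the centroid by the mean latitude and mean longitude
centroid_lat, centroid_lon = lats.mean(), lons.mean()

# Distance of every point to the approximate centroid
dist_km = haversine_km(lats, lons, centroid_lat, centroid_lon)
print(dist_km)
```

Averaging latitudes and longitudes is only a reasonable centroid approximation when the points in a group are close together, which is the situation described in the question.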

                          name          county  latitude  longitude
0                 Aaron's Hill          Surrey  51.18291   -0.63098
1                  Abbas Combe        Somerset  51.00283   -2.41825
2                     Abberley  Worcestershire  52.30522   -2.37574
3                     Abberton           Essex  51.83440    0.91066
4                     Abberton  Worcestershire  52.17955   -2.00817
5                    Abberwick  Northumberland  55.41325   -1.79720
6                   Abbess End           Essex  51.78000    0.28172
7                Abbess Roding           Essex  51.77815    0.27685
8                        Abbey           Devon  50.88896   -3.22276
9  Abbeycwmhir / Abaty Cwm-hir           Powys  52.33104   -3.38988

Here is the code, which you will need to adapt to your problem:

import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import haversine_distances

df = pd.read_csv('uk-towns-sample.csv', usecols=['name', 'county', 'latitude', 'longitude'])

# Compute coordinates of the centroid for each county (group)
dist_county = df.groupby('county')[['latitude', 'longitude']].mean()

# Convert latitude and longitude to radians (haversine_distances expects radians)
df[['latitude_radians', 'longitude_radians']] = np.radians(df[['latitude', 'longitude']]).to_numpy()
dist_county[['latitude_radians', 'longitude_radians']] = np.radians(dist_county[['latitude', 'longitude']]).to_numpy()

# Compute the distance of each town w.r.t. the centroid of its county
df['dist'] = df[['county', 'latitude_radians', 'longitude_radians']].apply(
    lambda x: haversine_distances(
        [x[['latitude_radians', 'longitude_radians']].values],
        [dist_county.loc[x['county']][['latitude_radians', 'longitude_radians']].values]
    )[0][0] * 6371,  # multiply by the Earth's radius in km to get kilometers
    axis=1
)

# Compute mean and std of distances by county
county_stats = df.groupby('county').agg({'dist': ['mean', 'std']})

# Compute the z-score using the distance of each town w.r.t. the centroid of its county, and the mean and std of distances for that county
df['zscore'] = df.apply(
    lambda x: (x['dist'] - county_stats.loc[x['county']][('dist', 'mean')] ) / county_stats.loc[x['county']][('dist', 'std')],
    axis=1
)

# Change latitude and longitude of the outliers with those of the centroid of their counties
df.loc[df.zscore > 1.5, ['latitude', 'longitude']] = df[df.zscore > 1.5].merge(
    dist_county, left_on='county', right_on=dist_county.index, how='left'
)[['latitude_y', 'longitude_y']].values
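The row-wise apply calls above can be slow on a very large DataFrame. As a possible alternative, the sketch below performs the same steps (centroid, distance, z-score, replacement) with groupby(...).transform and vectorized NumPy. The toy data, the column name group, and the hand-written haversine formula are assumptions made so the example is self-contained.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: group "a" contains one point far from the others
df = pd.DataFrame({
    "group": ["a"] * 6 + ["b"] * 3,
    "latitude": [51.0, 51.01, 51.02, 51.01, 51.0, 53.0, 40.0, 40.01, 40.02],
    "longitude": [0.0, 0.01, 0.0, 0.02, 0.01, 2.0, 5.0, 5.01, 5.0],
})

# Per-row centroid of the row's group (mean latitude and mean longitude)
centroid = df.groupby("group")[["latitude", "longitude"]].transform("mean")

# Vectorized haversine distance (km) from each point to its group centroid
lat1, lon1 = np.radians(df["latitude"]), np.radians(df["longitude"])
lat2, lon2 = np.radians(centroid["latitude"]), np.radians(centroid["longitude"])
a = (np.sin((lat2 - lat1) / 2) ** 2
     + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
df["dist"] = 2 * 6371 * np.arcsin(np.sqrt(a))  # 6371 km: assumed Earth radius

# z-score of each distance relative to the distances within its group
dist_by_group = df.groupby("group")["dist"]
df["zscore"] = (df["dist"] - dist_by_group.transform("mean")) / dist_by_group.transform("std")

# Replace the coordinates of outliers with the centroid of their group
outlier = df["zscore"] > 1.5
df.loc[outlier, ["latitude", "longitude"]] = centroid.loc[outlier, ["latitude", "longitude"]]
print(df)
```

As in the code above, the centroid is computed with the outliers still included, so the replacement coordinates are pulled slightly toward them.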

The resulting DataFrame df looks like this:

              name           county  latitude  longitude  latitude_radians  longitude_radians       dist    zscore
0     Aaron's Hill           Surrey  51.18291   -0.63098          0.893310          -0.011013  12.479147 -0.293419
1      Abbas Combe         Somerset  51.00283   -2.41825          0.890167          -0.042206  35.205157  1.088695
2         Abberley   Worcestershire  52.30522   -2.37574          0.912898          -0.041464  17.014249  0.266168
3         Abberton            Essex  51.83440    0.91066          0.904681           0.015894  24.504285 -0.254400
4         Abberton   Worcestershire  52.17955   -2.00817          0.910705          -0.035049  11.906150 -0.663460
...            ...              ...       ...        ...               ...                ...        ...       ...
1795         Ayton     Berwickshire  55.84232   -2.12285          0.974632          -0.037051   5.899085  0.007876
1796         Ayton    Tyne and Wear  54.89416   -1.55643          0.958084          -0.027165   3.192591 -0.935937

If you look at the outliers in the county of Essex, their new coordinates are those of the centroid, i.e. (51.846594, 0.554532):

             name county   latitude  longitude
414   Aimes Green  Essex  51.846594   0.554532
1721       Aveley  Essex  51.846594   0.554532
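If the quantity of interest is a single numeric column rather than a pair of coordinates, the replacement described in the question title reduces to a few lines. This is only a sketch: the column name value and the sample numbers are hypothetical, and the threshold is one-sided (z-score > 1.5) as stated in the question.

```python
import pandas as pd

# Hypothetical data: group "x" has one value far above the rest
df = pd.DataFrame({
    "A": ["x"] * 6 + ["y"] * 3,
    "value": [1.0, 1.1, 0.9, 1.0, 1.2, 9.0, 5.0, 5.1, 4.9],
})

grouped = df.groupby("A")["value"]
group_mean = grouped.transform("mean")

# z-score of each value within its own group (sample standard deviation)
zscore = (df["value"] - group_mean) / grouped.transform("std")

# Where the z-score exceeds 1.5, take the group mean instead of the value
df["value"] = df["value"].mask(zscore > 1.5, group_mean)
print(df)
```

To treat unusually low values as outliers as well, compare zscore.abs() against the threshold instead.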
