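Replace NA in a DataFrame with the mean per country: answers

【Question title】: Replace NA in DataFrame for multiple columns with mean per country
【Posted】: 2022-01-16 07:20:44
【Question description】:

I want to replace each NA value with the mean of the other values in the same column, that is, the values from the other years.

Note:
To replace the NA values in the Canada data, I want to use only the mean for Canada, not the mean of the whole dataset.

Here is a sample data frame filled with random numbers. It also contains some NA values, like the ones I find in my real data frame: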

Country	Inhabitants	Year	Area	Cats	Dogs
Canada	38 000 000	2021	4	32	21
Canada	37 000 000	2020	4	NA	21
Canada	36 000 000	2019	3	32	21
Canada	NA	2018	2	32	21
Canada	34 000 000	2017	NA	32	21
Canada	35 000 000	2016	3	32	NA
Brazil	212 000 000	2021	5	32	21
Brazil	211 000 000	2020	4	NA	21
Brazil	210 000 000	2019	NA	32	21
Brazil	209 000 000	2018	4	32	21
Brazil	NA	2017	2	32	21
Brazil	207 000 000	2016	4	32	NA

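What is the simplest way to replace those NA values with the mean of the other years using pandas? Is it possible to write code that goes through every NA and replaces it (Inhabitants, Area, Cats, Dogs in one go)?

【Comments】:

Could you provide code that contains the data frame?
The data (Excel) is: happiness-report.s3.amazonaws.com/2021/DataPanelWHR2021C2.xls

Tags: python pandas dataframe nan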

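【Solution 1】:

Note: the example is based on the additional data source from your comments.

Replacing NA values with mean() across multiple columns can be done by combining the following three methods:

fillna() (filling column by column; axis should be 0, which is the default for fillna())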
groupby()
transform()
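How the three methods fit together can be sketched on a small frame modelled on the sample table in the question. The values, the `toy` frame and the `value_cols` list are illustrative assumptions, not part of the original data:

```python
import numpy as np
import pandas as pd

# Toy data modelled on the question's sample table (values are made up).
toy = pd.DataFrame({
    "Country": ["Canada", "Canada", "Canada", "Brazil", "Brazil", "Brazil"],
    "Year": [2021, 2020, 2019, 2021, 2020, 2019],
    "Area": [4.0, 4.0, 3.0, 5.0, 4.0, np.nan],
    "Cats": [32.0, np.nan, 30.0, 10.0, np.nan, 20.0],
})

value_cols = ["Area", "Cats"]  # columns to fill; Year is left alone

# groupby + transform gives the per-country mean, aligned row by row with toy
country_means = toy.groupby("Country")[value_cols].transform("mean")

# fillna with a DataFrame fills each NA from the matching row and column
filled = toy.copy()
filled[value_cols] = toy[value_cols].fillna(country_means)
print(filled)
```

Selecting the value columns explicitly is a design choice of this sketch: it keeps identifier-like columns such as the year out of the averaging step.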

Create the data frame from your example:

import pandas as pd
df = pd.read_excel('https://happiness-report.s3.amazonaws.com/2021/DataPanelWHR2021C2.xls')

Country name	year	Life Ladder	Log GDP per capita	Social support	Healthy life expectancy at birth	Freedom to make life choices	Generosity	Perceptions of corruption	Positive affect	Negative affect
Canada	2005	7.41805	10.6518	0.961552	71.3	0.957306	0.25623	0.502681	0.838544	0.233278
Canada	2007	7.48175	10.7392	nan	71.66	0.930341	0.249479	0.405608	0.871604	0.25681
Canada	2008	7.4856	10.7384	0.938707	71.84	0.926315	0.261585	0.369588	0.89022	0.202175
Canada	2009	7.48782	10.6972	0.942845	72.02	0.915058	0.246217	0.412622	0.867433	0.247633
Canada	2010	7.65035	10.7165	0.953765	72.2	0.933949	0.230451	0.41266	0.878868	0.233113

Call `fillna()` and fill all columns with the means grouped by country name:

df = df.fillna(df.groupby('Country name').transform('mean'))
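One edge case is worth checking with this one-liner: if a country has no observed value at all in a column, its group mean is itself NaN, so those gaps stay unfilled. A minimal sketch on a made-up frame (the `toy` data and the `score` column are hypothetical):

```python
import numpy as np
import pandas as pd

# Country A has one observed score; country B has none at all.
toy = pd.DataFrame({
    "Country name": ["A", "A", "B", "B"],
    "score": [1.0, np.nan, np.nan, np.nan],
})

filled = toy.fillna(toy.groupby("Country name").transform("mean"))

# A's gap is filled with A's mean. B's group mean is NaN, so B's gaps remain.
still_missing = filled[filled["score"].isna()]
print(still_missing["Country name"].unique())
```

If such rows show up in the real data, one possible fallback is a second fillna() with the overall column mean, but whether that is appropriate depends on the analysis.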

Check your results for Canada:

df[df['Country name'] == 'Canada']

Country name	year	Life Ladder	Log GDP per capita	Social support	Healthy life expectancy at birth	Freedom to make life choices	Generosity	Perceptions of corruption	Positive affect	Negative affect
Canada	2005	7.41805	10.6518	0.961552	71.3	0.957306	0.25623	0.502681	0.838544	0.233278
Canada	2007	7.48175	10.7392	0.93547	71.66	0.930341	0.249479	0.405608	0.871604	0.25681
Canada	2008	7.4856	10.7384	0.938707	71.84	0.926315	0.261585	0.369588	0.89022	0.202175
Canada	2009	7.48782	10.6972	0.942845	72.02	0.915058	0.246217	0.412622	0.867433	0.247633
Canada	2010	7.65035	10.7165	0.953765	72.2	0.933949	0.230451	0.41266	0.878868	0.233113

【Discussion】:

【Solution 2】:

This also works:

In [2]:

df = pd.read_excel('DataPanelWHR2021C2.xls')

In [3]:

# Check for number of null values in df
df.isnull().sum()

Out[3]:

Country name                          0
year                                  0
Life Ladder                           0
Log GDP per capita                   36
Social support                       13
Healthy life expectancy at birth     55
Freedom to make life choices         32
Generosity                           89
Perceptions of corruption           110
Positive affect                      22
Negative affect                      16
dtype: int64

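Solution

In [4]:

# Fills NULL values with the overall column mean (not the per-country mean);
# numeric_only=True skips the text column 'Country name'
df.fillna(df.mean(numeric_only=True), inplace=True)

In [5]:

# 2nd check for number of null values
df.isnull().sum()

Out[5]: no more NULL values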

Country name                        0
year                                0
Life Ladder                         0
Log GDP per capita                  0
Social support                      0
Healthy life expectancy at birth    0
Freedom to make life choices        0
Generosity                          0
Perceptions of corruption           0
Positive affect                     0
Negative affect                     0
dtype: int64
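The overall-mean fill in Solution 2 and the per-country fill in Solution 1 both remove every NULL, but they put different values in the gaps. A small comparison on made-up data (the `toy` frame and its numbers are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Country": ["Canada", "Canada", "Canada", "Brazil", "Brazil", "Brazil"],
    "Cats": [30.0, np.nan, 10.0, 4.0, np.nan, 2.0],
})

# Overall column mean: every gap gets the same value, whatever the country
global_fill = toy.fillna(toy.mean(numeric_only=True))

# Per-country mean: each gap gets the mean of its own country
group_fill = toy.fillna(toy.groupby("Country").transform("mean"))

print(global_fill["Cats"].tolist())
print(group_fill["Cats"].tolist())
```

Since the question asks for Canada's gaps to be filled from Canada's values only, the grouped variant appears to be the one that matches that requirement.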

【Discussion】:
