How to fill NAs with the group mean in R?

【Question title】: How to fill mean for NAs in column by groups in r?
【Posted】: 2021-05-13 09:26:14
【Question】:

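I have a dataset with several NAs. For each column, I want to take the mean within each group and use it to fill that group's NAs. My dataset looks like this: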

PID Category    column1 column2 column3
123    1             54    2.4  NA
324    1             52    NA   21.1
356    1             NA    3.6  25.6
378    2             56    3.2  NA
395    2             NA    3.5  29.9
362    2             45    NA   24.3
789    3             65   12.6  23.8
759    3             66    NA   26.8
762    3             NA    NA   27.2
741    3             69   8.5   23.3

The output I want is:

PID Category    column1 column2 column3
123    1             54   2.4   23.3
324    1             52   3.0   21.1
356    1             53   3.6   25.6
378    2             56   3.2   27.1
395    2             50.5 3.5   29.9
362    2             61.3 3.3   24.3
789    3             65   12.6  23.8
759    3             66   10.5  26.8
762    3             66.6 10.5  27.2
741    3             69   8.5   23.3

Thanks

【Comments】:

What is the logic for filling the NAs? In column1 you have the values 54 and 52, but the NA is replaced with 61.3?

Tags: r if-statement dplyr

【Solution 1】:

You can use:

library(dplyr)

df %>%
  group_by(Category) %>%
  mutate(across(starts_with('column'), 
                ~replace(., is.na(.), mean(., na.rm = TRUE)))) %>%
  ungroup

#     PID Category column1 column2 column3
#   <int>    <int>   <dbl>   <dbl>   <dbl>
# 1   123        1    54      2.4     23.4
# 2   324        1    52      3       21.1
# 3   356        1    53      3.6     25.6
# 4   378        2    56      3.2     27.1
# 5   395        2    50.5    3.5     29.9
# 6   362        2    45      3.35    24.3
# 7   789        3    65     12.6     23.8
# 8   759        3    66     10.6     26.8
# 9   762        3    66.7   10.6     27.2
#10   741        3    69      8.5     23.3
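
To see what the lambda inside across() does to a single column within a single group, here is a minimal base R sketch. It uses only the column1 values of Category 1 from the question; the variable names x, group_mean and filled are illustrative and do not come from the answer.

```r
# column1 values for Category 1, taken from the question's data
x <- c(54, 52, NA)

# Mean of the observed values only; na.rm = TRUE drops the NA
group_mean <- mean(x, na.rm = TRUE)

# replace() returns a copy of x with the NA positions overwritten
filled <- replace(x, is.na(x), group_mean)
print(filled)
# [1] 54 52 53
```

group_by(Category) makes mutate() run this same step once per group, so each NA gets the mean of its own category rather than the overall column mean.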

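【Comments】:

To get the output the OP wants, doesn't this need group_by(Category)?
That's right. An earlier version of the OP's post showed a different, ungrouped output. Thanks, and congratulations on reaching 20k ;-)

【Solution 2】:

We can use na.aggregate from zoo, which by default replaces the NAs in each column with that column's mean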

library(dplyr)
library(zoo)
df1 %>%
   group_by(Category) %>%
   mutate(across(starts_with('column'), na.aggregate)) %>%
   ungroup

Or use group_modify with na.aggregate, as @G. Grothendieck suggested in the comments

df1 %>% 
  group_by(Category) %>% 
  group_modify(na.aggregate) %>%
  ungroup

Or use data.table

library(data.table)
nm1 <- grep("^column\\d+$", names(df1), value = TRUE)
setDT(df1)[, (nm1) := na.aggregate(.SD), by = Category, .SDcols = nm1]

Or base R

unsplit(lapply(split(df1, df1$Category), na.aggregate), df1$Category)
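
The base R line above still relies on zoo for na.aggregate. If no extra package is wanted, one possible sketch uses ave(), which applies a function within groups and returns the result in the original row order. The data frame below is retyped from the question; fill_mean, cols and filled are illustrative names, not part of the answer.

```r
# Data retyped from the question
df1 <- data.frame(
  PID      = c(123, 324, 356, 378, 395, 362, 789, 759, 762, 741),
  Category = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
  column1  = c(54, 52, NA, 56, NA, 45, 65, 66, NA, 69),
  column2  = c(2.4, NA, 3.6, 3.2, 3.5, NA, 12.6, NA, NA, 8.5),
  column3  = c(NA, 21.1, 25.6, NA, 29.9, 24.3, 23.8, 26.8, 27.2, 23.3)
)

# Illustrative helper: fill the NAs of a vector with the mean of its observed values
fill_mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))

# Columns to fill, selected by name prefix
cols <- grep("^column", names(df1), value = TRUE)

# ave() runs fill_mean separately within each Category and keeps the row order
filled <- df1
filled[cols] <- lapply(df1[cols], function(col) {
  ave(col, df1$Category, FUN = fill_mean)
})

print(filled)
```

Because ave() preserves row order, PID and Category stay aligned with the filled columns without any re-binding step.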

【Comments】:

Or, because all the columns are numeric and the other columns have no NAs: df1 %>% group_by(Category) %>% group_modify(na.aggregate) %>% ungroup

【Solution 3】:

Another data.table option

cbind(
  setDT(df)[, "PID"],
  df[,
    lapply(
      .SD,
      function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))
    ), Category,
    .SDcols = patterns("^column")
  ]
)

which gives

   PID Category  column1 column2 column3
 1: 123        1 54.00000    2.40   23.35
 2: 324        1 52.00000    3.00   21.10
 3: 356        1 53.00000    3.60   25.60
 4: 378        2 56.00000    3.20   27.10
 5: 395        2 50.50000    3.50   29.90
 6: 362        2 45.00000    3.35   24.30
 7: 789        3 65.00000   12.60   23.80
 8: 759        3 66.00000   10.55   26.80
 9: 762        3 66.66667   10.55   27.20
10: 741        3 69.00000    8.50   23.30
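
One edge case of the replace(x, is.na(x), mean(x, na.rm = TRUE)) idiom is not exercised by the question's data: when a group has no observed values at all, the mean of zero values is NaN, so the NAs become NaN. The sketch below shows this and one possible guard; fill_group_mean is a hypothetical helper, not something from the answers.

```r
# Hypothetical helper (not from the answers): fill NAs with the mean of the
# observed values, but leave the vector unchanged when nothing is observed.
fill_group_mean <- function(x) {
  if (all(is.na(x))) {
    return(x)  # no observed values, so there is no mean to fill with
  }
  replace(x, is.na(x), mean(x, na.rm = TRUE))
}

all_missing <- c(NA_real_, NA_real_)

# The plain idiom turns the NAs into NaN, because the mean of zero values is NaN
plain <- replace(all_missing, is.na(all_missing), mean(all_missing, na.rm = TRUE))
print(plain)
# [1] NaN NaN

# The guarded helper keeps them as NA
guarded <- fill_group_mean(all_missing)
print(guarded)
# [1] NA NA

# With at least one observed value it behaves like the plain idiom
print(fill_group_mean(c(56, NA, 45)))
# [1] 56.0 50.5 45.0
```

Whether NA or NaN is preferable for an all-missing group depends on the downstream analysis; the guard only keeps the distinction visible.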
