Is there a way to collapse weighted means in R?

【Question title】: Is there a way to collapse weighted means in R?
【Posted】: 2020-08-14 09:26:26
【Question】:

I am trying to translate the following code from Stata to R:

collapse (mean) erate_total_male laborforce_male erate_total_male_1953 laborforce_male_1953 share_expellees_male share_dest_flats instrument share_agric_1939 city_state (max) occzone_occu [aw=laborforce_male], by(bundesland_id_1953 occupation_id)

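I tried the collapse package in R, but I am not sure how to incorporate the weights from the Stata code, or the max (although I could probably work around the latter by generating a new variable).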

test1 <- rep_data %>%
  mutate(bundesland_id_1953 = 
           case_when(
             bundesland_id == 8 ~ 99,
             bundesland_id == 9 ~ 99,
             bundesland_id == 10 ~ 99,
           )) %>%
  group_by(bundesland_id_1953, occupation_id) %>% 
  select(erate_total_male, laborforce_male, erate_total_male_1953, laborforce_male_1953, share_expellees_male, share_dest_flats, instrument_male, share_agric_1939, city_state, occzone_occu) %>% fmean

I also tried generating means for all the variables, but ran into the same problem when adding the weights:

t6Data2 <- rep_data %>%
  mutate(bundesland_id_1953 = 
           case_when(
             bundesland_id == 8 ~ 99,
             bundesland_id == 9 ~ 99,
             bundesland_id == 10 ~ 99,
           )) %>% 
  group_by(bundesland_id_1953, occupation_id) %>% summarise_at(vars(erate_total_male, laborforce_male, erate_total_male_1953, laborforce_male_1953, share_expellees_male, share_dest_flats, instrument_male, share_agric_1939, city_state)

Finally, I tried a loop, but my variables do not show up when I run a regression with lm():

test444 <- rep_data %>%
  mutate(bundesland_id_1953 = 
           case_when(
             bundesland_id == 8 ~ 99,
             bundesland_id == 9 ~ 99,
             bundesland_id == 10 ~ 99,
           )) %>% 
  group_by(bundesland_id_1953, occupation_id)

t6_data_test4 <- sapply(c(test444$erate_total_male, test444$laborforce_male, test444$erate_total_male_1953, test444$laborforce_male_1953, test444$share_expellees_male, test444$share_dest_flats, test444$instrument_male, test444$share_agric_1939, test444$city_state), function(x) {
  weighted.mean(x, weight = laborforce_male)
})

I am not sure how to do this, and I would appreciate any help. I am a relative novice, so I apologize for any obvious mistakes in my code.

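【Question comments】:

It is easier to help if you share your data with dput and show the expected output for it. Please read the guidance on how to ask a good question and how to provide a reproducible example.
I am a Stata person, and I can see that you are aiming at R people who also know Stata well. You are more likely to get a detailed response if you show a very small example dataset with a few rows (observations) and a few columns (variables), and explain directly what analytic weights mean when computing a mean. That is what the [aw=...] syntax does. In fact, the Stata syntax is arguably irrelevant here. You can simply ask how to do what you want in R.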

Tags: r stata mean collapse weighted

【Solution 1】:

This works:

library(dplyr)

d <- tibble::tibble(
  bundesland_id_1953 = sample(letters[1:3], 100, replace = TRUE),
  occupation_id = factor(sample(1:3, 100, replace = TRUE)),
  x = sample(1:5, 100, replace = TRUE),
  y = sample(1:5, 100, replace = TRUE),
  weight = runif(100)
)

d <- group_by(d, bundesland_id_1953, occupation_id)

bind_cols(
  group_keys(d),
  group_split(d) %>% 
    purrr::map_df(
      # [NOTE] use `across` in forthcoming dplyr 1.0.0
      ~ summarise_at(.x, vars(x, y), weighted.mean, w = .x$weight)
    )
)

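I am not happy with the solution above, because it is uglier than what the "tidy" tools should offer. Damn, it is less readable than Stata. I am disappointed in myself.

I am also skeptical of your weighting scheme: at some point, you seem to be weighting a variable by... itself? But of course I do not know the data.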
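
The note in the code above points to `across`. With dplyr 1.0.0 or later, the same grouped weighted means can probably be written more compactly. The following is a minimal sketch on the same kind of toy data; the column names `x`, `y`, and `weight` come from the example above, while the fixed seed and the object name `res` are assumptions made for illustration:

```r
library(dplyr)

set.seed(1)  # assumption: fixed seed so the toy data is reproducible
d <- tibble::tibble(
  bundesland_id_1953 = sample(letters[1:3], 100, replace = TRUE),
  occupation_id = factor(sample(1:3, 100, replace = TRUE)),
  x = sample(1:5, 100, replace = TRUE),
  y = sample(1:5, 100, replace = TRUE),
  weight = runif(100)
)

# Inside summarise(), `weight` refers to the current group's rows,
# so every column picked by across() is averaged with that group's weights.
res <- d %>%
  group_by(bundesland_id_1953, occupation_id) %>%
  summarise(
    across(c(x, y), ~ weighted.mean(.x, w = weight)),
    .groups = "drop"
  )
```

In this sketch the weight column is not one of the summarised columns. If a variable has to serve as both a summarised column and the weight, as in the question, it is worth checking the result against a manual calculation.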

【Comments】:

【Solution 2】:

Yes, a faithful translation of your Stata code into R is:

library(collapse)
collap(data, by = ~ bundesland_id_1953 + occupation_id, 
       custom = list(fmean = .c(erate_total_male, laborforce_male, erate_total_male_1953, laborforce_male_1953, 
                                share_expellees_male, share_dest_flats, instrument, share_agric_1939, city_state), 
                     fmax_uw = "occzone_occu"), w = ~ laborforce_male)

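Note the _uw suffix on fmax. According to the documentation (?collap), it keeps the weight vector from being passed to fmax (which cannot handle weights) and triggering an unused-argument warning. Also note that collap defaults to keep.w = TRUE and wFUN = fsum, so your weight vector "laborforce_male" will also be aggregated, using the sum. Another collapse option, with more dplyr-like code, is (let ... stand for the variables inside .c in the code above, typed without quotes):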

library(magrittr)
data %>% fgroup_by(bundesland_id_1953, occupation_id) %>% 
   collapg(custom = list(fmean = .c(...), fmax_uw = "occzone_occu"), 
           w = laborforce_male)

Finally, if you like programming with pipes, you can also build it from scratch:

data %>% fgroup_by(bundesland_id_1953, occupation_id) %>% {
     add_vars(fselect(., ...) %>% fmean(laborforce_male), 
              fselect(., occzone_occu) %>% fmax(keep.group_vars = FALSE)) 
}

If you are aggregating only with the weighted mean, the latter expression becomes simpler:

data %>% fgroup_by(bundesland_id_1953, occupation_id) %>% 
     fselect(...) %>% fmean(laborforce_male)
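
To sanity-check any of the calls above, it can help to compute the same quantities by hand on a small example. With analytic weights, the mean within a group is sum(w * x) / sum(w), which is what `weighted.mean(x, w)` returns. The sketch below uses base R only. The data frame `toy`, its values, and the output column `weight_sum` are made up for illustration; only the other column names are borrowed from the question:

```r
# Hypothetical toy data; the values are invented for illustration.
toy <- data.frame(
  bundesland_id_1953 = c(1, 1, 1, 2, 2),
  occupation_id      = c(1, 1, 2, 1, 1),
  erate_total_male   = c(0.50, 0.70, 0.60, 0.40, 0.80),
  occzone_occu       = c(0, 1, 0, 1, 1),
  laborforce_male    = c(100, 300, 50, 200, 200)
)

# One data frame per (bundesland, occupation) cell; empty cells are dropped.
cells <- split(toy, list(toy$bundesland_id_1953, toy$occupation_id), drop = TRUE)

check <- do.call(rbind, lapply(cells, function(g) {
  data.frame(
    bundesland_id_1953 = g$bundesland_id_1953[1],
    occupation_id      = g$occupation_id[1],
    # weighted mean; note that the argument is `w`, not `weight`
    erate_total_male   = weighted.mean(g$erate_total_male, w = g$laborforce_male),
    # unweighted maximum, matching (max) in the Stata call
    occzone_occu       = max(g$occzone_occu),
    # sum of the weights within the cell
    weight_sum         = sum(g$laborforce_male)
  )
}))
rownames(check) <- NULL
```

For the cell with bundesland 1 and occupation 1, the weighted mean is (0.5 * 100 + 0.7 * 300) / 400 = 0.65, which can be compared with the corresponding row of the aggregated output.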

Let me know if you run into any difficulties understanding the documentation of collap.

【Comments】: