Answers: Frequency table with second variable as "analytic weight" in R

【Question title】: Frequency table with second variable as "analytic weight" in R
【Posted】: 2020-04-24 10:16:54
【Question】:

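I would like to create a frequency table in R that treats another variable as a weight.

More precisely, as an "analytic weight", as in Stata. According to its help file,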

aweights, or analytic weights, are weights that are inversely
        proportional to the variance of an observation; i.e., the variance of
        the jth observation is assumed to be sigma^2/w_j, where w_j are the
        weights.  Typically, the observations represent averages and the
        weights are the number of elements that gave rise to the average.
        For most Stata commands, the recorded scale of aweights is
        irrelevant; Stata internally rescales them to sum to N, the number of
        observations in your data, when it uses them.
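
To make the rescaling described in the help text concrete, here is a minimal sketch in base R. The weights are made-up toy values (not taken from the survey data), and the names `w`, `n` and `w_scaled` are only illustrative. It shows one way to rescale weights so that they sum to the number of observations; it is not a claim about Stata's internal code.

```r
# Toy weights (made-up values, not from the survey data).
w <- c(8, 2, 4, 6)

# Number of observations.
n <- length(w)

# Rescale so that the weights sum to n, as the help text describes.
w_scaled <- w * n / sum(w)

w_scaled       # rescaled weights
sum(w_scaled)  # equals n
```

The relative sizes of the weights are unchanged; only their overall scale differs.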

A helpful contribution from a Stack Overflow member was:

> Table_WEIGHT <- xtabs(WEIGHT ~ INTERVIEW_DAY, timeuse_2003)
> Prop <- prop.table(Table_WEIGHT)
> Cum <- cumsum(100 * Prop / sum(Prop))
> Cum
        1         2         3         4         5         6         7 
 14.35397  29.14973  43.23935  57.31355  71.50782  85.80359 100.00000 
> out <- data.frame(INTERVIEW_DAY = names(Table_WEIGHT), Freq = as.numeric(Table_WEIGHT),
+                   Prop = as.numeric(Prop), Cum = as.numeric(Cum))
> out
  INTERVIEW_DAY        Freq      Prop       Cum
1             1 11803438268 0.1435397  14.35397
2             2 12166729888 0.1479576  29.14973
3             3 11586059070 0.1408962  43.23935
4             4 11573379591 0.1407420  57.31355
5             5 11672116808 0.1419427  71.50782
6             6 11755579310 0.1429577  85.80359
7             7 11673877965 0.1419641 100.00000

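Nevertheless, the frequencies are still not what I expect, because this uses the raw sum of the second variable as the weight rather than the "analytic weight" defined above.

The desired table should be: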

 (mean) |
interview_d |
         ay |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 | 2,974.1424       14.35       14.35
          2 | 3,065.6819       14.80       29.15
          3 | 2,919.3688       14.09       43.24
          4 |2,916.17392       14.07       57.31
          5 |2,941.05299       14.19       71.51
          6 | 2,962.0832       14.30       85.80
          7 | 2,941.4968       14.20      100.00
------------+-----------------------------------
      Total |     20,720      100.00

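Note that the frequencies ("Freq.") are completely different, while the percentages and cumulative percentages are the same.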
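
As a quick check of how the two tables relate, the sketch below multiplies the proportions printed in the R output by the number of observations (20,720, the Total row of the Stata table). The variable names `prop`, `n_obs` and `freq_rescaled` are illustrative, and the proportions are the rounded values copied from the printout, so the result can only be expected to agree with the Stata frequencies up to rounding.

```r
# Proportions copied from the printed R output in the question.
prop <- c(0.1435397, 0.1479576, 0.1408962, 0.1407420,
          0.1419427, 0.1429577, 0.1419641)

# Number of observations, taken from the Total row of the Stata table.
n_obs <- 20720

# Frequencies on the rescaled scale, where the weights sum to n_obs.
freq_rescaled <- prop * n_obs
round(freq_rescaled, 2)
```

This suggests that the difference between the two tables is only a matter of scale: the raw weighted sums divided by the total weight and multiplied by the number of observations.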

Here is a sample of the two variables, the interview date (INTERVIEW_DATE) and the weight (WEIGHT), which were not shown in the original post.

> timeuse_2003$INTERVIEW_DATE[1:15]
 [1] "2003-01-03" "2003-01-04" "2003-01-04" "2003-01-02" "2003-01-09" "2003-01-02" "2003-01-06"
 [8] "2003-01-07" "2003-01-04" "2003-01-09" "2003-01-04" "2003-01-05" "2003-01-04" "2003-01-01"
[15] "2003-01-04"


> timeuse_2003$WEIGHT[1:15]
 [1] 8155462.7 1735322.5 3830527.5 6622023.0 3068387.3 3455424.9 1637826.3 6574426.8 1528296.3
[10] 4277052.8 1961482.3  505227.2 2135476.8 5366309.3 1058351.1

I would appreciate any contribution.

【Comments】:

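Welcome to Stack Overflow! Could you provide a minimal, reproducible example of your data?
Hi @M--, thanks for your reply. I will update the question with a sample of the variables. The dataset has 69 other variables and 20720 observations. Do you need anything else?
I cannot tell what is needed until I can run the code and reproduce the problem. If you follow the link I shared, it explains what is needed and how to make sure you have included everything. Cheers.
This question is actually a continuation of stackoverflow.com/questions/59555243/…, where I got help with the frequency table. Now I am trying to work out how to reproduce Stata's "analytic weight" in R. I will update the question with two tables: the one I have and the one I need.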

Tags: r frequency

【Solution 1】:

What you are asking for can be done as follows:

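library(tidyverse)

# Frequencies typed in from the desired Stata table
a <- tibble(interview_day = 1:7,
            frequency = c(2974.1424, 3065.6819, 2919.3688, 2916.17392, 2941.05299, 2962.0832, 2941.4968)) %>%
  # Share of each day and the running cumulative share
  mutate(percent = frequency/sum(frequency),
         cum_pct = cumsum(percent)) %>%
  # Append a total row holding the column sums of frequency and percent
  bind_rows(t(colSums(.)[2:3]) %>% as.data.frame())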

Here is a solution that uses only base R:

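# Frequencies typed in from the desired Stata table
df <- data.frame(frequency = c(2974.1424, 3065.6819, 2919.3688, 2916.17392, 2941.05299, 2962.0832, 2941.4968))
df$interview_day <- seq_len(nrow(df))
# Share of each day and the running cumulative share
df$percent <- df$frequency/sum(df$frequency)
df$cum_pct <- cumsum(df$percent)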

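【Comments】:

Thank you very much, @Jakub.Novotny. However, I still get Freq values that differ from the desired output I posted in the question. Is there any way to get the desired frequencies in base R, without packages?
I have added a base R solution that needs no additional packages.

【Solution 2】:

I found an inelegant solution based on the Stata help file. I just added one line,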

timeuse_2003$N_WEIGHT <- timeuse_2003$WEIGHT * 20720/ sum(timeuse_2003$WEIGHT)

and kept the rest of the code, using N_WEIGHT in place of WEIGHT:

Table_WEIGHT <- xtabs(N_WEIGHT ~ INTERVIEW_DAY, timeuse_2003)
Prop <- prop.table(Table_WEIGHT)
Cum <- cumsum(100 * Prop / sum(Prop))
Cum
Freq_Table <- data.frame(INTERVIEW_DAY = names(Table_WEIGHT), Freq = as.numeric(Table_WEIGHT),
                  Prop = as.numeric(Prop), Cum = as.numeric(Cum))
Freq_Table

The table is then correct:

> Freq_Table
  INTERVIEW_DAY      Freq       Prop        Cum
1             1 2974.1424 0.14353969  14.353969
2             2 3065.6819 0.14795762  29.149731
3             3 2919.3688 0.14089618  43.239349
4             4 2916.1739 0.14074198  57.313547
5             5 2941.0530 0.14194271  71.507819
6             6 2962.0832 0.14295769  85.803587
7             7 2941.4968 0.14196413 100.000000

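I would be grateful if anyone could shed light on how to replace the number of observations I typed in by hand with one that is computed automatically. This code will be used on different datasets, so I cannot update the number of observations for each one every time. Something like ".N" would be nice!

Thanks!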
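
One possible way to avoid typing the number of observations is to compute it from the data, for example with `nrow(timeuse_2003)` in place of 20720. The sketch below wraps the same steps in a hypothetical helper, `aweight_table()` (the name and arguments are made up for illustration). It counts only rows where both the weight and the grouping variable are present, which is an assumption about how missing values should be treated and may not match Stata in every case. The toy data frame is invented just to show the call.

```r
# Hypothetical helper: weighted frequency table with the weights rescaled
# to sum to the number of observations.
aweight_table <- function(data, weight, group) {
  # Keep rows where both weight and group are present
  # (an assumption about missing-value handling).
  keep <- !is.na(data[[weight]]) & !is.na(data[[group]])
  w <- data[[weight]][keep]
  g <- data[[group]][keep]

  # N is computed from the data instead of being typed in.
  n_obs <- length(w)
  w_scaled <- w * n_obs / sum(w)

  # Sum the rescaled weights within each group.
  freq <- tapply(w_scaled, g, sum)
  prop <- freq / sum(freq)

  data.frame(group = names(freq),
             Freq = as.numeric(freq),
             Prop = as.numeric(prop),
             Cum = 100 * cumsum(as.numeric(prop)),
             row.names = NULL)
}

# Toy data (made up) to show the call.
toy <- data.frame(INTERVIEW_DAY = c(1, 1, 2, 3, 3, 3),
                  WEIGHT = c(10, 30, 20, 5, 15, 20))
aweight_table(toy, "WEIGHT", "INTERVIEW_DAY")
```

With the data from the question, the call would presumably be `aweight_table(timeuse_2003, "WEIGHT", "INTERVIEW_DAY")`, and the Freq column would then sum to the number of rows used.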
