Can R produce frequency tables for one variable? Answers

【Question title】: Can R produce frequency tables for one variable?
【Posted】: 2020-06-12 18:21:18
【Question】:

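Disclaimer: I am new to both Stack Overflow and R.

I have a data set called "house" with multiple columns. I am trying to get a separate frequency table for each column, without the columns being crossed with one another. For example, I am trying to get the total number of houses with and without a pool. It is a bit like proc freq in SAS, where you do: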

proc freq data = house;
   tables pool backyard park_near / missing list;
run;

but without any of the variables interacting with each other.

I am using the following code in R:

freq_2 <- freqlist(table(house[("Pool")], useNA = "ifany"))
print.noquote(head(as.data.frame(freq_2), n=100L))

However, I get:

  Var1 Freq cumFreq freqPercent cumPercent
1    N   64      64    88.88889   88.88889
2    Y    8      72    11.11111  100.00000

Is there any way I can get "Pool" instead of "Var1"? And is there an easier way to do this in R?

Thanks in advance for your help.

【Question comments】:

Tags: r

【Solution 1】:

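Base R with `arsenal::freqlist()`

The code in the question uses arsenal::freqlist() to replicate the output of SAS PROC FREQ. Unfortunately, when the freqlist() result is printed as a data frame, the first column is always labelled Var1, regardless of which variable is being tabulated.

A very simple workaround is to rename the column with colnames() before printing. We can combine this with lapply() to create frequency tables for multiple columns of a data frame.

Since the OP did not include a reproducible example, here is one that combines lapply() with colnames() to rename the Var1 column of several frequency tables, using data from the mtcars data set.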

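library(arsenal)
# x is a column name; y is the data frame, passed as an extra argument to lapply()
lapply(c("cyl","am","carb"),function(x,y){
     freqs <- freqlist(table(y[x],useNA = "ifany"))
     freq_df <- as.data.frame(freqs)
     # replace the generic first-column header with the variable name
     colnames(freq_df)[1] <- x
     freq_df
},mtcars)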

...and the output:

[[1]]
  cyl Freq cumFreq freqPercent cumPercent
1   4   11      11      34.375     34.375
2   6    7      18      21.875     56.250
3   8   14      32      43.750    100.000

[[2]]
  am Freq cumFreq freqPercent cumPercent
1  0   19      19      59.375     59.375
2  1   13      32      40.625    100.000

[[3]]
  carb Freq cumFreq freqPercent cumPercent
1    1    7       7      21.875     21.875
2    2   10      17      31.250     53.125
3    3    3      20       9.375     62.500
4    4   10      30      31.250     93.750
5    6    1      31       3.125     96.875
6    8    1      32       3.125    100.000
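
For readers who would rather not depend on arsenal, the same five columns can be assembled with base R alone. The following is a minimal sketch and not part of the approach above: freq_table() is a hypothetical helper name, and mtcars again stands in for the house data, which was not posted.

```r
# Hypothetical helper (not an arsenal function): a one-variable frequency
# table built with base R only, with columns named like the freqlist() output.
freq_table <- function(data, var) {
  # useNA = "ifany" adds a row for missing values, like SAS's "missing" option
  counts <- table(data[[var]], useNA = "ifany")
  out <- data.frame(
    value = names(counts),      # category labels, stored as character
    Freq = as.vector(counts),   # raw counts
    stringsAsFactors = FALSE
  )
  out$cumFreq <- cumsum(out$Freq)
  out$freqPercent <- 100 * out$Freq / sum(out$Freq)
  out$cumPercent <- cumsum(out$freqPercent)
  names(out)[1] <- var          # header is the variable name, not Var1
  out
}

freq_cyl <- freq_table(mtcars, "cyl")
print(freq_cyl)

# One table per variable, returned as a named list
freq_all <- sapply(c("cyl", "am", "carb"), freq_table, data = mtcars,
                   simplify = FALSE)
```

Because the variable name is written into the first column header inside the helper, no renaming step is needed afterwards. Note that the category labels come back as character strings rather than numbers.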

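tidyverse solution

We can also produce output that is essentially the same as freqlist() with a combination of dplyr and tidyr. First, we select the columns we want to tabulate, then we convert the data to narrow-format tidy data. Next, we summarise() to count each value of each variable, and compute the cumulative frequencies and percentages.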

library(dplyr)
library(tidyr)

mtcars %>% mutate(model = rownames(.)) %>% 
     group_by(model) %>% select(model,cyl,carb,am) %>% 
     pivot_longer(.,-model,names_to = "variable",values_to = "value") %>% 
     mutate(count = 1) %>% group_by(variable,value) %>%
     summarise(freq = sum(count)) %>% group_by(variable) %>%
     mutate(cumFreq = cumsum(freq),
               pct = freq / sum(freq) * 100,
            cumPct = cumsum(pct)) -> freqData

We use filter() to print the rows for each variable.

> freqData %>% filter(variable == "am")
# A tibble: 2 x 6
# Groups:   variable [1]
  variable value  freq cumFreq   pct cumPct
  <chr>    <dbl> <dbl>   <dbl> <dbl>  <dbl>
1 am           0    19      19  59.4   59.4
2 am           1    13      32  40.6  100  
> freqData %>% filter(variable == "cyl")
# A tibble: 3 x 6
# Groups:   variable [1]
  variable value  freq cumFreq   pct cumPct
  <chr>    <dbl> <dbl>   <dbl> <dbl>  <dbl>
1 cyl          4    11      11  34.4   34.4
2 cyl          6     7      18  21.9   56.2
3 cyl          8    14      32  43.8  100  
> freqData %>% filter(variable == "carb")
# A tibble: 6 x 6
# Groups:   variable [1]
  variable value  freq cumFreq   pct cumPct
  <chr>    <dbl> <dbl>   <dbl> <dbl>  <dbl>
1 carb         1     7       7 21.9    21.9
2 carb         2    10      17 31.2    53.1
3 carb         3     3      20  9.38   62.5
4 carb         4    10      30 31.2    93.8
5 carb         6     1      31  3.12   96.9
6 carb         8     1      32  3.12  100  
>

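With a small adjustment to the code that prints the data, we can drop the variable column and rename the value column to the original variable name. As we did with the arsenal solution, we can use lapply() with a list of categorical variables to automate this.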

lapply(c("am","cyl","carb"),
                    function(x){
                       colnames(freqData)[2] <- x
                       y <- freqData[freqData$variable == x,][,-1]
                       rownames(y) <- NULL
                       y
                    })

...and the output:

[[1]]
  am freq cumFreq    pct  cumPct
1  0   19      19 59.375  59.375
2  1   13      32 40.625 100.000

[[2]]
  cyl freq cumFreq    pct  cumPct
1   4   11      11 34.375  34.375
2   6    7      18 21.875  56.250
3   8   14      32 43.750 100.000

[[3]]
  carb freq cumFreq    pct  cumPct
1    1    7       7 21.875  21.875
2    2   10      17 31.250  53.125
3    3    3      20  9.375  62.500
4    4   10      30 31.250  93.750
5    6    1      31  3.125  96.875
6    8    1      32  3.125 100.000

Finally, we can improve the appearance of the output by bringing in knitr::kable().

library(knitr)
atable <- freqData %>% filter(variable == "am") %>% rename(.,am = value) %>% select(-variable)
kable(atable)

| am| freq| cumFreq|    pct|  cumPct|
|--:|----:|-------:|------:|-------:|
|  0|   19|      19| 59.375|  59.375|
|  1|   13|      32| 40.625| 100.000|

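When rendered in a web browser (or an R Markdown document), the Markdown above displays as a formatted table.

For reference, the SAS equivalent looks like this: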

filename cars "/folders/myfolders/data/mtcars.csv";

data mtcars;
   infile cars dlm="," firstobs = 2;
   input car $ mpg cyl disp hp drat wt qsec vs am gear carb;
   run;

proc freq data = mtcars; 
   tables cyl  am carb / missing list; 
   run;

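【Comments】:

Thanks, this is really helpful! I am about to use the function you presented to get the numbers I need. In the function you created, by using "cyl", are you supplying the function's x variable before the function itself is introduced? And is mtcars then the y variable? I just want to make sure I understand the order of the function and what is happening. Also, is [x] introducing a list?
@Yami2018 - Yes, the first argument of the anonymous function, x, is the variable name, and the second argument, y, is the data frame, which is passed to lapply() as an extra argument. Also, the output of lapply() is a list().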
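
To make the argument passing concrete, here is a small illustrative sketch (not from the original thread) showing what x and y hold on each call, and what y[x] returns. The name res is chosen only for this example.

```r
# lapply() calls the function once per element of the character vector.
# x receives each column name in turn; y receives mtcars on every call,
# because mtcars is supplied after the function as an extra argument.
res <- lapply(c("cyl", "am"), function(x, y) {
  column <- y[x]  # single-bracket indexing keeps a one-column data frame
  c(name = x, class = class(column), rows = nrow(column))
}, mtcars)

print(res)
```

So y[x] is not a new list being introduced: it selects the column named by x from the data frame y, and the result is still a data frame with one column. The list in the output comes from lapply() itself, which returns one element per column name.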

【Solution 2】:

lapply(house, table)

This works if you want the frequency of each unique value in each column.
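
As a hedged illustration of the one-liner, the sketch below uses mtcars in place of house, since the asker's data was not posted. The useNA argument and the prop.table() step are additions for this example: they mimic the missing option and the percentage column of PROC FREQ.

```r
# mtcars stands in for the asker's `house` data (an assumption for this sketch)
cols <- c("cyl", "am", "carb")

# One count table per column; extra arguments after the function name
# (here useNA) are forwarded to table() on every call
counts <- lapply(mtcars[cols], table, useNA = "ifany")

# Optional: convert each count table to percentages
percents <- lapply(counts, function(tab) round(100 * prop.table(tab), 1))

counts$am
percents$am
```

The result is a list named after the columns, so each table can be pulled out with $ or [[ ]].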

【Comments】:
