【Question title】: Grouping Column Values in R (table function)
【Posted】: 2021-12-11 01:01:39
【Question】:

I have the following table in R:

[table image]

What I would like to do is group some of the education-level values together:

1) group 102-111 --> less than 9th grade
2) group 113-116 --> 9-12th grade no hs degree
3) 201 --> high school diploma
4) 301 --> some college no degree
5) 302-303 --> associate degree
6) 400 --> bachelor degree
7) 501 --> master degree
8) 502 --> professional degree
9) 503 --> doctorate degree (PhD)

How can I do this? Thanks.

dput output from R:

structure(c(99, 500, 31, 44, 64, 68, 100, 312, 147, 405, 444, 
514, 692, 624, 7055, 986, 6260, 2235, 1761, 6732, 3212, 439, 
581, 33305, 39, 207, 10, 21, 28, 18, 33, 120, 51, 178, 211, 267, 
320, 214, 2088, 487, 2071, 636, 477, 1213, 493, 71, 76, 9329, 
65, 402, 14, 28, 50, 27, 45, 151, 79, 209, 316, 367, 437, 354, 
4340, 748, 4186, 1440, 1155, 3824, 1671, 253, 303, 20464, 203, 
1109, 55, 93, 142, 113, 178, 583, 277, 792, 971, 1148, 1449, 
1192, 13483, 2221, 12517, 4311, 3393, 11769, 5376, 763, 960, 
63098), .Dim = c(24L, 4L), .Dimnames = list(EDUC = c("102", "103", 
"104", "105", "106", "107", "108", "109", "110", "111", "113", 
"114", "115", "116", "201", "202", "301", "302", "303", "400", 
"501", "502", "503", "Sum"), DEPFEELEVL = c("1", "2", "3", "Sum"
)), class = c("table", "matrix", "array"))

I want similar elements grouped together, not just renamed. This is what I have after renaming:

                           DEPFEELEVL
EDUC                            1     2     3   Sum
  less than 9th grade          99    39    65   203
  less than 9th grade         500   207   402  1109
  less than 9th grade          31    10    14    55
  less than 9th grade          44    21    28    93
  less than 9th grade          64    28    50   142
  less than 9th grade          68    18    27   113
  less than 9th grade         100    33    45   178
  less than 9th grade         312   120   151   583
  less than 9th grade         147    51    79   277
  less than 9th grade         405   178   209   792
  9-12th grade no hs degree   444   211   316   971
  9-12th grade no hs degree   514   267   367  1148
  9-12th grade no hs degree   692   320   437  1449
  9-12th grade no hs degree   624   214   354  1192
  high school diploma        7055  2088  4340 13483
  doctorate degree (PhD)      986   487   748  2221
  some college no degree     6260  2071  4186 12517
  doctorate degree (PhD)     2235   636  1440  4311
  doctorate degree (PhD)     1761   477  1155  3393
  bachelor degree            6732  1213  3824 11769
  master degree              3212   493  1671  5376
  professional degree         439    71   253   763
  doctorate degree (PhD)      581    76   303   960
  Sum                       33305  9329 20464 63098

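So, for example, consider respondents with a 9-12th grade education and, for simplicity, depression level 1.

The table should show

444+514+692+624 = 2274

for the particular cell I used as an example.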

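【Question comments】:

  • Yes, I really can't do that, because the data comes from IPUMS; even after cleaning it I have 63098 data points, and most of the dput() output doesn't show in the console because it is so large.
  • Does the link not work? I would post the raw data too, but I don't have enough reputation to show it as a full image.
  • Oh, I see I pasted the table data; not sure if that helps.
  • Use dput(addmargins(table1)) and post what R returns. That way others can copy and paste your table.
  • Yes, it's the same thing. See here

Tags: r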


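【Solution 1】:

Set up the groups as a vector of break points. Each interval is open on the left and closed on the right, (a, b], and each name labels the interval that ends at its value.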

groups <- 
  c('na' = 0, 
    'less than 9th' = 111, 
    'no hs degree' = 116, 
    'hs diploma' = 201, 
    'some college' = 301, 
    'associate' = 303, 
    'bachelor' = 400,
    'master' = 501,
    'professional' = 502,
    'PhD' = 503)
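
To make the (a, b] convention concrete, here is a minimal sketch on a hypothetical two-interval subset of the breaks (the names breaks, codes and idx are illustrative and not part of the answer). It shows why the lookup further down adds 1 to the index returned by cut(): interval i ends at break i + 1, and that break carries the label.

```r
# Hypothetical mini-example: three breaks define two right-closed intervals,
# (0, 111] and (111, 116].
breaks <- c(na = 0, `less than 9th` = 111, `no hs degree` = 116)
codes  <- c(102, 111, 113, 116)

# labels = FALSE makes cut() return the interval index instead of a factor.
# 111 falls in interval 1 and 116 in interval 2 because the right end is closed.
idx <- cut(codes, breaks, labels = FALSE)
idx
# 1 1 2 2

# Shift by one to pick the name attached to each interval's upper bound.
names(breaks)[1 + idx]
# "less than 9th" "less than 9th" "no hs degree" "no hs degree"
```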
    

Start from the provided data, tbl. Convert it to a tibble (data.frame), define the grouping variable with the groups vector above, and summarise.

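library(dplyr, warn.conflicts = FALSE)
library(tidyr)

tbl %>% 
  as_tibble() %>%                        # long format: EDUC, DEPFEELEVL, n
  pivot_wider(id_cols = EDUC, names_from = DEPFEELEVL, values_from = n, 
              names_prefix = 'l') %>%    # one column per level: l1, l2, l3, lSum
  mutate(EDUC = as.numeric(EDUC)) %>%    # the "Sum" row becomes NA (hence the warning)
  group_by(educ =  
    names(groups)[1 + cut(as.numeric(EDUC), groups, labels = FALSE)]
  ) %>% 
  summarise(across(everything(), sum))   # note: this also sums the numeric EDUC codes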
#> Warning in mask$eval_all_mutate(quo): NAs introduced by coercion
#> # A tibble: 10 × 6
#>    educ           EDUC    l1    l2    l3  lSum
#>    <chr>         <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1 associate       605  3996  1113  2595  7704
#>  2 bachelor        400  6732  1213  3824 11769
#>  3 hs diploma      201  7055  2088  4340 13483
#>  4 less than 9th  1065  1770   705  1070  3545
#>  5 master          501  3212   493  1671  5376
#>  6 no hs degree    458  2274  1012  1474  4760
#>  7 PhD             503   581    76   303   960
#>  8 professional    502   439    71   253   763
#>  9 some college    503  7246  2558  4934 14738
#> 10 <NA>             NA 33305  9329 20464 63098

Created on 2021-12-10 by the reprex package (v2.0.1)

【Comments】:

    【Solution 2】:

    You can recode these values with nested ifelse calls. See here:

    x <- as.numeric(rownames(df))  # "Sum" becomes NA (with a coercion warning)
    rownames(df) <- ifelse(is.na(x), "Sum",
      ifelse(x >= 102 & x <= 111, "less than 9th grade",
      ifelse(x >= 113 & x <= 116, "9-12th grade no hs degree",
      ifelse(x == 201, "high school diploma",
      ifelse(x == 301, "some college no degree",
      ifelse(x >= 302 & x <= 303, "associate degree",
      ifelse(x == 400, "bachelor degree",
      ifelse(x == 501, "master degree",
      ifelse(x == 502, "professional degree",
             # everything else, including code 202, falls through to this label
             "doctorate degree (PhD)")))))))))
    

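    So many nested ifelse calls are a mess. I wrote a small function to avoid this, which you can see here, but you would have to adapt it to your case. The data I used comes from your question: use df <- structure(c(99, 500, ...))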
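
The linked function is not reproduced on this page, so what follows is not the answerer's code. It is a minimal sketch of one common alternative to nested ifelse: a named lookup vector. The names lookup and recode_educ are hypothetical. The mapping is the one listed in the question, which does not mention code 202, so that code is left unmapped here rather than guessed.

```r
# Lookup table: EDUC code (as a character name) -> label, from the question's list.
# Code 202 is not covered by that list, so it is deliberately absent.
lookup <- c(
  setNames(rep("less than 9th grade", 10), 102:111),
  setNames(rep("9-12th grade no hs degree", 4), 113:116),
  "201" = "high school diploma",
  "301" = "some college no degree",
  "302" = "associate degree",
  "303" = "associate degree",
  "400" = "bachelor degree",
  "501" = "master degree",
  "502" = "professional degree",
  "503" = "doctorate degree (PhD)"
)

# Hypothetical helper: look codes up by name; anything unmapped gets `other`.
recode_educ <- function(codes, lookup, other = NA_character_) {
  out <- unname(lookup[as.character(codes)])
  out[is.na(out)] <- other
  out
}

recoded <- recode_educ(c("102", "114", "303", "202", "Sum"), lookup)
recoded
# "less than 9th grade" "9-12th grade no hs degree" "associate degree" NA NA
```

Unmapped codes come back as NA instead of silently falling into the last branch, which makes gaps in the mapping easy to spot.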

    For the sum within each group, use

    df_new <- data.frame(degree= rownames (df), sum= df[ , "Sum"])
    aggregate(sum ~ degree, df_new, sum)
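
The aggregate call above totals only the Sum column. If every depression-level column should be grouped at once, base R's rowsum() is one option. This is a minimal sketch on a small stand-in matrix (the name m is hypothetical; the counts are the 113-116 and 201 rows for levels 1 and 2 from the question), assuming the row names have already been relabelled as above.

```r
# Small stand-in for the relabelled table: duplicate row names mark the groups.
m <- matrix(c(444, 514, 692, 624, 7055,
              211, 267, 320, 214, 2088),
            ncol = 2,
            dimnames = list(
              EDUC = c(rep("9-12th grade no hs degree", 4), "high school diploma"),
              DEPFEELEVL = c("1", "2")))

# rowsum() adds up the rows of a numeric matrix within each group,
# column by column; reorder = FALSE keeps groups in order of first appearance.
grouped <- rowsum(m, group = rownames(m), reorder = FALSE)

grouped["9-12th grade no hs degree", "1"]
# 2274, the cell worked out by hand in the question
```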
    

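    【Comments】:

    • OK, thank you. But more specifically, I mean grouping similar elements together, not just renaming them. When I run your function I get the education-level values, but I also want to sum all the elements that belong to the same category. I gave an example above of what I mean.
    • Yes, that is why I said grouping them together would be good. Thanks anyway, I found a resource online.
    • @raduaelxe See the update
    【Solution 3】:

    Here is a possible idea using base R.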

    x <- as.numeric(rownames(tbl))
    x <- 
      cut(x, breaks = c(0, 111, 116, 201, 301, 303, 400, 501, 502, 503),
             labels = c("less than 9th grade",
                        "9-12th grade no hs degree",
                        "high school diploma",
                        "some college no degree",
                        "associate degree",
                        "bachelor degree",
                        "master degree",
                        "professional degree",
                        "doctorate degree (PhD)"))
    x <- as.character(x)
    x[is.na(x)] <- "Sum"
    rownames(tbl) <- x
    df <- aggregate(Freq ~ EDUC + DEPFEELEVL, data = tbl, FUN = sum)
    df <- reshape(data = df, idvar = "EDUC", timevar = "DEPFEELEVL", direction = "wide")
    names(df) <- c("EDUC", "DEPFEELEVL_1", "DEPFEELEVL_2", "DEPFEELEVL_3", "DEPFEELEVL_Sum")
    

    The output looks like this:

    > df
                            EDUC DEPFEELEVL_1 DEPFEELEVL_2 DEPFEELEVL_3 DEPFEELEVL_Sum
    1        less than 9th grade         1770          705         1070           3545
    2  9-12th grade no hs degree         2274         1012         1474           4760
    3        high school diploma         7055         2088         4340          13483
    4     some college no degree         7246         2558         4934          14738
    5           associate degree         3996         1113         2595           7704
    6            bachelor degree         6732         1213         3824          11769
    7              master degree         3212          493         1671           5376
    8        professional degree          439           71          253            763
    9     doctorate degree (PhD)          581           76          303            960
    10                       Sum        33305         9329        20464          63098
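
One detail worth checking in this approach: code 202 appears in the data but not in the question's mapping. The sketch below reuses the same breaks and labels (the variable names are illustrative) to show where cut() places it. Because the intervals are right-closed, 202 falls in (201, 301], which is why the "some college no degree" row shows 7246 = 986 + 6260 in the first column.

```r
breaks <- c(0, 111, 116, 201, 301, 303, 400, 501, 502, 503)
labels <- c("less than 9th grade",
            "9-12th grade no hs degree",
            "high school diploma",
            "some college no degree",
            "associate degree",
            "bachelor degree",
            "master degree",
            "professional degree",
            "doctorate degree (PhD)")

# 201 closes the third interval; 202 and 301 both land in (201, 301].
placed <- as.character(cut(c(201, 202, 301), breaks = breaks, labels = labels))
placed
# "high school diploma" "some college no degree" "some college no degree"
```

If 202 belongs in a different category, adding a break at 202 (with a matching label) would separate it.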
    

    【Comments】:
