[Question title]: Factor ordering with forcats
[Posted]: 2021-08-19 02:06:42
[Question]:

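I have data that I want to bin and convert to factors. However, I'm having trouble understanding how the factor variables behave. I'm trying to order a factor variable according to a continuous variable.

I've read up on it, but every example I've seen contains only one instance of each factor level, whereas my data contains multiple instances of several levels.

Here is the sample data: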

df <- structure(list(Group = c("Grp1", "Grp1", "Grp1", "Grp1", "Grp1", 
"Grp1", "Grp1", "Grp2", "Grp2", "Grp2", "Grp2", "Grp2"), Ind = c("A", 
"B", "C", "D", "E", "F", "G", "A", "B", "C", "D", "E"), Value = c(0.155903329567489, 
0.0582906870761889, 0.180600101489814, 0.26357423622443, 0.0637832368895064, 
0.213803701918138, 0.0640447068344333, 0.333501508730367, 0.160676738803951, 
0.279178514111584, 0.145767023637501, 0.0808762147165962)), row.names = c(NA, 
-12L), class = c("tbl_df", "tbl", "data.frame"))

From this data, I created a factor and checked the order of each element.

library(dplyr)
library(forcats)
library(ggplot2)  # cut_interval() comes from ggplot2
df %>% 
  group_by(Group) %>% 
  mutate(Bin = cut_interval(Value, n = nrow(.))) %>% 
  mutate(Order = labels(Bin)) %>% 
  ungroup()

# A tibble: 12 x 5
   Group Ind    Value Bin             Order
   <chr> <chr>  <dbl> <fct>           <chr>
 1 Grp1  A     0.156  (0.144,0.161]   1    
 2 Grp1  B     0.0583 [0.0583,0.0754] 2    
 3 Grp1  C     0.181  (0.178,0.195]   3    
 4 Grp1  D     0.264  (0.246,0.264]   4    
 5 Grp1  E     0.0638 [0.0583,0.0754] 5    
 6 Grp1  F     0.214  (0.212,0.229]   6    
 7 Grp1  G     0.0640 [0.0583,0.0754] 7    
 8 Grp2  A     0.334  (0.312,0.334]   1    
 9 Grp2  B     0.161  (0.144,0.165]   2    
10 Grp2  C     0.279  (0.27,0.291]    3    
11 Grp2  D     0.146  (0.144,0.165]   4    
12 Grp2  E     0.0809 [0.0809,0.102]  5

Then, after creating it, I tried to reorder the factor by "Value", but the order doesn't seem to change.

df %>% 
  group_by(Group) %>% 
  mutate(Bin = cut_interval(Value, n = nrow(.)), 
         Bin = fct_reorder(Bin, Value)) %>% 
  mutate(Order = labels(Bin)) %>% 
  ungroup()

# A tibble: 12 x 5
   Group Ind    Value Bin             Order
   <chr> <chr>  <dbl> <fct>           <chr>
 1 Grp1  A     0.156  (0.144,0.161]   1    
 2 Grp1  B     0.0583 [0.0583,0.0754] 2    
 3 Grp1  C     0.181  (0.178,0.195]   3    
 4 Grp1  D     0.264  (0.246,0.264]   4    
 5 Grp1  E     0.0638 [0.0583,0.0754] 5    
 6 Grp1  F     0.214  (0.212,0.229]   6    
 7 Grp1  G     0.0640 [0.0583,0.0754] 7    
 8 Grp2  A     0.334  (0.312,0.334]   1    
 9 Grp2  B     0.161  (0.144,0.165]   2    
10 Grp2  C     0.279  (0.27,0.291]    3    
11 Grp2  D     0.146  (0.144,0.165]   4    
12 Grp2  E     0.0809 [0.0809,0.102]  5 

Then I arranged the data by "Value" before creating the factor, and got the correct order.

df %>% 
  arrange(Group, Value) %>% 
  group_by(Group) %>% 
  mutate(Bin = cut_interval(Value, n = nrow(.))) %>% 
  mutate(Order = labels(Bin)) %>% 
  ungroup()

# A tibble: 12 x 5
   Group Ind    Value Bin             Order
   <chr> <chr>  <dbl> <fct>           <chr>
 1 Grp1  B     0.0583 [0.0583,0.0754] 1    
 2 Grp1  E     0.0638 [0.0583,0.0754] 2    
 3 Grp1  G     0.0640 [0.0583,0.0754] 3    
 4 Grp1  A     0.156  (0.144,0.161]   4    
 5 Grp1  C     0.181  (0.178,0.195]   5    
 6 Grp1  F     0.214  (0.212,0.229]   6    
 7 Grp1  D     0.264  (0.246,0.264]   7    
 8 Grp2  E     0.0809 [0.0809,0.102]  1    
 9 Grp2  D     0.146  (0.144,0.165]   2    
10 Grp2  B     0.161  (0.144,0.165]   3    
11 Grp2  C     0.279  (0.27,0.291]    4    
12 Grp2  A     0.334  (0.312,0.334]   5

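So first, why doesn't fct_reorder do what I want? Second, why are there 7 values in "Grp1" and 5 values in "Grp2"? Given the repeated "Bin" values in each group, shouldn't there be only 5 and 4, respectively?

[Comments]:

    Tags: r forcats


    [Solution 1]: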

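    What gets reordered is the levels, not the rows. According to ?fct_reorder:

    .x, .y - The levels of f are reordered so that the values of .fun(.x) (for fct_reorder()) and fun(.x, .y) (for fct_reorder2()) are in ascending order.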
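To make the quoted behaviour concrete, here is a minimal sketch on made-up toy data (the vectors f and x are hypothetical, not taken from the question). It assumes the default summary function of fct_reorder(), which is the median, and shows that the elements stay where they are while only the levels and the integer codes change.

```r
library(forcats)

# Toy data (hypothetical): three levels, some of them repeated
f <- factor(c("b", "a", "c", "a", "b"))
x <- c(5, 1, 3, 2, 6)

# Reorder the levels by the per-level median of x (the default .fun)
g <- fct_reorder(f, x)

levels(f)        # "a" "b" "c"
levels(g)        # "a" "c" "b": medians are a = 1.5, c = 3, b = 5.5
as.character(g)  # "b" "a" "c" "a" "b": the element order is unchanged
as.integer(g)    # 3 1 2 1 3: the codes follow the new level order
```

So a printed tibble looks identical before and after fct_reorder(); the change only becomes visible through levels(), as.integer(), or anything that sorts or plots by the factor.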

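    After arranging by Bin, create "Order" by converting to integer once the unused levels have been dropped (droplevels):

    library(dplyr)
    library(forcats)
    library(ggplot2)  # cut_interval() comes from ggplot2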
    out <- df %>% 
      group_by(Group) %>% 
      mutate(Bin = cut_interval(Value, n = nrow(.)), 
             Bin = fct_reorder(Bin, Value)) %>% 
      arrange(as.integer(Bin)) %>%
      mutate(Order = as.integer(droplevels(Bin))) %>%
      ungroup
    out
    # A tibble: 12 x 5
       Group Ind    Value Bin             Order
       <chr> <chr>  <dbl> <fct>           <int>
     1 Grp1  B     0.0583 [0.0583,0.0754]     1
     2 Grp1  E     0.0638 [0.0583,0.0754]     1
     3 Grp1  G     0.0640 [0.0583,0.0754]     1
     4 Grp1  A     0.156  (0.144,0.161]       2
     5 Grp1  C     0.181  (0.178,0.195]       3
     6 Grp1  F     0.214  (0.212,0.229]       4
     7 Grp1  D     0.264  (0.246,0.264]       5
     8 Grp2  E     0.0809 [0.0809,0.102]      1
     9 Grp2  B     0.161  (0.144,0.165]       2
    10 Grp2  D     0.146  (0.144,0.165]       2
    11 Grp2  C     0.279  (0.27,0.291]        3
    12 Grp2  A     0.334  (0.312,0.334]       4
    

    Or use match with unique:

     df %>% 
      group_by(Group) %>% 
      mutate(Bin = cut_interval(Value, n = nrow(.)), 
             Bin = fct_reorder(Bin, Value)) %>% 
      arrange(as.integer(Bin))  %>% mutate(Order = match(Bin, unique(Bin))) %>%
      ungroup
    # A tibble: 12 x 5
       Group Ind    Value Bin             Order
       <chr> <chr>  <dbl> <fct>           <int>
     1 Grp1  B     0.0583 [0.0583,0.0754]     1
     2 Grp1  E     0.0638 [0.0583,0.0754]     1
     3 Grp1  G     0.0640 [0.0583,0.0754]     1
     4 Grp1  A     0.156  (0.144,0.161]       2
     5 Grp1  C     0.181  (0.178,0.195]       3
     6 Grp1  F     0.214  (0.212,0.229]       4
     7 Grp1  D     0.264  (0.246,0.264]       5
     8 Grp2  E     0.0809 [0.0809,0.102]      1
     9 Grp2  B     0.161  (0.144,0.165]       2
    10 Grp2  D     0.146  (0.144,0.165]       2
    11 Grp2  C     0.279  (0.27,0.291]        3
    12 Grp2  A     0.334  (0.312,0.334]       4
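Why the two variants above agree can be seen on a small stand-in factor. This is only a sketch: the factor f below is hypothetical and is assumed to be already sorted by its level codes, as the data is after arrange(as.integer(Bin)).

```r
# Hypothetical factor with unused levels "b" and "d", already in level order
f <- factor(c("a", "a", "c", "e"), levels = c("a", "b", "c", "d", "e"))

as.integer(f)              # 1 1 3 5: raw codes skip over the unused levels
as.integer(droplevels(f))  # 1 1 2 3: consecutive once unused levels are dropped
match(f, unique(f))        # 1 1 2 3: position of first appearance, levels ignored
```

Note that match() ranks by first appearance, so it gives the intended order only when the rows have been arranged first.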
    

    As for fct_reorder appearing to do nothing, check the levels before and after that step:

    > tmp <-  df %>% 
      group_by(Group) %>% 
      mutate(Bin = cut_interval(Value, n = nrow(.)))
    > tmp %>% pull(Bin) %>% levels
     [1] "[0.0583,0.0754]" "(0.0754,0.0925]" "(0.0925,0.11]"   "(0.11,0.127]"    "(0.127,0.144]"   "(0.144,0.161]"   "(0.161,0.178]"   "(0.178,0.195]"   "(0.195,0.212]"  
    [10] "(0.212,0.229]"   "(0.229,0.246]"   "(0.246,0.264]"   "[0.0809,0.102]"  "(0.102,0.123]"   "(0.123,0.144]"   "(0.144,0.165]"   "(0.165,0.186]"   "(0.186,0.207]"  
    [19] "(0.207,0.228]"   "(0.228,0.249]"   "(0.249,0.27]"    "(0.27,0.291]"    "(0.291,0.312]"   "(0.312,0.334]"  
    > tmp %>% mutate(Bin = fct_reorder(Bin, Value))  %>% pull(Bin) %>% levels
     [1] "[0.0583,0.0754]" "(0.144,0.161]"   "(0.178,0.195]"   "(0.212,0.229]"   "(0.246,0.264]"   "(0.0754,0.0925]" "(0.0925,0.11]"   "(0.11,0.127]"    "(0.127,0.144]"  
    [10] "(0.161,0.178]"   "(0.195,0.212]"   "(0.229,0.246]"   "[0.0809,0.102]"  "(0.102,0.123]"   "(0.123,0.144]"   "(0.144,0.165]"   "(0.165,0.186]"   "(0.186,0.207]"  
    [19] "(0.207,0.228]"   "(0.228,0.249]"   "(0.249,0.27]"    "(0.27,0.291]"    "(0.291,0.312]"   "(0.312,0.334]"  
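The listing above has 12 intervals per group, and the question's Order column runs 1 to 7 and 1 to 5. Both seem to follow from two details of the original code, sketched here on hypothetical toy data: labels() on an unnamed factor falls back to the default method and returns element positions rather than level codes, and nrow(.) inside a grouped mutate() refers to the whole piped data frame, whereas n() is the size of the current group. Whether n() is the bin count the asker actually wants is an assumption.

```r
library(dplyr)

# labels() has no factor-specific method: for an unnamed vector the default
# method returns the positions 1..length as character strings
f <- factor(c("lo", "hi", "lo"))
labels(f)       # "1" "2" "3"
as.integer(f)   # 2 1 2: level codes (levels sort as "hi", "lo")

# In a grouped mutate(), `.` is the whole data frame coming down the pipe,
# while n() is the number of rows in the current group
toy <- tibble(g = c("x", "x", "x", "y", "y"))  # hypothetical data
sizes <- toy %>%
  group_by(g) %>%
  mutate(whole = nrow(.), per_group = n()) %>%
  ungroup()

sizes$whole      # 5 5 5 5 5
sizes$per_group  # 3 3 3 2 2
```

Under that reading, the 7 and 5 in the question are simply the group sizes, and as.integer(droplevels(Bin)) gives the 5 and 4 distinct bins that were expected.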
    

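    [Discussion]:

    • @hmhensen The unused levels make this a bit hard to follow. If you use as.integer directly without dropping them, you get values like 26 or 28. match does not look at the levels at all; it looks at the already arranged values and their unique elements.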