【Question title】: Find ratio of two categorical variables
【Posted】: 2018-03-30 19:42:56
【Question】:

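I have Rank, Status, and a count in a data frame that was created by aggregating a parent data frame. I want to find the ratio/percentage shown below.

That is, for each rank, what percentage/ratio of the total (Complete plus Incomplete) is Incomplete?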

   Rank  Status      `n()`
   <fct> <fct>       <int> <ratio>
 1 A     Incomplete   602  
 2 A     Complete   9443    602/9443
 3 B     Incomplete  1425
 4 B     Complete  10250    ----
 5 C     Incomplete  1347   ----
 6 C     Complete   6487
 7 D     Incomplete  1118
 8 D     Complete   3967
 9 E     Incomplete   715
10 E     Complete   1948

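I tried iterating with sapply() to compute the ratio and store it in another df. Is there a better way?

Alternatively, it would be great if a stacked bar chart could be labeled with the percentage/ratio as above.

The stacked bar chart I tried shows the percentage of the overall total rather than the ratio within each rank.

Thanks.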

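【Comments】:

  • Not exactly an answer, but if you don't aggregate by Status, couldn't you just mutate a new column?
  • @Stephan How would the mean become a ratio? Sorry, I don't understand the answer.

Tags: r plot ggplot2 dplyr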


【Solution 1】:

Using dplyr:

library(dplyr)

df <- data.frame(Rank = c("A", "A", "B", "B", "C", "C", "D", "D", "E", "E"),
             Status = c("Incomplete", "Complete","Incomplete", "Complete",
                        "Incomplete", "Complete","Incomplete", "Complete",
                        "Incomplete", "Complete"),
             Count = c(602, 9443, 1425, 10250, 1347, 6487, 1118, 3967, 715, 1948))

# Ratio
df %>% group_by(Rank) %>% mutate(Ratio = Count/sum(Count))
# A tibble: 10 x 4
# Groups:   Rank [5]
#   Rank  Status      Count  Ratio
#   <fct> <fct>       <dbl>  <dbl>
# 1 A     Incomplete   602. 0.0599
# 2 A     Complete    9443. 0.940 
# 3 B     Incomplete  1425. 0.122 
# 4 B     Complete   10250. 0.878 
# 5 C     Incomplete  1347. 0.172 
# 6 C     Complete    6487. 0.828 
# 7 D     Incomplete  1118. 0.220 
# 8 D     Complete    3967. 0.780 
# 9 E     Incomplete   715. 0.268 
#10 E     Complete    1948. 0.732 

# Percentage
df %>% group_by(Rank) %>% mutate(Percentage = (Count/sum(Count))*100)
# A tibble: 10 x 4
# Groups:   Rank [5]
#   Rank  Status      Count Percentage
#   <fct> <fct>       <dbl>      <dbl>
# 1 A     Incomplete   602.       5.99
# 2 A     Complete    9443.       94.0 
# 3 B     Incomplete  1425.       12.2 
# 4 B     Complete   10250.       87.8 
# 5 C     Incomplete  1347.       17.2 
# 6 C     Complete    6487.       82.8 
# 7 D     Incomplete  1118.       22.0 
# 8 D     Complete    3967.       78.0 
# 9 E     Incomplete   715.       26.8 
#10 E     Complete    1948.       73.2 
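
The question also asks about labeling a stacked bar chart with these per-rank percentages. The following is a minimal sketch of one way that might be done with ggplot2, reusing the same grouped `mutate`. The `Label` column and the `plot_df` and `p` names are assumptions introduced for illustration, not part of the original answer.

```r
library(dplyr)
library(ggplot2)

df <- data.frame(
  Rank   = rep(c("A", "B", "C", "D", "E"), each = 2),
  Status = rep(c("Incomplete", "Complete"), times = 5),
  Count  = c(602, 9443, 1425, 10250, 1347, 6487, 1118, 3967, 715, 1948)
)

# Per-rank percentage plus a text label (Label is a hypothetical helper column)
plot_df <- df %>%
  group_by(Rank) %>%
  mutate(Percentage = Count / sum(Count) * 100,
         Label = sprintf("%.1f%%", Percentage)) %>%
  ungroup()

# geom_col() stacks by default; position_stack(vjust = 0.5) centers
# each label inside its own bar segment
p <- ggplot(plot_df, aes(x = Rank, y = Count, fill = Status)) +
  geom_col() +
  geom_text(aes(label = Label), position = position_stack(vjust = 0.5))

print(p)
```

Because the labels are computed within each rank before plotting, they show the share of that rank's total rather than the share of the overall total.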

【Discussion】:

    【Solution 2】:

    Using dcast from data.table

    Code:

    library('data.table')
    dcast(setDT(df), formula = Rank~Status, value.var = "count")[, ratio := Incomplete / Complete][]
    

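    If a status is duplicated within a rank, for example rank A has two Incomplete rows with counts 602 and 605, the following version handles it by summing the counts first.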

    dcast(setDT(df2)[, .(count = sum(count)), by = .(Rank, Status)],  # sum count by Status and Rank
          formula = Rank~Status, value.var = "count")[, ratio := Incomplete / Complete][]
    

    Output:

    Without duplicate statuses

    #    Rank Complete Incomplete      ratio
    # 1:    A     9443        602 0.06375093
    # 2:    B    10250       1425 0.13902439
    # 3:    C     6487       1347 0.20764606
    # 4:    D     3967       1118 0.28182506
    # 5:    E     1948        715 0.36704312
    

    With duplicate statuses

    #    Rank Complete Incomplete     ratio
    # 1:    A     9443       1207 0.1278195
    # 2:    B    10250       1425 0.1390244
    # 3:    C     6487       1347 0.2076461
    # 4:    D     3967       1118 0.2818251
    # 5:    E     1948        715 0.3670431
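
Note that the ratio in this answer is Incomplete divided by Complete, whereas the dplyr answer divides by the rank total. The short sketch below, using the counts from the question, shows how the two measures relate; the vector names are assumptions chosen for illustration.

```r
# Counts from the question, one value per rank (A to E)
incomplete <- c(A = 602, B = 1425, C = 1347, D = 1118, E = 715)
complete   <- c(A = 9443, B = 10250, C = 6487, D = 3967, E = 1948)

ratio <- incomplete / complete                 # Incomplete relative to Complete
share <- incomplete / (incomplete + complete)  # Incomplete relative to the rank total

# The two measures are linked by share = ratio / (1 + ratio)
all.equal(share, ratio / (1 + ratio))

round(ratio, 4)
round(share, 4)
```

Which of the two is wanted depends on whether the denominator should be the Complete count alone or the total for the rank.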
    

    Data:

    Without duplicate statuses

    df <- read.table(text='Rank Status `n()`
                     1 A     Incomplete   602  
                     2 A     Complete   9443
                     3 B     Incomplete  1425
                     4 B     Complete  10250
                     5 C     Incomplete  1347
                     6 C     Complete   6487
                     7 D     Incomplete  1118
                     8 D     Complete   3967
                     9 E     Incomplete   715
                     10 E     Complete   1948')
    colnames(df)[3] <- 'count'
    

    With duplicate statuses:

    df2 <- read.table(text='Rank Status `n()`
                     1 A     Incomplete   602  
                     2 A     Incomplete   605
                     2.1 A     Complete   9443
                     3 B     Incomplete  1425
                     4 B     Complete  10250
                     5 C     Incomplete  1347
                     6 C     Complete   6487
                     7 D     Incomplete  1118
                     8 D     Complete   3967
                     9 E     Incomplete   715
                     10 E     Complete   1948')
    colnames(df2)[3] <- 'count'
    

    【Discussion】:

      【Solution 3】:

      I have not used the dplyr package, but I think the following logic works. Suppose your data frame is df.

      # create sample data like yours
      p <- c("Incomplete","Complete","Incomplete","Complete","Incomplete","Complete")
      q <- c(602,9443,1425,10250,1347,6487)
      
      # ignoring the ranks
      df <- data.frame("Status" = p,"counts" = q)
      
      # vector of zeros, one per row; numeric() avoids the sample() call,
      # which fails if T is not TRUE
      ratiovector <- numeric(NROW(df))
      kcomp <- which(df$Status == "Complete")
      kincomp <- which(df$Status == "Incomplete")
      # relies on each Incomplete row lining up with its matching Complete row
      ratiovector[kcomp] <- df$counts[kincomp]/df$counts[kcomp]
      dfnew <- cbind(df,"ratio" = ratiovector)
      # print dfnew
      dfnew
      # if you want the ratio as a string, convert it
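
The approach above pairs rows by position, so it depends on the Incomplete and Complete rows appearing in matching order. As a hedged alternative, here is a small base R sketch that keeps the rank and pairs the rows by it using `merge`. The `Rank` column and the `incomplete`, `complete`, and `paired` names are assumptions added for illustration.

```r
df <- data.frame(
  Rank   = rep(c("A", "B", "C"), each = 2),
  Status = rep(c("Incomplete", "Complete"), times = 3),
  counts = c(602, 9443, 1425, 10250, 1347, 6487)
)

# Split into one table per status, keeping the rank as the join key
incomplete <- df[df$Status == "Incomplete", c("Rank", "counts")]
complete   <- df[df$Status == "Complete",   c("Rank", "counts")]

# Join on Rank; the suffixes distinguish the two counts columns
paired <- merge(incomplete, complete, by = "Rank",
                suffixes = c("_incomplete", "_complete"))
paired$ratio <- paired$counts_incomplete / paired$counts_complete
paired
```

Joining on the rank gives the same ratios regardless of row order in the input.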
      

      【Discussion】:

      • I get the error: Error in sample.int(length(x), size, replace, prob) : cannot take a sample larger than the population when 'replace = FALSE'
      【Solution 4】:

      In base R:

      df$ratio <- ave(df$Count,df$Rank,FUN=function(x)x/sum(x))
      #    Rank     Status Count      ratio
      # 1     A Incomplete   602 0.05993031
      # 2     A   Complete  9443 0.94006969
      # 3     B Incomplete  1425 0.12205567
      # 4     B   Complete 10250 0.87794433
      # 5     C Incomplete  1347 0.17194281
      # 6     C   Complete  6487 0.82805719
      # 7     D Incomplete  1118 0.21986234
      # 8     D   Complete  3967 0.78013766
      # 9     E Incomplete   715 0.26849418
      # 10    E   Complete  1948 0.73150582
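
If a wide layout with one row per rank is preferred, the same per-rank shares can likely be obtained with `xtabs` and `prop.table`, both part of base R. This is a sketch under the assumption that the data frame has the `Rank`, `Status`, and `Count` columns used above; `tab` and `shares` are illustrative names.

```r
df <- data.frame(
  Rank   = rep(c("A", "B", "C", "D", "E"), each = 2),
  Status = rep(c("Incomplete", "Complete"), times = 5),
  Count  = c(602, 9443, 1425, 10250, 1347, 6487, 1118, 3967, 715, 1948)
)

# Cross-tabulate the counts: one row per Rank, one column per Status
tab <- xtabs(Count ~ Rank + Status, data = df)

# margin = 1 divides each row by its row total, giving per-rank shares
shares <- prop.table(tab, margin = 1)
round(shares, 4)
```

Each row of `shares` sums to 1, so the Incomplete column matches the ratio that `ave` computes for the Incomplete rows.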
      

      【Discussion】:
