【Question title】: How to summarize the top n values across multiple columns row wise?
【Posted】: 2022-01-16 13:12:23
【Question description】:

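In my data frame, I have multiple columns containing student grades. I would like to sum the "Quiz" columns (e.g., Quiz1, Quiz2). However, I only want to sum the top 2 values and ignore the others. I want to create a new column with the total (i.e., the sum of the top 2 values). There is also the issue of grades that tie for the top 2 values in a given row. For example, Aaron's highest score is 42, but two scores tie for second highest (i.e., 36).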

Data

df <- 
  structure(
  list(
    Student = c("Aaron", "James", "Charlotte", "Katie", "Olivia", 
                "Timothy", "Grant", "Chloe", "Judy", "Justin"),
    ID = c(30016, 87311, 61755, 55323, 94839, 38209, 34096, 
           98432, 19487, 94029),
    Quiz1 = c(31, 25, 41, 10, 35, 19, 27, 42, 15, 20),
    Quiz2 = c(42, 33, 34, 22, 23, 38, 48, 49, 23, 30),
    Quiz3 = c(36, 36, 34, 32, 43, 38, 44, 42, 42, 37),
    Quiz4 = c(36, 43, 39, 46, 40, 38, 43, 35, 41, 41)
  ),
  row.names = c(NA, -10L),
  class = c("tbl_df", "tbl", "data.frame")
)

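I know that I can do this with pivot_longer, which lets me arrange by group and then take the top 2 values for each student. This works fine, but I feel there should be a more efficient way with tidyverse, rather than pivoting back and forth.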

What I've tried

df %>%
  tidyr::pivot_longer(-c(Student, ID)) %>%
  dplyr::group_by(Student, ID) %>%
  dplyr::arrange(desc(value), .by_group = TRUE) %>%
  dplyr::slice_head(n = 2) %>%
  tidyr::pivot_wider(names_from = name, values_from = value) %>%
  dplyr::ungroup() %>%
  dplyr::mutate(Total = rowSums(select(., starts_with("Quiz")), na.rm = TRUE))

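I also know that if I wanted to sum all of the columns for each row, I could use rowSums, as I did above. However, I am unsure how to apply rowSums to only the top 2 values in the 4 quiz columns.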

Expected output

# A tibble: 10 × 7
   Student      ID Quiz2 Quiz3 Quiz1 Quiz4 Total
   <chr>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 Aaron     30016    42    36    NA    NA    78
 2 Charlotte 61755    NA    NA    41    39    80
 3 Chloe     98432    49    NA    42    NA    91
 4 Grant     34096    48    44    NA    NA    92
 5 James     87311    NA    36    NA    43    79
 6 Judy      19487    NA    42    NA    41    83
 7 Justin    94029    NA    37    NA    41    78
 8 Katie     55323    NA    32    NA    46    78
 9 Olivia    94839    NA    43    NA    40    83
10 Timothy   38209    38    38    NA    NA    76

【Question comments】:

  • If you are OK with collapse, then ftransform(gvr(df, "Student|ID"), dapply(gvr(df, "^Quiz"), MARGIN = 1, FUN = function(x) replace(x, radixorder(radixorder(x)) %in% 1:2, NA))) %>% ftransform(Total = rowSums(gvr(., "^Quiz"), na.rm = TRUE)) is a fast row-wise option

Tags: r tidyverse


【Solution 1】:

Another solution, based on tidyverse:

library(tidyverse)

df %>% 
  rowwise %>% 
  mutate(Quiz = list(c_across(starts_with("Quiz")) * 
   if_else(rank(c_across(starts_with("Quiz")),ties.method="last")>=3,1,NA_real_)),
    across(matches("\\d$"), ~ NULL), total = sum(Quiz, na.rm = T)) %>%
  unnest_wider(Quiz, names_sep = "") 

#> # A tibble: 10 × 7
#>    Student      ID Quiz1 Quiz2 Quiz3 Quiz4 total
#>    <chr>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1 Aaron     30016    NA    42    36    NA    78
#>  2 James     87311    NA    NA    36    43    79
#>  3 Charlotte 61755    41    NA    NA    39    80
#>  4 Katie     55323    NA    NA    32    46    78
#>  5 Olivia    94839    NA    NA    43    40    83
#>  6 Timothy   38209    NA    38    38    NA    76
#>  7 Grant     34096    NA    48    44    NA    92
#>  8 Chloe     98432    42    49    NA    NA    91
#>  9 Judy      19487    NA    NA    42    41    83
#> 10 Justin    94029    NA    NA    37    41    78

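【Comments】:

    【Solution 2】:

    As @akrun suggested above, collapse is another efficient possibility. radixorder provides an integer ordering vector; applying it twice gives each value's rank within its row, so the two lowest-ranked values can be replaced with NA and only the top 2 values per row are kept. rowSums is then used to get the total for each row.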

    library(collapse)
    
    ftransform(gvr(df, "Student|ID"),
               dapply(
                 gvr(df, "^Quiz"),
                 MARGIN = 1,
                 FUN = function(x)
                   replace(x, radixorder(radixorder(x)) %in% 1:2, NA)
               )) %>%
      ftransform(Total = rowSums(gvr(., "^Quiz"), na.rm = TRUE))
    

    Output

    # A tibble: 10 × 7
       Student      ID Quiz1 Quiz2 Quiz3 Quiz4 Total
     * <chr>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
     1 Aaron     30016    NA    42    NA    36    78
     2 James     87311    NA    NA    36    43    79
     3 Charlotte 61755    41    NA    NA    39    80
     4 Katie     55323    NA    NA    32    46    78
     5 Olivia    94839    NA    NA    43    40    83
     6 Timothy   38209    NA    NA    38    38    76
     7 Grant     34096    NA    48    44    NA    92
     8 Chloe     98432    NA    49    42    NA    91
     9 Judy      19487    NA    NA    42    41    83
    10 Justin    94029    NA    NA    37    41    78
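The double-ordering step can be illustrated in base R, where `order(order(x))` plays a role comparable to `radixorder(radixorder(x))`: it yields the rank of each element. This is only a minimal sketch using Aaron's four scores from the question; it assumes ties are broken by position, which is consistent with the output shown above, and the variable names are illustrative.

```r
# Aaron's scores for Quiz1..Quiz4 (taken from the question's data)
x <- c(Quiz1 = 31, Quiz2 = 42, Quiz3 = 36, Quiz4 = 36)

# order(order(x)) gives the rank of each element; ties are broken by
# position, so the first 36 gets rank 2 and the second 36 gets rank 3
ranks <- order(order(x))

# With four quizzes, ranks 1 and 2 are the two lowest scores: blank them out
kept <- replace(x, ranks %in% 1:2, NA)

# Sum what is left, i.e. the top two scores
total <- sum(kept, na.rm = TRUE)
```

Note that hard-coding `1:2` only keeps the top 2 when there are exactly four quiz columns; with k columns and a top-n target, the ranks to blank out would be `seq_len(k - n)`.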
    

    【Comments】:

      【Solution 3】:

      You don't have to do the pivot_wider. Note that the longer format is the tidy format. Just do pivot_longer and left_join:

      df %>% 
        left_join(pivot_longer(., -c(Student, ID)) %>%
        group_by(Student, ID) %>%
        summarise(Total = sum(sort(value, TRUE)[1:2]), .groups = 'drop'))
      
      # A tibble: 10 x 7
         Student      ID Quiz1 Quiz2 Quiz3 Quiz4 Total
         <chr>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
       1 Aaron     30016    31    42    36    36    78
       2 James     87311    25    33    36    43    79
       3 Charlotte 61755    41    34    34    39    80
       4 Katie     55323    10    22    32    46    78
       5 Olivia    94839    35    23    43    40    83
       6 Timothy   38209    19    38    38    38    76
       7 Grant     34096    27    48    44    43    92
       8 Chloe     98432    42    49    42    35    91
       9 Judy      19487    15    23    42    41    83
      10 Justin    94029    20    30    37    41    78
      

      【Comments】:

        【Solution 4】:

        Try this base R approach, which also produces the NAs:

        cbind( df[,1:2], t( sapply( seq_along( 1:nrow( df ) ), function(x){
          ord <- order( unlist( df[x,3:6] ) )[1:2]; arow <- df[x,3:6]; 
          arow[ord] <- NA; ttl <- rowSums( arow[-ord], na.rm=T );
          cbind( arow,Total=ttl ) } ) ) )
        
             Student    ID Quiz1 Quiz2 Quiz3 Quiz4 Total
        1      Aaron 30016    NA    42    NA    36    78
        2      James 87311    NA    NA    36    43    79
        3  Charlotte 61755    41    NA    NA    39    80
        4      Katie 55323    NA    NA    32    46    78
        5     Olivia 94839    NA    NA    43    40    83
        6    Timothy 38209    NA    NA    38    38    76
        7      Grant 34096    NA    48    44    NA    92
        8      Chloe 98432    NA    49    42    NA    91
        9       Judy 19487    NA    NA    42    41    83
        10    Justin 94029    NA    NA    37    41    78
        

        【Comments】:

          【Solution 5】:

          A (slightly messy) base R solution:

          # Store the names of quiz columns as a vector: quiz_colnames => character vector
          quiz_colnames <- grep("Quiz\\d+", names(df), value = TRUE)
          
          # Store the names of the non-quiz columns as a vector: non_quiz_colnames => character vector
          non_quiz_colnames <- names(df)[!(names(df) %in% quiz_colnames)]
          
          # Store an Idx based on the ID: Idx => integer vector:
          Idx <- with(df, as.integer(factor(ID, levels = unique(ID))))
          
          # Split-Apply-Combine to calculate the top 2 quizes: res => data.frame
          res <- data.frame(
            do.call(
              rbind,
              lapply(
                with(
                  df,
                  split(
                    df,
                    Idx 
                  )
                ),
                function(x){
                  # Extract the top 2 quiz vectors: top_2_quizes => named integer vector
                  top_2_quizes <- head(sort(unlist(x[,quiz_colnames]), decreasing = TRUE), 2)
                  # Calculate the quiz columns not used: remainder_quiz_cols => character vector
                  remainder_quiz_cols <- quiz_colnames[!(quiz_colnames %in% names(top_2_quizes))]
                  # Nullify the remaining quizes: x => data.frame 
                  x[, remainder_quiz_cols] <- NA_integer_
                  # Calculate the resulting data.frame: data.frame => env 
                    transform(
                      cbind(
                        x[,non_quiz_colnames],
                        x[,names(top_2_quizes)],
                        x[,remainder_quiz_cols]
                      ),
                      Total = sum(top_2_quizes)
                    )[,c(non_quiz_colnames, "Quiz2", "Quiz3", "Quiz1", "Quiz4", "Total")]
                }
              )
            ),
            row.names = NULL,
            stringsAsFactors = FALSE
          )
          

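          【Comments】:

            【Solution 6】:

            Using base R: select only the quiz result columns, which you can then treat as a matrix. Apply a descending sort to each row, subset the first two elements, then use colSums.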

            df$Total <- colSums(apply(df[grepl("Quiz", names(df))], 1, function(x) sort(x, decreasing = TRUE)[1:2]))
            
            df
            #> # A tibble: 10 × 7
            #>    Student      ID Quiz1 Quiz2 Quiz3 Quiz4 Total
            #>    <chr>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
            #>  1 Aaron     30016    31    42    36    36    78
            #>  2 James     87311    25    33    36    43    79
            #>  3 Charlotte 61755    41    34    34    39    80
            #>  4 Katie     55323    10    22    32    46    78
            #>  5 Olivia    94839    35    23    43    40    83
            #>  6 Timothy   38209    19    38    38    38    76
            #>  7 Grant     34096    27    48    44    43    92
            #>  8 Chloe     98432    42    49    42    35    91
            #>  9 Judy      19487    15    23    42    41    83
            #> 10 Justin    94029    20    30    37    41    78
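The same sort-and-subset idea can be wrapped in a small function so that the number of values to keep is a parameter. The sketch below is one possible way to do it, not part of the original answer: `top_n_sum` is a hypothetical helper name, and the data is the question's quiz data rebuilt as a plain data.frame so the block runs on its own. It uses `head()` instead of `[1:2]` so that a row with fewer than n non-missing scores does not produce NA.

```r
# Quiz data from the question, as a plain data.frame
df <- data.frame(
  Student = c("Aaron", "James", "Charlotte", "Katie", "Olivia",
              "Timothy", "Grant", "Chloe", "Judy", "Justin"),
  Quiz1 = c(31, 25, 41, 10, 35, 19, 27, 42, 15, 20),
  Quiz2 = c(42, 33, 34, 22, 23, 38, 48, 49, 23, 30),
  Quiz3 = c(36, 36, 34, 32, 43, 38, 44, 42, 42, 37),
  Quiz4 = c(36, 43, 39, 46, 40, 38, 43, 35, 41, 41)
)

# Hypothetical helper: sum the n largest values in each row of the
# columns whose names match `pattern`
top_n_sum <- function(data, pattern, n = 2) {
  m <- as.matrix(data[grepl(pattern, names(data))])
  apply(m, 1, function(x) {
    # sort() drops NA by default; head() copes with fewer than n values
    sum(head(sort(x, decreasing = TRUE), n))
  })
}

df$Total <- top_n_sum(df, "^Quiz", n = 2)
```

Because tied values are simply adjacent after sorting, ties need no special handling when only the total is required.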
            

            【Comments】:

              【Solution 7】:

              Based on this StackOverflow answer.

              library(tidyverse)
              
              df <- 
                structure(
                  list(
                    Student = c("Aaron", "James", "Charlotte", "Katie", "Olivia", 
                                "Timothy", "Grant", "Chloe", "Judy", "Justin"),
                    ID = c(30016, 87311, 61755, 55323, 94839, 38209, 34096, 
                           98432, 19487, 94029),
                    Quiz1 = c(31, 25, 41, 10, 35, 19, 27, 42, 15, 20),
                    Quiz2 = c(42, 33, 34, 22, 23, 38, 48, 49, 23, 30),
                    Quiz3 = c(36, 36, 34, 32, 43, 38, 44, 42, 42, 37),
                    Quiz4 = c(36, 43, 39, 46, 40, 38, 43, 35, 41, 41)
                  ),
                  row.names = c(NA, -10L),
                  class = c("tbl_df", "tbl", "data.frame")
                )
              
              df %>%
                rowwise() %>% 
                mutate(Quiz_Total = sum(sort(c(Quiz1,Quiz2,Quiz3,Quiz4), decreasing = TRUE)[1:2])) %>% 
                ungroup()
              #> # A tibble: 10 × 7
              #>    Student      ID Quiz1 Quiz2 Quiz3 Quiz4 Quiz_Total
              #>    <chr>     <dbl> <dbl> <dbl> <dbl> <dbl>      <dbl>
              #>  1 Aaron     30016    31    42    36    36         78
              #>  2 James     87311    25    33    36    43         79
              #>  3 Charlotte 61755    41    34    34    39         80
              #>  4 Katie     55323    10    22    32    46         78
              #>  5 Olivia    94839    35    23    43    40         83
              #>  6 Timothy   38209    19    38    38    38         76
              #>  7 Grant     34096    27    48    44    43         92
              #>  8 Chloe     98432    42    49    42    35         91
              #>  9 Judy      19487    15    23    42    41         83
              #> 10 Justin    94029    20    30    37    41         78
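If the quiz columns should not be listed by name, `c_across()` with a tidyselect helper may be used inside `rowwise()` instead. This is a hedged variant rather than the answer's own code; it assumes dplyr 1.0.0 or later, uses only the first three students from the question to stay short, and selects columns with the pattern `^Quiz\\d+$` so that a total column whose name starts with "Quiz" is not picked up on a re-run.

```r
library(dplyr)

# First three students from the question's data
df <- data.frame(
  Student = c("Aaron", "James", "Charlotte"),
  Quiz1 = c(31, 25, 41),
  Quiz2 = c(42, 33, 34),
  Quiz3 = c(36, 36, 34),
  Quiz4 = c(36, 43, 39)
)

res <- df %>%
  rowwise() %>%
  # c_across() collects the selected columns of the current row into one vector
  mutate(Quiz_Total = sum(sort(c_across(matches("^Quiz\\d+$")),
                               decreasing = TRUE)[1:2])) %>%
  ungroup()
```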
              

              【Comments】:
