【Title】: How to get exclusive count in R dataframe
【Posted】: 2021-02-02 20:04:40
【Question】:

I have the following data frames in R:

DF <- tibble::tribble(
    ~ID, ~Check,
  "I-1",   "A1",
  "I-2",   "A2",
  "I-2",   "OT",
  "I-2",   "LP",
  "I-3",   "A1",
  "I-3",   "A2",
  "I-4",     NA,
  "I-5",     NA,
  "I-6",   "A1",
  "I-6",   "OT",
  "I-7",   "A2"
  )

DF2 <- tibble::tribble(
    ~ID,     ~Remarks,
  "I-1", "{X1,XR,XT}",
  "I-2",    "{X2,XR}",
  "I-3",           NA,
  "I-4", "{X1,XR,X2}",
  "I-5",       "{X1}",
  "I-6",       "{XT}",
  "I-7",    "{X1,X2}"
  )

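Using the two data frames above, I need output in the following format:

I want to determine the exclusive count of Check and Remark for each unique ID, as well as the count of combinations of each Check with every other Check, and likewise for Remark.

Note: the rows should be ordered from highest to lowest by the Exclusive_Count of Check. In my actual data frame, the numbers of unique Check and Remark values will most likely differ (e.g., 10 unique Remark values and 5 Check values).

DF_Output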

Remark   Exclusive_Count  %       X1  X2  XR  XT  Check  Exclusive_Count  %          A1  A2  OT  LP
Blank    1               33.33%   0   0   0   0   Blank  2                50.00%     0   0   0   0
X1       1               33.33%   0   2   2   1   A1     1                25.00%     0   1   1   0
X2       0               0.00%    2   0   1   0   A2     1                25.00%     1   0   1   1
XR       0               0.00%    2   2   0   1   OT     0                0.00%      1   1   0   1
XT       1               33.33%   1   0   1   0   LP     0                0.00%      0   1   1   0
Total    3               100.00%  5   4   4   2   Total  4                100.00%    2   3   3   2
                                               

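【Comments】:

  • What have you tried so far? By the way, be aware of the tags that Stack's recommendation system suggests to you: I don't see anything about graphics here, so ggplot2 seems superfluous (correct me if I'm wrong).
  • There is only one Blank for Remarks, so why is the Exclusive_Count for Blank 2?
  • @SinhNguyen: Sorry, my mistake... corrected.
  • What rule determines that the Blank Remark is aligned with the Blank Check, the X1 Remark with the A1 Check, and so on?

Tags: r dataframe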


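【Solution 1】:

The OP has requested a canonical answer. I have therefore created a function get_exclusive_counts() which takes the first two columns of any tibble, data.frame, or data.table, where the first column contains the IDs and the second column contains the payload, e.g., Check, in long format.

The function is independent of column names and can handle an arbitrary number of distinct items in the payload column. It returns one data.table for each input tibble: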

get_exclusive_counts(DF)
    Check Exclusive_Count       % A1 A2 LP OT
1:  Blank               2  50.00%  0  0  0  0
2:     A1               1  25.00%  0  1  0  1
3:     A2               1  25.00%  1  0  1  1
4:     LP               0   0.00%  0  1  0  1
5:     OT               0   0.00%  1  1  1  0
6: Totals               4 100.00%  2  3  2  3

For the second use case, DF2, the payload needs to be split into separate rows beforehand:

library(magrittr)
DF2 %>% 
  dplyr::mutate(Remarks = stringr::str_remove_all(Remarks, "[{}]")) %>% 
  tidyr::separate_rows(Remarks) %>% 
  get_exclusive_counts() 
   Remarks Exclusive_Count       % X1 X2 XR XT
1:   Blank               1  33.33%  0  0  0  0
2:      X1               1  33.33%  0  2  2  1
3:      XT               1  33.33%  1  0  1  0
4:      X2               0   0.00%  2  0  2  0
5:      XR               0   0.00%  2  2  0  1
6:  Totals               3 100.00%  5  4  5  2

Note that the name of the first column of the result table is retained from the input data.frame.
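The preprocessing step shown above (stripping the braces and splitting the payload into one row per item) can also be sketched in base R, for readers who prefer not to load stringr and tidyr. This is only a minimal illustration under that assumption; the names `clean`, `parts`, and `long` are chosen for this example and do not appear in the answer's code:

```r
# Small excerpt of the DF2 layout: one ID per row, payload as "{a,b,c}" or NA
ID      <- c("I-1", "I-2", "I-3")
Remarks <- c("{X1,XR,XT}", "{X2,XR}", NA)

# Remove the curly braces; NA values stay NA
clean <- gsub("[{}]", "", Remarks)

# Split on commas; an NA input yields a single NA element
parts <- strsplit(clean, ",", fixed = TRUE)

# Repeat each ID once per item to obtain the long format
long <- data.frame(
  ID      = rep(ID, lengths(parts)),
  Remarks = unlist(parts),
  stringsAsFactors = FALSE
)
long
```

The resulting data.frame has the same two-column long layout that get_exclusive_counts() expects as input.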

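The OP mentioned that the numbers of Remarks and Check values may differ. Therefore, it does not really make sense to cbind() the two result tables, as this only gives a sensible result if both have the same number of rows.

Also, the OP's expected result has some duplicated column names (at least Exclusive_Count and %, perhaps more), which indicates that the result is probably not meant for further processing but only for display/printing.

Printing results side by side

Nevertheless, I have created a function get_exclusive_counts_side_by_side() which prints the results of calls to get_exclusive_counts()

  • for an arbitrary number of input datasets,
  • with differing numbers of rows, and
  • with the last row (Totals) aligned.

The function returns a data.table with character columns.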
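The alignment idea (pad shorter tables with empty rows, but keep the Totals row at the bottom) can be illustrated in isolation. The following base R sketch is not the answer's data.table implementation; `pad_keep_last()` is a hypothetical helper introduced here, and the two small tables are made-up stand-ins for real results:

```r
# Hypothetical helper: pad a data.frame to n rows, keeping its last row last.
# Assumes df has at least one row.
pad_keep_last <- function(df, n) {
  k <- nrow(df)
  if (k >= n) return(df)
  # all rows but the last, then NA fillers, then the last row
  idx <- c(seq_len(k - 1L), rep(NA_integer_, n - k), k)
  out <- df[idx, , drop = FALSE]   # an NA row index yields a row of NAs
  rownames(out) <- NULL
  out
}

# Made-up result tables with different numbers of rows
a <- data.frame(Check = c("Blank", "A1", "Totals"),
                n = c(2L, 1L, 3L), stringsAsFactors = FALSE)
b <- data.frame(Remarks = c("X1", "Blank", "XT", "X2", "Totals"),
                n = c(2L, 1L, 1L, 0L, 4L), stringsAsFactors = FALSE)

n_max <- max(nrow(a), nrow(b))
side_by_side <- cbind(pad_keep_last(a, n_max), pad_keep_last(b, n_max))
side_by_side
```

For display, the remaining NA cells would still need to be replaced by empty strings, which is why a result of this kind ends up with character columns.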

The call below reproduces the OP's expected result:

get_exclusive_counts_side_by_side(
  DF2 %>% 
    dplyr::mutate(Remarks = stringr::str_remove_all(Remarks, "[{}]")) %>% 
    tidyr::separate_rows(Remarks),
  DF)
   Remarks Exclusive_Count       % X1 X2 XR XT  Check Exclusive_Count       % A1 A2 LP OT
1:   Blank               1  33.33%  0  0  0  0  Blank               2  50.00%  0  0  0  0
2:      X1               1  33.33%  0  2  2  1     A1               1  25.00%  0  1  0  1
3:      XT               1  33.33%  1  0  1  0     A2               1  25.00%  1  0  1  1
4:      X2               0   0.00%  2  0  2  0     LP               0   0.00%  0  1  0  1
5:      XR               0   0.00%  2  2  0  1     OT               0   0.00%  1  1  1  0
6:  Totals               3 100.00%  5  4  5  2 Totals               4 100.00%  2  3  2  3

Here is another use case to demonstrate that it handles differing numbers of rows and an arbitrary number of input datasets:

get_exclusive_counts_side_by_side(
  DF, 
  DF3 %>% 
    dplyr::mutate(Remarks = stringr::str_remove_all(Remarks, "[{}]")) %>% 
    tidyr::separate_rows(Remarks),
  DF)
    Check Exclusive_Count       % A1 A2 LP OT Remarks Exclusive_Count       % X1 X2 XR XT Y2 Y3 Y4  Check Exclusive_Count       % A1 A2 LP OT
1:  Blank               2  50.00%  0  0  0  0      X1               2  50.00%  0  2  2  1  1  1  0  Blank               2  50.00%  0  0  0  0
2:     A1               1  25.00%  0  1  0  1   Blank               1  25.00%  0  0  0  0  0  0  0     A1               1  25.00%  0  1  0  1
3:     A2               1  25.00%  1  0  1  1      XT               1  25.00%  1  0  1  0  0  0  0     A2               1  25.00%  1  0  1  1
4:     LP               0   0.00%  0  1  0  1      X2               0   0.00%  2  0  2  0  0  0  0     LP               0   0.00%  0  1  0  1
5:     OT               0   0.00%  1  1  1  0      XR               0   0.00%  2  2  0  1  0  0  0     OT               0   0.00%  1  1  1  0
6:                                                 Y2               0   0.00%  1  0  0  0  0  1  1                                         
7:                                                 Y3               0   0.00%  1  0  0  0  1  0  0                                         
8:                                                 Y4               0   0.00%  0  0  0  0  1  0  0                                         
9: Totals               4 100.00%  2  3  2  3  Totals               4 100.00%  7  4  5  2  3  2  1 Totals               4 100.00%  2  3  2  3

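Function definitions

The code looks rather bulky, but about half of it is comments, so it should be self-explanatory.

Also, about half of the code lines are due to the OP's additional requirements, e.g., the % column or the Totals row.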

get_exclusive_counts <- function(DF) {
  library(data.table)
  library(magrittr)
  # make copy of first 2 cols to preserve original attributes of DF
  DT <- as.data.table(DF[, 1:2])
  # retain original column names
  old <- colnames(DT)[1:2]
  # rename colnames in copy for convenience of programming
  setnames(DT, c("id", "val")) # col 1 contains id, col 2 contains payload
  # aggregate by id to find exclusive counts = ids with only one element
  tmp <- DT[, .N, keyby = id][N == 1L]
  # create table of exclusive counts by joining and aggregating
  excl <- DT[tmp, on = .(id)][, .(Exclusive_Count = .N), keyby = val] %>% 
    # append column of proportions, will be formatted after computing Totals
    .[, `%` := Exclusive_Count / sum(Exclusive_Count)]
  # anti-join to find remaining rows
  rem <- DT[!tmp, on = .(id)]
  # create co-occurrence matrix in long format by a self-join
  coocc <-   rem[rem, on = .(id), allow.cartesian = TRUE] %>% 
    # reshape to wide format and compute counts of co-occurrences w/o diagonals
    dcast(val ~ i.val, length, subset = .(val != i.val))
  # build final result table by merging both subresults
  merge(excl, coocc, by = "val", all = TRUE) %>% 
    # replace NA counts by 0 
    .[, lapply(.SD, nafill, fill = 0L), by = val] %>% 
    # clean-up: order by decreasing Exclusive_Count
    .[order(-Exclusive_Count)] %>% 
    # append Totals row
    rbind(., .[, c(.(val = "Totals"), lapply(.SD, sum)), .SDcols = is.numeric]) %>% 
    # clean-up: format proportion as percentage
    .[, `%` := sprintf("%3.2f%%", 100 * `%`)] %>% 
    # clean-up: Replace <NA> by "Blank" in val column
    .[is.na(val), val := "Blank"] %>%
    # rename val column
    setnames("val", old[2]) %>% 
    # return result visibly
    .[]
}

Here is the code of get_exclusive_counts_side_by_side():

get_exclusive_counts_side_by_side <- function(...) {
  library(data.table)
  library(magrittr)
  # process input, return list of subresults
  ec_list<- list(...) %>% 
    lapply(get_exclusive_counts)
  # create row indices for maximum rows
  rid <- ec_list %>% 
    lapply(nrow) %>%
    Reduce(max, .) %>% 
    {data.table(.rowid = 1:.)}
  # combine subresults 
  ec_list %>% 
    # insert empty rows if necessary
    lapply(function(.x) .x[
      , .rowid := .I][
        # but align last row
        .rowid == .N, .rowid := nrow(rid)][
          rid, on =.(.rowid)][
            , .rowid := NULL]
    ) %>%  
    # all data.tables have the same number of rows, now cbind()
    do.call(cbind, .) %>% 
    # replace all NA by empty character strings
    .[, lapply(.SD, . %>% as.character %>% fifelse(is.na(.), "", .))]
}

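Additional explanation

If I understand correctly, the exclusive count refers to IDs that have only one item (or NA) assigned. This is rather straightforward to compute by

  1. counting the number of items per ID,
  2. selecting only the IDs with exactly one item,
  3. selecting the rows of the input data.frame that belong to those IDs (using a join), and
  4. counting the occurrences of each item within the subset of exclusive rows.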
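The four steps above can be sketched in base R, independently of the data.table implementation. This is only an illustration: `exclusive_counts_sketch()` is a hypothetical helper, it uses `%in%` instead of a join for step 3, and unlike the full function it omits items whose exclusive count is zero:

```r
# Hypothetical helper (base R only), mirroring steps 1 to 4
exclusive_counts_sketch <- function(id, val) {
  # 1. count the number of items per ID
  n_per_id <- table(id)
  # 2. keep only the IDs that have exactly one item
  excl_ids <- names(n_per_id)[as.vector(n_per_id) == 1L]
  # 3. select the values belonging to those IDs
  excl_val <- val[id %in% excl_ids]
  # 4. count occurrences per item; NA is labelled "Blank"
  excl_val[is.na(excl_val)] <- "Blank"
  table(excl_val)
}

# The ID and Check columns of DF as plain vectors
id  <- c("I-1", "I-2", "I-2", "I-2", "I-3", "I-3", "I-4", "I-5", "I-6", "I-6", "I-7")
val <- c("A1", "A2", "OT", "LP", "A1", "A2", NA, NA, "A1", "OT", "A2")

res <- exclusive_counts_sketch(id, val)
res
```

For this input, the exclusive IDs are I-1, I-4, I-5, and I-7, so the counts are 1 for A1, 1 for A2, and 2 for Blank, in line with the Exclusive_Count column shown for DF.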

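In addition, the function handles the OP's additional requirements, which go beyond identifying exclusive counts:

  • adding a co-occurrence count matrix for the remaining, non-exclusive rows,
  • adding a column with the proportions of the exclusive counts at a specific position and formatting it as a percentage,
  • adding a Totals row,
  • replacing NAs by zero or by "Blank", respectively.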
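The co-occurrence matrix for the non-exclusive rows can also be obtained with a cross product of an ID-by-item incidence matrix. The sketch below swaps in this base R technique for the self-join and dcast() used in the function; it assumes that no (ID, item) pair occurs more than once, and the names `multi`, `inc`, and `coocc` are chosen for this example:

```r
# The ID and Check columns of DF as plain vectors
id  <- c("I-1", "I-2", "I-2", "I-2", "I-3", "I-3", "I-4", "I-5", "I-6", "I-6", "I-7")
val <- c("A1", "A2", "OT", "LP", "A1", "A2", NA, NA, "A1", "OT", "A2")

# Keep only rows whose ID has more than one item (the non-exclusive rows)
n_per_id <- table(id)
multi <- id %in% names(n_per_id)[as.vector(n_per_id) > 1L]

# Incidence matrix: rows = IDs, columns = items, 1 if the ID has the item
inc <- unclass(table(id[multi], val[multi]))

# t(inc) %*% inc counts, for each pair of items, the IDs that have both
coocc <- crossprod(inc)
diag(coocc) <- 0   # an item does not co-occur with itself

coocc
colSums(coocc)     # corresponds to the Totals row of the item columns
```

For DF this gives the same pair counts as the A1, A2, LP, and OT columns in the result of get_exclusive_counts(DF), for example 1 for the pair (A1, A2) and 0 for (A1, LP).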

Data

DF <- tibble::tribble(
  ~ID, ~Check,
  "I-1",   "A1",
  "I-2",   "A2",
  "I-2",   "OT",
  "I-2",   "LP",
  "I-3",   "A1",
  "I-3",   "A2",
  "I-4",     NA,
  "I-5",     NA,
  "I-6",   "A1",
  "I-6",   "OT",
  "I-7",   "A2"
)

DF2 <- tibble::tribble(
  ~ID,     ~Remarks,
  "I-1", "{X1,XR,XT}",
  "I-2",    "{X2,XR}",
  "I-3",           NA,
  "I-4", "{X1,XR,X2}",
  "I-5",       "{X1}",
  "I-6",       "{XT}",
  "I-7",    "{X1,X2}"
)

DF3 <- tibble::tribble(
  ~ID,     ~Remarks,
  "I-1", "{X1,XR,XT}",
  "I-2",    "{X2,XR}",
  "I-3",           NA,
  "I-4", "{X1,XR,X2}",
  "I-5",       "{X1}",
  "I-6",       "{XT}",
  "I-7",    "{X1,X2}",
  "I-8", "{X1,Y2,Y3}",
  "I-9",    "{Y2,Y4}",
  "I10",       "{X1}",
)

【Comments】:

    【Solution 2】:

    I think this does what you need... it may not be the most concise, but it seems to do the job.

    # Load Library
    library('tidyverse')
    
    ### CHECK ###
    # Load Check Table
    DF <- tibble::tribble(
      ~ID, ~Check,
      "I-1",   "A1",
      "I-2",   "A2",
      "I-2",   "OT",
      "I-2",   "LP",
      "I-3",   "A1",
      "I-3",   "A2",
      "I-4",     NA,
      "I-5",     NA,
      "I-6",   "A1",
      "I-6",   "OT",
      "I-7",   "A2"
    )
    
    # Count by ID
    DF <- DF %>%
      group_by(ID) %>%
      mutate(count = n())
    
    # Count by Check
    DF_X <- DF %>% dplyr::filter(count ==  1) %>%
      group_by(Check) %>%
      dplyr::summarize("Count" = sum(count))
    
    # Identify unique values of Check
    DF_UNIQUE <- unique(DF$Check)
    DF_FIN <- data.frame("Check" = DF_UNIQUE,stringsAsFactors = FALSE)
    
    # Join Counts by Check with unique list of Checks
    DF_FIN <- left_join(x = DF_FIN, y = DF_X, by = "Check")
    
    # Replace NA's with zeros
    DF_FIN[is.na(DF_FIN$Count),2] <- 0
    
    # Calculate Percentages
    DF_FIN <- DF_FIN  %>%
        mutate("Check Percentage" = `Count`/sum(`Count`))
    
    # Rename Columns
    colnames(DF_FIN) <- c("Check", "Exclusive Count", "Check Percentage")
    
    # Replace NA value with the word "BLANK"
    DF_FIN[is.na(DF_FIN$Check),1] <- "BLANK"
    
    # Sort by Exclusive Count and then by Check (alphabetical)
    DF_FIN <- DF_FIN %>%
      arrange(desc(`Exclusive Count`), Check)
    
    # Join Checks to itself and count instances
    DF_CHECKS <- full_join(x = DF, y = DF, by = "ID")
    
    DF_CHECKS <- DF_CHECKS %>%
      group_by(Check.x, Check.y) %>%
      dplyr::summarize("N" = n())
    
    DF_CHECKS_SPREAD <- DF_CHECKS %>% 
      tidyr::pivot_wider(names_from = Check.y, values_from = N)
    check_order <- DF_CHECKS_SPREAD$Check.x
    check_order[is.na(check_order)] <- 'NA'
    DF_CHECKS_SPREAD <- DF_CHECKS_SPREAD %>% select(check_order)
    
    # Set the diagonal to zeros
    for (i in 1:nrow(DF_CHECKS_SPREAD)){
      DF_CHECKS_SPREAD[i,i+1] <-0
    }
    
    # Rename Columns
    colnames(DF_CHECKS_SPREAD)[1] <- "Check"
    colnames(DF_CHECKS_SPREAD)[colnames(DF_CHECKS_SPREAD) == "NA"] <- "BLANK"
    
    # Drop the BLANK column
    DF_CHECKS_SPREAD$BLANK <- NULL
    
    # Replace NA value with the word "BLANK"
    DF_CHECKS_SPREAD[is.na(DF_CHECKS_SPREAD$Check),1] <- "BLANK"
    
    # Replace all other NA's with zero
    DF_CHECKS_SPREAD[is.na(DF_CHECKS_SPREAD)] <- 0
    
    # Join the two Checks data sets together & calculate grand totals
    FINAL_TABLE_CHECKS <- left_join(x = DF_FIN, y = DF_CHECKS_SPREAD, by = "Check")
    FINAL_TABLE_CHECKS <- FINAL_TABLE_CHECKS %>%
      bind_rows(summarise(.,
                          across(where(is.numeric), sum),
                          across(where(is.character), ~"Total")))
    
    
    ### REMARKS ###
    # Load Remarks table
    DF2 <- tibble::tribble(
      ~ID,     ~Remarks,
      "I-1", "{X1,XR,XT}",
      "I-2",    "{X2,XR}",
      "I-3",           NA,
      "I-4", "{X1,XR,X2}",
      "I-5",       "{X1}",
      "I-6",       "{XT}",
      "I-7",    "{X1,X2}"
    )
    
    # Remove the {} from the Remarks string
    DF2$Remarks <- str_replace_all(string = DF2$Remarks, c("\\{" = "", "\\}" = ""))
    
    # Expand string into rows
    DF2 <- separate_rows(DF2, Remarks, convert = TRUE)
    
    # Group and count by ID
    DF2 <- DF2 %>%
      group_by(ID) %>%
      mutate(count = n())
    
    # Count by Remarks
    DF2_X <- DF2 %>% dplyr::filter(count ==  1) %>%
      group_by(Remarks) %>%
      dplyr::summarize("Count" = sum(count))
    
    # Identify unique Remarks
    DF2_UNIQUE <- unique(DF2$Remarks)
    DF2_FIN <- data.frame("Remarks" = DF2_UNIQUE,stringsAsFactors = FALSE)
    
    # Join count of Remarks with unique list of Remarks
    DF2_FIN <- left_join(x = DF2_FIN, y = DF2_X, by = "Remarks")
    
    # Replace NA's with zeros
    DF2_FIN[is.na(DF2_FIN$Count),2] <- 0
    
    # Calculate Percentages
    DF2_FIN <- DF2_FIN  %>%
      mutate("Remarks Percentage" = `Count`/sum(`Count`))
    
    # Rename columns
    colnames(DF2_FIN) <- c("Remarks", "Exclusive Count", "Remarks Percentage")
    
    # Replace NA value with the word "BLANK"
    DF2_FIN[is.na(DF2_FIN$Remarks),1] <- "BLANK"
    
    # Sort by Exclusive Count and then by Remarks (alphabetical)
    DF2_FIN <- DF2_FIN %>%
      arrange(desc(`Exclusive Count`), Remarks)
    
    # Join Remarks to itself and count instances
    DF_REMARKS <- full_join(x = DF2, y = DF2, by = "ID")
    DF_REMARKS <- DF_REMARKS %>%
      group_by(Remarks.x, Remarks.y) %>%
      dplyr::summarize("N" = n())
    DF_REMARKS_SPREAD <- DF_REMARKS %>% 
      tidyr::pivot_wider(names_from = Remarks.y, values_from = N)
    check_order <- DF_REMARKS_SPREAD$Remarks.x
    check_order[is.na(check_order)] <- 'NA'
    DF_REMARKS_SPREAD <- DF_REMARKS_SPREAD %>% select(check_order)
    
    # Set the diagonal to zeros
    for (i in 1:nrow(DF_REMARKS_SPREAD)){
      DF_REMARKS_SPREAD[i,i+1] <-0
    }
    
    # Rename Columns
    colnames(DF_REMARKS_SPREAD)[1] <- "Remarks"
    colnames(DF_REMARKS_SPREAD)[colnames(DF_REMARKS_SPREAD) == "NA"] <- "BLANK"
    
    # Drop the BLANK column
    DF_REMARKS_SPREAD$BLANK <- NULL
    
    # Replace NA value with the word "BLANK"
    DF_REMARKS_SPREAD[is.na(DF_REMARKS_SPREAD$Remarks),1] <- "BLANK"
    
    # Replace all other NA's with zero
    DF_REMARKS_SPREAD[is.na(DF_REMARKS_SPREAD)] <- 0
    
    # Join the two Remarks data sets together & calculate grand totals
    FINAL_TABLE_REMARKS <- left_join(x = DF2_FIN, y = DF_REMARKS_SPREAD, by = "Remarks")
    FINAL_TABLE_REMARKS <- FINAL_TABLE_REMARKS %>%
      bind_rows(summarise(.,
                          across(where(is.numeric), sum),
                          across(where(is.character), ~"Total")))
    
    # Count Rows in Check and Remarks dataframes and add rows in dataframe
    # with less rows to match # of rows in other.
    checkRows <- nrow(FINAL_TABLE_CHECKS)
    remarksRows <- nrow(FINAL_TABLE_REMARKS)
    rowDiff <- abs(checkRows - remarksRows)
    
    if(checkRows < remarksRows){
      cat("Adding", rowDiff , "rows to the Checks dataframe.\n\n")
      FINAL_TABLE_CHECKS[nrow(FINAL_TABLE_CHECKS)+rowDiff,] <- NA
      FINAL_TABLE_CHECKS[nrow(FINAL_TABLE_CHECKS),] <- FINAL_TABLE_CHECKS[checkRows,]
      FINAL_TABLE_CHECKS[checkRows,] <- NA
    }else if(remarksRows < checkRows){
      cat("Adding", rowDiff , "rows to the Remarks dataframe.\n\n")
      FINAL_TABLE_REMARKS[nrow(FINAL_TABLE_REMARKS)+rowDiff,] <- NA
      FINAL_TABLE_REMARKS[nrow(FINAL_TABLE_REMARKS),] <- FINAL_TABLE_REMARKS[remarksRows,]
      FINAL_TABLE_REMARKS[remarksRows,] <- NA
    }else{
      cat("There is no difference in number of rows between Checks and Remarks.\n\n")
    }
    
    
    # Combine columns from Checks and Remarks into one table.
    RESULTS <- cbind(FINAL_TABLE_REMARKS, FINAL_TABLE_CHECKS)
    RESULTS$`Check Percentage` <- paste(round(100*RESULTS$`Check Percentage`,2), "%", sep="")
    RESULTS$`Remarks Percentage` <- paste(round(100*RESULTS$`Remarks Percentage`,2), "%", sep="")
    RESULTS
    

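    【Comments】:

    • @C Jeruzal - Thanks, the spread function throws an error. The error is: Error in DF_CHECKS %>% spread(Check.y, N) : could not find function "spread"
    • @user9211845 - Yes, I forgot that spread has been retired from the tidyr package. I updated the code to use tidyr::pivot_wider instead. It should work for you now.
    • I also added a column ordering step after pivot_wider because the column positions had changed. They should now match the order of the Remarks or Checks.
    • Also added the total percentage calculation and adjusted the percentage column names.
    • @C Jeruzal - Thanks a lot. RESULTS <- cbind(FINAL_TABLE_REMARKS, FINAL_TABLE_CHECKS) throws an error because the numbers of rows differ. How can I cbind tables with different numbers of rows?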