OP 已请求规范答案。因此,我创建了一个 函数 get_exclusive_counts(),它采用任何 tibble、data.frame 或 data.table 的前两列,其中第一列包含 IDs 和第二列包含 有效负载,例如,Check,采用长格式。
该函数独立于列名,可以处理 payload 列中任意数量的不同项目。它为每个输入小标题返回一个 data.table:
get_exclusive_counts(DF)
Check Exclusive_Count % A1 A2 LP OT
1: Blank 2 50.00% 0 0 0 0
2: A1 1 25.00% 0 1 0 1
3: A2 1 25.00% 1 0 1 1
4: LP 0 0.00% 0 1 0 1
5: OT 0 0.00% 1 1 1 0
6: Totals 4 100.00% 2 3 2 3
对于第二个用例DF2,payload 需要预先分成单独的行:
library(magrittr)
DF2 %>%
dplyr::mutate(Remarks = stringr::str_remove_all(Remarks, "[{}]")) %>%
tidyr::separate_rows(Remarks) %>%
get_exclusive_counts()
Remarks Exclusive_Count % X1 X2 XR XT
1: Blank 1 33.33% 0 0 0 0
2: X1 1 33.33% 0 2 2 1
3: XT 1 33.33% 1 0 1 0
4: X2 0 0.00% 2 0 2 0
5: XR 0 0.00% 2 2 0 1
6: Totals 3 100.00% 5 4 5 2
请注意,结果表的第一列的名称已从输入 data.frame 中保留。
OP 提到Remarks 和Check 的数量可能不同。因此,cbind() 这两个结果表实际上没有任何意义,因为这只会在行数相同的情况下给出合理的结果。
另外,OP 的预期结果有一些列名重复(至少Exclusive_Count、%,也许更多),这表明结果可能不会用于进一步处理,而仅用于显示/打印。
并排打印结果
但是,我创建了一个函数 get_exclusive_counts_side_by_side(),它打印调用 get_exclusive_counts() 的结果
- 对于任意数量的输入数据集,
- 具有不同的行数,并且
- 最后一行 (
Totals) 对齐。
该函数返回一个带有字符列的data.table。
下面的调用将重现 OP 的预期结果:
get_exclusive_counts_side_by_side(
DF2 %>%
dplyr::mutate(Remarks = stringr::str_remove_all(Remarks, "[{}]")) %>%
tidyr::separate_rows(Remarks),
DF)
Remarks Exclusive_Count % X1 X2 XR XT Check Exclusive_Count % A1 A2 LP OT
1: Blank 1 33.33% 0 0 0 0 Blank 2 50.00% 0 0 0 0
2: X1 1 33.33% 0 2 2 1 A1 1 25.00% 0 1 0 1
3: XT 1 33.33% 1 0 1 0 A2 1 25.00% 1 0 1 1
4: X2 0 0.00% 2 0 2 0 LP 0 0.00% 0 1 0 1
5: XR 0 0.00% 2 2 0 1 OT 0 0.00% 1 1 1 0
6: Totals 3 100.00% 5 4 5 2 Totals 4 100.00% 2 3 2 3
这是另一个用例,用于证明它可以处理不同的行和任意数量的输入数据集:
get_exclusive_counts_side_by_side(
DF,
DF3 %>%
dplyr::mutate(Remarks = stringr::str_remove_all(Remarks, "[{}]")) %>%
tidyr::separate_rows(Remarks),
DF)
Check Exclusive_Count % A1 A2 LP OT Remarks Exclusive_Count % X1 X2 XR XT Y2 Y3 Y4 Check Exclusive_Count % A1 A2 LP OT
1: Blank 2 50.00% 0 0 0 0 X1 2 50.00% 0 2 2 1 1 1 0 Blank 2 50.00% 0 0 0 0
2: A1 1 25.00% 0 1 0 1 Blank 1 25.00% 0 0 0 0 0 0 0 A1 1 25.00% 0 1 0 1
3: A2 1 25.00% 1 0 1 1 XT 1 25.00% 1 0 1 0 0 0 0 A2 1 25.00% 1 0 1 1
4: LP 0 0.00% 0 1 0 1 X2 0 0.00% 2 0 2 0 0 0 0 LP 0 0.00% 0 1 0 1
5: OT 0 0.00% 1 1 1 0 XR 0 0.00% 2 2 0 1 0 0 0 OT 0 0.00% 1 1 1 0
6: Y2 0 0.00% 1 0 0 0 0 1 1
7: Y3 0 0.00% 1 0 0 0 1 0 0
8: Y4 0 0.00% 0 0 0 0 1 0 0
9: Totals 4 100.00% 2 3 2 3 Totals 4 100.00% 7 4 5 2 3 2 1 Totals 4 100.00% 2 3 2 3
函数定义
代码看起来相当庞大,但有一半是 cmets。所以,代码应该是不言自明的。
另外,大约一半的代码行是由于 OP 的附加要求,例如 % 列或 Totals 行。
get_exclusive_counts <- function(DF) {
library(data.table)
library(magrittr)
# make copy of first 2 cols to preserve original attributes of DF
DT <- as.data.table(DF[, 1:2])
# retain original column names
old <- colnames(DT)[1:2]
# rename colnames in copy for convenience of programming
setnames(DT, c("id", "val")) # col 1 contains id, col 2 contains payload
# aggregate by id to find exclusive counts = ids with only one element
tmp <- DT[, .N, keyby = id][N == 1L]
# create table of exclusive counts by joining and aggregating
excl <- DT[tmp, on = .(id)][, .(Exclusive_Count = .N), keyby = val] %>%
# append column of proportions, will be formatted after computing Totals
.[, `%` := Exclusive_Count / sum(Exclusive_Count)]
# anti-join to find remaining rows
rem <- DT[!tmp, on = .(id)]
# create co-occurrence matrix in long format by a self-join
coocc <- rem[rem, on = .(id), allow.cartesian = TRUE] %>%
# reshape to wide format and compute counts of co-occurrences w/o diagonals
dcast(val ~ i.val, length, subset = .(val != i.val))
# build final result table by merging both subresults
merge(excl, coocc, by = "val", all = TRUE) %>%
# replace NA counts by 0
.[, lapply(.SD, nafill, fill = 0L), by = val] %>%
# clean-up: order by decreasing Exclusive_Counts %>%
.[order(-Exclusive_Count)] %>%
# append Totals row
rbind(., .[, c(.(val = "Totals"), lapply(.SD, sum)), .SDcols = is.numeric]) %>%
# clean-up: format proportion as percentage
.[, `%` := sprintf("%3.2f%%", 100 * `%`)] %>%
# clean-up: Replace <NA> by "Blank" in val column
.[is.na(val), val := "Blank"] %>%
# rename val column
setnames("val", old[2]) %>%
# return result visibly
.[]
}
这是get_exclusive_counts_side_by_side()的代码:
get_exclusive_counts_side_by_side <- function(...) {
library(data.table)
library(magrittr)
# process input, return list of subresults
ec_list<- list(...) %>%
lapply(get_exclusive_counts)
# create row indices for maximum rows
rid <- ec_list %>%
lapply(nrow) %>%
Reduce(max, .) %>%
{data.table(.rowid = 1:.)}
# combine subresults
ec_list %>%
# insert empty rows if necessary
lapply(function(.x) .x[
, .rowid := .I][
# but align last row
.rowid == .N, .rowid := nrow(rid)][
rid, on =.(.rowid)][
, .rowid := NULL]
) %>%
# all data.tables have the same number of rows, now cbind()
do.call(cbind, .) %>%
# replace all NA by empty character strings
.[, lapply(.SD, . %>% as.character %>% fifelse(is.na(.), "", .))]
}
补充说明
如果我理解正确,独占计数 是指ID 仅分配了一项(或NA)。这是相当直接的计算方式
- 统计每个ID的项目数,
- 只选择一项ID,
- 选择输入 data.frame 中属于那些 IDs 的行(使用连接),并且
- 计算项目在独占行子集中的出现次数。
此外,该函数处理 OP 的附加要求,这些要求超出了排他计数的识别:
- 添加剩余的非排他的共现计数矩阵
行,
- 在特定位置添加一列排他计数的比例并将其格式化为百分比,
- 添加 Totals 行,
- 分别将
NAs 替换为零或"Blank"。
数据
DF <- tibble::tribble(
~ID, ~Check,
"I-1", "A1",
"I-2", "A2",
"I-2", "OT",
"I-2", "LP",
"I-3", "A1",
"I-3", "A2",
"I-4", NA,
"I-5", NA,
"I-6", "A1",
"I-6", "OT",
"I-7", "A2"
)
DF2 <- tibble::tribble(
~ID, ~Remarks,
"I-1", "{X1,XR,XT}",
"I-2", "{X2,XR}",
"I-3", NA,
"I-4", "{X1,XR,X2}",
"I-5", "{X1}",
"I-6", "{XT}",
"I-7", "{X1,X2}"
)
DF3 <- tibble::tribble(
~ID, ~Remarks,
"I-1", "{X1,XR,XT}",
"I-2", "{X2,XR}",
"I-3", NA,
"I-4", "{X1,XR,X2}",
"I-5", "{X1}",
"I-6", "{XT}",
"I-7", "{X1,X2}",
"I-8", "{X1,Y2,Y3}",
"I-9", "{Y2,Y4}",
"I10", "{X1}",
)