【Question title】: How to create matrix of distribution in R
【Posted】: 2021-07-30 17:54:45
【Question description】:

I have the following data frame in R.

ID         Date                  List             Type
P-10012    2020-04-15 12:13:15   ABC,ABD,BCD      TR1
P-10012    2020-04-15 12:13:15   ABC,ABD,BCD      RES
P-10012    2020-04-15 12:13:15   ABC,ABD,BCD      FTT
P-10013    2020-04-12 17:10:05                    TR1
P-10013    2020-04-12 17:10:05                    FTT
P-10013    2020-04-12 17:10:05                    ZXR
P-10014    2020-04-10 04:30:19   ABD,BCD          TR1
P-10014    2020-04-10 04:30:19   ABD,BCD          ZXR
P-10015    2020-04-10 14:13:15   ABC              
P-10016    2020-04-10 13:13:15   
P-10017    2020-03-18 10:13:15   ABC,ABD,BCD      TR1



dput(df)
df <- structure(list(ID = c("P-10012", "P-10012", 
"P-10012", "P-10013", "P-10013", "P-10013", 
"P-10014", "P-10014", "P-10015", "P-10016", 
"P-10017"), Date = c("2020-04-15 12:13:15", "2020-04-15 12:13:15", 
"2020-04-15 12:13:15", "2020-04-12 17:10:05", "2020-04-12 17:10:05", 
"2020-04-12 17:10:05", "2020-04-10 04:30:19", "2020-04-10 04:30:19", 
"2020-04-10 14:13:15", "2020-04-10 13:13:15", "2020-03-18 10:13:15"
), Type = c("TR1", "RES", "FTT", "TR1", "FTT", "ZXR", "TR1", "ZXR", NA, NA, "TR1"), List = c("ABC,ABD,BCD", "ABC,ABD,BCD", "ABC,ABD,BCD", 
"", "", "", "ABD,BCD", "ABD,BCD", "ABC", "", "ABC,ABD,BCD")), class = "data.frame", row.names = c(NA, 
-11L))

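The data frame is structured so that if a given ID has multiple rows, every one of those rows carries the same List value; the rows differ only in their Type values. If an ID has only one Type value, it always has exactly one row.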
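The invariant described above can be checked before building any counts. The following is only a sketch in base R on a small subset of the question's rows; the variable names `lists_per_id` and `types_unique` are made up for illustration.

```r
# Toy data: a subset of the rows from the question's dput.
df <- data.frame(
  ID   = c("P-10012", "P-10012", "P-10012", "P-10014", "P-10014", "P-10015"),
  List = c("ABC,ABD,BCD", "ABC,ABD,BCD", "ABC,ABD,BCD", "ABD,BCD", "ABD,BCD", "ABC"),
  Type = c("TR1", "RES", "FTT", "TR1", "ZXR", NA),
  stringsAsFactors = FALSE
)

# Number of distinct List values per ID; the stated invariant says this is always 1.
lists_per_id <- tapply(df$List, df$ID, function(x) length(unique(x)))
stopifnot(all(lists_per_id == 1))

# Within one ID, no Type value should appear on more than one row.
types_unique <- tapply(df$Type, df$ID, function(x) anyDuplicated(x) == 0)
stopifnot(all(types_unique))
```

If either `stopifnot()` fails on the real data, counting distinct IDs per List value would need an explicit de-duplication step first.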

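I need to create the following distribution of List and Type values for Apr-20.

Here, the first 7 rows of my Required Df are distinct counts of ID subject to a condition (whether List or Type is blank), together with the distribution of all unique List and Type values. For these 7 rows, Distinct_Count should be divided by Total to get Percentage. From row 8 onward, however, if the unique value comes from List, it should be divided by the total distinct count of Non_Blank_List, and if the value comes from Type, it should be divided by the total distinct count of Non_Blank_Type.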
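To make the denominator rule concrete, here is a small worked sketch in base R. The counts are copied from the Required Df below; the `origin` lookup and the `denominator()` helper are hypothetical names introduced only for this illustration.

```r
# Distinct-ID counts copied from the question's expected table.
counts <- c(Total_ID = 5, Blank_List = 2, Non_Blank_List = 3, Non_Blank_Type = 3,
            ABC = 1, TR1 = 0)

# Hypothetical lookup: the column each individual value comes from.
origin <- c(ABC = "List", TR1 = "Type")

# Choose the denominator for one row name according to the stated rule.
denominator <- function(row) {
  if (!row %in% names(origin)) return(counts[["Total_ID"]])  # summary rows
  if (origin[[row]] == "List") counts[["Non_Blank_List"]] else counts[["Non_Blank_Type"]]
}

pct <- sapply(names(counts), function(r) 100 * counts[[r]] / denominator(r))
round(pct, 2)
```

Under this rule ABC is 1 out of the 3 IDs with a non-blank List, which is where the 33.33% in the expected table comes from, while Blank_List is 2 out of all 5 IDs.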

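With the matrix below, I simply want to understand how the unique values of List and Type are distributed and how they combine with the other values.

Note that, for the sake of the example, I have reduced List and Type to 3 and 4 unique values respectively. In my actual data frame the number is much higher and changes every month, so please do not hard-code the values.

I have tried several approaches but still cannot get the required output.

Required Df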

APR-21           Distinct_Count    Percentage    ABC     ABD      BCD     TR1    RES    FTT     ZXR
Total_ID         5                 100.00%       2       2        2       3      1      2       2 
Blank_List       2                 40.00%        0       0        0       1      0      1       1
Blank_Type       2                 40.00%        1       0        0       0      0      0       0
Both_Blank       1                 20.00%        0       0        0       0      0      0       0
Non_Blank_List   3                 60.00%        2       2        2       2      1      1       1         
Non_Blank_Type   3                 60.00%        1       2        2       3      1      1       2
Both_Non_Blank   2                 40.00%        1       2        2       2      1      1       1
ABC              1                 33.33%        2       1        1       1      1      1       0
ABD              0                  0.00%        1       2        2       2      1      1       1
BCD              0                  0.00%        1       2        2       2      1      1       1
TR1              0                  0.00%        1       2        2       3      1      1       1 
RES              0                  0.00%        1       1        1       1      1      1       0    
FTT              0                  0.00%        1       1        1       2      1      2       1 
ZXR              0                  0.00%        0       1        1       1      0      1       2

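【Question comments】:

@akrun - updated the dput
Does an ID always have the same Date value?
@DanChaltiel - Yes, a unique ID will always have the same date.
Are you sure about the Distinct_Count values below "Both_Non_Blank"? Only one row has ABC alone, while the other columns seem to consider every row that contains ABC, with or without other values.
@DanChaltiel - Yes. The second column, Distinct_Count, gives the distinct count of unique IDs for the given condition or value (ABC, ABD, etc.). The columns after the second are combination counts with the value named in the row.

Tags: r dataframe dplyr tidyverse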

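【Solution 1】:

The biggest challenge is that the keys appear both as rows and as columns.

I used 2 custom functions to count the occurrences:

getcount(), which takes a condition as its argument (if quosures are new to you, the rlang documentation covers them), for your special condition rows (Total_ID, Blank_List, ...)
getcount2(), which takes a plain character argument, for your case rows (ABC, BCD, TR1...)

Both work in roughly the same way. We compute the counts for every individual case and for the total separately, always grouping by ID, and then concatenate the results.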
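The core pattern behind both functions can be sketched on its own: group by ID, test whether any row of the group satisfies a condition, and sum the resulting flags. The snippet below is a minimal illustration on made-up data, not the answer's code; `toy` and `count_ids()` are hypothetical names, and the condition is passed unquoted and forwarded with dplyr's `{{ }}` (embrace) operator.

```r
library(dplyr)

# Toy data: one row per (ID, Type) pair; column names mirror the question.
toy <- tibble(
  ID   = c("a", "a", "b", "c"),
  List = c("ABC", "ABC", "", "ABD"),
  Type = c("TR1", "RES", "TR1", NA)
)

# Hypothetical helper: count the IDs that have at least one row meeting `cond`.
count_ids <- function(data, cond) {
  data %>%
    group_by(ID) %>%
    summarise(hit = any({{ cond }}), .groups = "drop") %>%
    pull(hit) %>%
    sum(na.rm = TRUE)
}

count_ids(toy, List != "")    # IDs with a non-blank List
count_ids(toy, is.na(Type))   # IDs with a missing Type
```

Because the flag is computed once per ID, an ID with several matching rows is still counted only once, which is what a distinct count requires.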

Here is the code:

library(tidyverse)
library(lubridate)

df <- structure(list(ID = c("P-10012", "P-10012", "P-10012", "P-10013", "P-10013", "P-10013", 
                            "P-10014", "P-10014", "P-10015", "P-10016", "P-10017"), 
                     Date = c("2020-04-15 12:13:15", "2020-04-15 12:13:15", "2020-04-15 12:13:15", 
                              "2020-04-12 17:10:05", "2020-04-12 17:10:05", "2020-04-12 17:10:05", 
                              "2020-04-10 04:30:19", "2020-04-10 04:30:19", "2020-04-10 14:13:15", 
                              "2020-04-10 13:13:15", "2020-03-18 10:13:15"), 
                     Type = c("TR1", "RES", "FTT", "TR1", "FTT", "ZXR", "TR1", "ZXR", NA, NA, "TR1"),
                     List = c("ABC,ABD,BCD", "ABC,ABD,BCD", "ABC,ABD,BCD", "", "", "", "ABD,BCD", 
                              "ABD,BCD", "ABC", "", "ABC,ABD,BCD")), 
                class = "data.frame", row.names = c(NA, -11L))

#extract all the individual values from Type and List
cases = c(df$Type, str_split(df$List, ", ?", simplify=TRUE)) %>% unique() %>% 
  sort() %>% .[.!=""] %>% rlang::set_names()

#util function
is_blank = function(x) is.na(x) | x==""

#get count for summary rows (TotalID, Blank_list, ...)
getcount = function(cond){
  x = map_dbl(cases, ~df %>% 
                filter(month(Date)==4) %>%
                group_by(ID) %>% 
                summarise(rtn=any({{cond}} & (str_detect(Type, .x) | str_detect(List, .x)))) %>% 
                pull() %>% sum(na.rm=TRUE)
  )
  x_tot = df %>% 
    filter(month(Date)==4) %>% 
    group_by(ID) %>% 
    summarise(rtn=any({{cond}})) %>% 
    pull() %>% sum(na.rm=TRUE)
  
  c(x_tot, x)
}

#get count for cases rows (ABC, BCD, TR1...)
getcount2 = function(key){
  x = map_dbl(cases, ~df %>% 
                filter(month(Date)==4) %>%
                group_by(ID) %>% 
                summarise(rtn=any(
                  (key %in% Type  | str_detect(List, key)) &
                    (str_detect(Type, .x ) | str_detect(List, .x ))
                )) %>% 
                pull() %>% sum(na.rm=TRUE)
  )
  x_tot = df %>% 
    filter(month(Date)==4) %>% 
    group_by(ID) %>% 
    summarise(rtn=any(List==key)) %>% 
    pull() %>% sum(na.rm=TRUE)
  
  c(tot=x_tot, x)
}


#here we go!
tibble(x=c("Distinct_Count", cases)) %>% 
  mutate(
    Total_ID=getcount(TRUE),
    Blank_List=getcount(is_blank(List)),
    Blank_Type=getcount(is_blank(Type)),
    Blank_Both=getcount(is_blank(List) & is_blank(Type)),
    Non_Blank_List=getcount(!is_blank(List)),
    Non_Blank_Type=getcount(!is_blank(Type)),
    Non_Blank_Both=getcount(!is_blank(List) & !is_blank(Type))
  ) %>% 
  bind_cols(map_dfc(cases, ~getcount2(.x))) %>% 
  column_to_rownames("x") %>% 
  t() %>% as.data.frame() %>% 
  mutate(Percentage = scales::percent(Distinct_Count/max(Distinct_Count)), .after="Distinct_Count")
#>                Distinct_Count Percentage ABC ABD BCD FTT RES TR1 ZXR
#> Total_ID                    5       100%   2   2   2   2   1   3   2
#> Blank_List                  2        40%   0   0   0   1   0   1   1
#> Blank_Type                  2        40%   1   0   0   0   0   0   0
#> Blank_Both                  1        20%   0   0   0   0   0   0   0
#> Non_Blank_List              3        60%   2   2   2   1   1   2   1
#> Non_Blank_Type              3        60%   1   2   2   2   1   3   2
#> Non_Blank_Both              2        40%   1   2   2   1   1   2   1
#> ABC                         1        20%   2   1   1   1   1   1   0
#> ABD                         0         0%   1   2   2   1   1   2   1
#> BCD                         0         0%   1   2   2   1   1   2   1
#> FTT                         0         0%   1   1   1   2   1   2   1
#> RES                         0         0%   1   1   1   1   1   1   0
#> TR1                         0         0%   1   2   2   2   1   3   2
#> ZXR                         0         0%   0   1   1   1   0   2   2

Created on 2021-05-12 by the reprex package (v2.0.0)

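Note that there are a few small differences from your expected output, but I think they are minor mistakes in your complex example. For instance, ABC has Distinct_Count==1, so out of 5 it should not be 33%. Also, ZXR appears together with TR1 twice (IDs 13 and 14).

【Comments】:

Thank you very much for your help :)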