【Question title】:How to classify variable count on multiple conditions in R
【Posted】:2021-02-11 09:06:51
【Question description】:

I have the following DF:

DF

ID       Var1         Var2              Type
IR-1     A1           X1,X2,X3          New
IR-2                                    Old
IR-3     A2           X1,X4             New
IR-4     A1           X1,X2,X3          New
IR-4     A3           X1,X2,X3          New
IR-4     A2           X1,X2,X3          New
IR-5     A1           X1,X3             New
IR-5     A2           X1,X3             New
IR-5     A3           X1,X3             New
IR-6                                    New
IR-7     A2           X1,X2,X3          New
IR-8                  X1,X2,X3          New
IR-9     A2           X8                New
IR-10                                   Old

Required output

Variables   Excl_Count   %         A1   A2   A3   X1   X2   X3   X4   GT   XN   XP   X8   KP   KL  
Total       10           100.00%   3    5    2    6    4    5    1    0    0    0    1    0    0
Blank_Var1  4             40.00%   0    0    0    1    1    1    0    0    0    0    0    0    0
Blank_Var2  3             30.00%   0    0    0    0    0    0    0    0    0    0    0    0    0
Blank_Both  3             30.00%   0    0    0    0    0    0    0    0    0    0    0    0    0
Blank_New   1             33.33%   0    0    0    0    0    0    0    0    0    0    0    0    0
Blank_Old   2             66.66%   0    0    0    0    0    0    0    0    0    0    0    0    0
Non_Blank   7             70.00%   3    5    2    6    4    5    1    0    0    0    1    0    0

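Using df, I want to understand the distribution across Var1 and Var2 for each unique ID,

where,

  • Total = unique count of ID, plus the row-wise count of IDs hit by each variable value (i.e. Var1 and Var2)
  • Excl_count = count where a particular ID has only one value as part of Var1 and Var2
  • Blank_Var1 = unique count of ID where Var1 is Null/NA/Blank or 0
  • Blank_Var2 = unique count of ID where Var2 is Null/NA/Blank or 0
  • Blank_Both = unique count of ID where both Var1 and Var2 are Null/NA/Blank or 0
  • Blank_New = unique count of ID where Var1 and Var2 are Null/NA/Blank or 0 and Type = New
  • Blank_Old = unique count of ID where both Var1 and Var2 are Null/NA/Blank or 0 and Type = Old
  • Non_Blank = unique count of ID where Var1 and Var2 are not Null/NA/Blank or 0
  • A1 to KL are the corresponding counts for each row.

Below is the code I tried, which does not work as expected: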

library(RMySQL)
library(dplyr)
library(tidyverse)
    
# Count Total
Total <- DF %>%
  dplyr::group_by(ID) %>%
  dplyr::mutate(count = n())
# Excl_Count 
Excl_Count  <- DF %>% 
  dplyr::group_by(ID) %>%
  dplyr::summarize("Count" = n_distinct(ID))
# Blank_Var1
Blank_Var1 <- DF %>% dplyr::filter(Var1 ==  '') %>%
  dplyr::group_by(ID) %>%
  dplyr::summarize("Count" = sum(count))
# Blank_Var2
Blank_Var2 <- DF %>% dplyr::filter(Var2 ==  '') %>%
  dplyr::group_by(ID) %>%
  dplyr::summarize("Count" = sum(count))
# Blank_Both
Blank_Both <- DF %>% dplyr::filter(Var1 ==  '' & Var2 == '') %>%
  dplyr::group_by(ID) %>%
  dplyr::summarize("Count" = sum(count))
# Blank_New
Blank_New <- DF %>% dplyr::filter(Var1 ==  '' & Type == 'New') %>%
  dplyr::group_by(ID) %>%
  dplyr::summarize("Count" = sum(count))
# Blank_Old
Blank_Old <- DF %>% dplyr::filter(Var1 ==  '' & Type == 'Old') %>%
  dplyr::group_by(ID) %>%
  dplyr::summarize("Count" = sum(count))

Input

structure(list(ID = c("IR-1", "IR-2", "IR-3", "IR-4", "IR-4", 
"IR-4", "IR-5", "IR-5", "IR-5", "IR-6", "IR-7", "IR-8", "IR-9", 
"IR-10"), Var1 = c("A1", "", "A2", "A1", "A2", "A3", "A1", "A2", 
"A3", "", "A2", "", "A2", ""), Var2 = c("X1,X2,X3", "", "X1,X4", 
"X1,X2,X3", "X1,X2,X3", "X1,X2,X3", "X1,X3", "X1,X3", "X1,X3", 
"", "X1,X2,X3", "X1,X2,X3", "X8", ""), Type = c("New", "Old", 
"New", "New", "New", "New", "New", "New", "New", "New", "New", 
"New", "New", "Old")), class = "data.frame", row.names = c(NA, 
-14L))

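【Comments】:

  • Can you explain what you have tried so far and where you are stuck?
  • I'm pretty sure you can reach at least one of the conditions you want on your own. So please show your effort before asking for help.
  • @maydin - updated with the code I tried, which doesn't work.
  • I can't follow this! Why is Total for A1 2? Why is X1 6? Too confusing. Your code is wrong from the very first step. What is DISTINCT?
  • What are GT, XP, KP, etc.?

Tags: r dataframe tidyverse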


【Solution 1】:

Create three intermediate objects (df1, df2, df3), then proceed as follows.

#load libraries
library(tidyverse)

Modified data

df <- structure(list(ID = c("IR-1", "IR-2", "IR-3", "IR-4", "IR-4", 
"IR-4", "IR-5", "IR-5", "IR-5", "IR-6", "IR-7", "IR-8", "IR-9", 
"IR-10"), Var1 = c("A1", "", "A2", "A1", "A2", "A3", "A1", "A2", 
"A3", "", "A2", "", "A2", ""), Var2 = c("", "", "X1,X4", 
"X1,X2,X3", "X1,X2,X3", "X1,X2,X3", "X1,X3", "X1,X3", "X1,X3", 
"", "X1,X2,X3", "X1,X2,X3", "X8", ""), Type = c("New", "Old", 
"New", "New", "New", "New", "New", "New", "New", "New", "New", 
"New", "New", "Old")), class = "data.frame", row.names = c(NA, 
-14L))

> df
      ID Var1     Var2 Type
1   IR-1   A1           New
2   IR-2                Old
3   IR-3   A2    X1,X4  New
4   IR-4   A1 X1,X2,X3  New
5   IR-4   A2 X1,X2,X3  New
6   IR-4   A3 X1,X2,X3  New
7   IR-5   A1    X1,X3  New
8   IR-5   A2    X1,X3  New
9   IR-5   A3    X1,X3  New
10  IR-6                New
11  IR-7   A2 X1,X2,X3  New
12  IR-8      X1,X2,X3  New
13  IR-9   A2       X8  New
14 IR-10                Old

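In the modified data above, I blanked out Var2 in the row for IR-1.

Code

(This assumes at most three flags in Var2; otherwise modify the arguments to separate accordingly.)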

df1 <- df %>% 
  group_by(Var1) %>%
  mutate(Total = n_distinct(ID),
         Blank_var1 = n_distinct(ID[is.na(Var1) | Var1 == "" | Var1 == "0"]),
         Blank_var2 = n_distinct(ID[is.na(Var2) | Var2 == "" | Var2 == "0"]),
         Blank_Both = n_distinct(ID[(is.na(Var1) | Var1 == "" | Var1 == "0") & (is.na(Var2) | Var2 == "" | Var2 == "0")]),
         Blank_new = n_distinct(ID[(is.na(Var1) | Var1 == "" | Var1 == "0") & (is.na(Var2) | Var2 == "" | Var2 == "0") & (Type == "New")]),
         Blank_old = n_distinct(ID[(is.na(Var1) | Var1 == "" | Var1 == "0") & (is.na(Var2) | Var2 == "" | Var2 == "0") & (Type == "Old")]),
         non_blank = Total - Blank_Both) %>%
  select(-c(ID, Var2, Type)) %>%
  filter(!(is.na(Var1) | Var1 == "" | Var1 == "0")) %>%
  pivot_longer(-Var1) %>%
  pivot_wider(id_cols = name, names_from = Var1, values_from = "value", values_fn = min) %>%
  ungroup()

# Check that Blank_var2 values aren't empty
# A tibble: 7 x 4
  name          A1    A2    A3
  <chr>      <int> <int> <int>
1 Total          3     5     2
2 Blank_var1     0     0     0
3 Blank_var2     1     0     0
4 Blank_Both     0     0     0
5 Blank_new      0     0     0
6 Blank_old      0     0     0
7 non_blank      3     5     2

#Second
  
df2 <- df %>% separate(Var2, into = paste0("Var2", 1:3), sep = ",") %>%
  pivot_longer(cols = c(Var21, Var22, Var23), names_to = "name", values_to = "Var2") %>%
  select(-name) %>%
  filter(!(is.na(Var2) | Var2 == "")) %>%
  group_by(Var2) %>%
  mutate(Total = n_distinct(ID),
         Blank_var1 = n_distinct(ID[is.na(Var1) | Var1 == "" | Var1 == "0"]),
         Blank_var2 = n_distinct(ID[is.na(Var2) | Var2 == "" | Var2 == "0"]),
         Blank_Both = n_distinct(ID[(is.na(Var1) | Var1 == "" | Var1 == "0") & (is.na(Var2) | Var2 == "" | Var2 == "0")]),
         Blank_new = n_distinct(ID[(is.na(Var1) | Var1 == "" | Var1 == "0") & (is.na(Var2) | Var2 == "" | Var2 == "0") & (Type == "New")]),
         Blank_old = n_distinct(ID[(is.na(Var1) | Var1 == "" | Var1 == "0") & (is.na(Var2) | Var2 == "" | Var2 == "0") & (Type == "Old")]),
         non_blank = Total - Blank_Both) %>%
  select(-c(ID, Var1, Type)) %>%
  pivot_longer(-Var2) %>%
  pivot_wider(id_cols = name, names_from = Var2, values_from = "value", values_fn = min)

# Check that blank_var1 isn't empty this time
# A tibble: 7 x 6
  name          X1    X4    X2    X3    X8
  <chr>      <int> <int> <int> <int> <int>
1 Total          5     1     3     4     1
2 Blank_var1     1     0     1     1     0
3 Blank_var2     0     0     0     0     0
4 Blank_Both     0     0     0     0     0
5 Blank_new      0     0     0     0     0
6 Blank_old      0     0     0     0     0
7 non_blank      5     1     3     4     1
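
The `separate` step in df2 hard-codes three target columns, so a cell with a fourth flag would be truncated. As a rough sketch of one way around that (not part of the original answer), `tidyr::separate_rows` splits each comma-separated cell straight into one row per flag, however many flags there are. The `toy` data frame and the extra flag `X9` below are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Toy data (hypothetical), same column layout as the question's df
toy <- data.frame(
  ID   = c("IR-1", "IR-3", "IR-8"),
  Var1 = c("A1", "A2", ""),
  Var2 = c("", "X1,X4", "X1,X2,X3,X9"),  # the last row carries four flags
  stringsAsFactors = FALSE
)

# One row per flag, regardless of how many flags a cell contains;
# blank or missing flags are dropped afterwards
long <- toy %>%
  separate_rows(Var2, sep = ",") %>%
  filter(!(is.na(Var2) | Var2 == ""))

# Distinct IDs per flag
flag_counts <- long %>%
  group_by(Var2) %>%
  summarise(Total = n_distinct(ID)) %>%
  ungroup()

flag_counts
```

In recent tidyr releases `separate_longer_delim` is the suggested successor to `separate_rows`; either should slot in ahead of the `group_by(Var2)` step.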

df3 <- df %>%
  summarise(Total = n_distinct(ID),
         Blank_var1 = n_distinct(ID[is.na(Var1) | Var1 == "" | Var1 == "0"]),
         Blank_var2 = n_distinct(ID[is.na(Var2) | Var2 == "" | Var2 == "0"]),
         Blank_Both = n_distinct(ID[(is.na(Var1) | Var1 == "" | Var1 == "0") & (is.na(Var2) | Var2 == "" | Var2 == "0")]),
         Blank_new = n_distinct(ID[(is.na(Var1) | Var1 == "" | Var1 == "0") & (is.na(Var2) | Var2 == "" | Var2 == "0") & (Type == "New")]),
         Blank_old = n_distinct(ID[(is.na(Var1) | Var1 == "" | Var1 == "0") & (is.na(Var2) | Var2 == "" | Var2 == "0") & (Type == "Old")]),
         non_blank = Total - Blank_Both) %>% pivot_longer(cols = 1:7, names_to = "Variable", values_to = "Excl_count") %>%
  mutate(`%` = case_when(Variable == "Total" ~ "100.00%",
                         Variable %in% c("Blank_var1", "Blank_var2", "Blank_Both", "non_blank") ~ paste0(round(Excl_count*100/Excl_count[Variable == "Total"], 2), "%"),
                         Variable == "Blank_new" | Variable == "Blank_old" ~ paste0(round(Excl_count*100/Excl_count[Variable == "Blank_Both"], 2), "%")))

> df3
# A tibble: 7 x 3
  Variable   Excl_count `%`    
  <chr>           <int> <chr>  
1 Total              10 100.00%
2 Blank_var1          4 40%    
3 Blank_var2          4 40%    
4 Blank_Both          3 30%    
5 Blank_new           1 33.33% 
6 Blank_old           2 66.67% 
7 non_blank           7 70%
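
The required output in the question shows two decimals everywhere (40.00%), while `paste0(round(x, 2), "%")` drops trailing zeros and prints 40%. If the fixed width matters, a small sketch using base R's `sprintf` could look like the following; the `counts` vector simply copies the numbers from the df3 output above.

```r
# Counts copied from the df3 output above
counts <- c(Total = 10, Blank_var1 = 4, Blank_var2 = 4, Blank_Both = 3, non_blank = 7)

# "%.2f" always prints two decimals; "%%" is a literal percent sign
pct <- sprintf("%.2f%%", 100 * counts / counts[["Total"]])
names(pct) <- names(counts)

pct
```

The same format string could replace the `paste0(round(...), "%")` calls inside `case_when`.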

Finally, merge all three to get this:

merge(df3, merge(df1, df2, by.x = "name", by.y = "name", sort = F), 
      by.x = "Variable", by.y = "name", sort = F)

    Variable Excl_count       % A1 A2 A3 X1 X4 X2 X3 X8
1      Total         10 100.00%  3  5  2  5  1  3  4  1
2 Blank_var1          4     40%  0  0  0  1  0  1  1  0
3 Blank_var2          4     40%  1  0  0  0  0  0  0  0
4 Blank_Both          3     30%  0  0  0  0  0  0  0  0
5  Blank_new          1  33.33%  0  0  0  0  0  0  0  0
6  Blank_old          2  66.67%  0  0  0  0  0  0  0  0
7  non_blank          7     70%  3  5  2  5  1  3  4  1
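
For anyone who prefers to stay in dplyr, the nested `merge` call can probably be written as two `left_join` calls. The sketch below uses small stand-in tibbles (`s1`, `s2`, `s3`, hypothetical names with a subset of the values above) in place of df1, df2 and df3 so that it runs on its own.

```r
library(dplyr)

# Small stand-ins for df3, df1 and df2 (only two rows and one flag column each)
s3 <- tibble(Variable = c("Total", "Blank_var1"), Excl_count = c(10L, 4L))
s1 <- tibble(name = c("Total", "Blank_var1"), A1 = c(3L, 0L))
s2 <- tibble(name = c("Total", "Blank_var1"), X1 = c(5L, 1L))

# Join on the row label; the key is called Variable on the left and name on the right
combined <- s3 %>%
  left_join(s1, by = c("Variable" = "name")) %>%
  left_join(s2, by = c("Variable" = "name"))

combined
```

`left_join` keeps the row order of the left table, which plays the role of `sort = F` in the `merge` version.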

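Explanation

  • You have to compute the same set of counts three times, along similar lines
  • First, with group_by on Var1
  • Second, with group_by on Var2, but only after separating the flags and pivoting longer so that they sit in a single column
  • Finally/third, without any grouping (hence I used summarise)
  • Essentially, the arguments to mutate/summarise are exactly the same in all three intermediate objects and were copy/pasted
  • Lastly, I used merge from base R (you could use left_join instead)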
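
Since the same block of counting expressions is pasted three times, one option is to wrap it in a function. The following is only a sketch of that idea, not the answer's own code: `is_blank` and `blank_counts` are hypothetical helper names, and it uses `summarise` throughout (rather than `mutate` followed by `values_fn = min`). The data is the original input from the question.

```r
library(dplyr)

# Original input data from the question
df <- data.frame(
  ID   = c("IR-1", "IR-2", "IR-3", "IR-4", "IR-4", "IR-4", "IR-5", "IR-5",
           "IR-5", "IR-6", "IR-7", "IR-8", "IR-9", "IR-10"),
  Var1 = c("A1", "", "A2", "A1", "A2", "A3", "A1", "A2", "A3", "", "A2", "",
           "A2", ""),
  Var2 = c("X1,X2,X3", "", "X1,X4", "X1,X2,X3", "X1,X2,X3", "X1,X2,X3",
           "X1,X3", "X1,X3", "X1,X3", "", "X1,X2,X3", "X1,X2,X3", "X8", ""),
  Type = c("New", "Old", rep("New", 11), "Old"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE where a value counts as blank (NA, "" or "0")
is_blank <- function(x) is.na(x) | x == "" | x == "0"

# Hypothetical helper: the blank-related counts for whatever grouping is active
blank_counts <- function(data) {
  data %>%
    summarise(
      Total      = n_distinct(ID),
      Blank_var1 = n_distinct(ID[is_blank(Var1)]),
      Blank_var2 = n_distinct(ID[is_blank(Var2)]),
      Blank_Both = n_distinct(ID[is_blank(Var1) & is_blank(Var2)]),
      Blank_new  = n_distinct(ID[is_blank(Var1) & is_blank(Var2) & Type == "New"]),
      Blank_old  = n_distinct(ID[is_blank(Var1) & is_blank(Var2) & Type == "Old"]),
      non_blank  = Total - Blank_Both
    ) %>%
    ungroup()
}

overall <- blank_counts(df)                      # no grouping, like df3
by_var1 <- df %>% group_by(Var1) %>% blank_counts()  # one row per Var1 value

overall
by_var1
```

The grouped result has one row per Var1 value (including the blank one), so it would still need the filter and pivot steps from the answer to reach the wide layout.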

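【Comments】:

  • Something is off with Blank_var2, because it can't be 0 for all variables. Thanks a lot :)
  • I made some edits to the data to show you that the code is correct. In fact, for the Var2 flags blank_var2 will always be 0, and similarly for the Var1 flags blank_var1 will always be 0.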