【Posted】: 2021-02-11 09:06:51
【Question】:
I have the following data frame, DF:
DF
ID     Var1  Var2      Type
IR-1   A1    X1,X2,X3  New
IR-2                   Old
IR-3   A2    X1,X4     New
IR-4   A1    X1,X2,X3  New
IR-4   A3    X1,X2,X3  New
IR-4   A2    X1,X2,X3  New
IR-5   A1    X1,X3     New
IR-5   A2    X1,X3     New
IR-5   A3    X1,X3     New
IR-6                   New
IR-7   A2    X1,X2,X3  New
IR-8         X1,X2,X3  New
IR-9   A2    X8        New
IR-10                  Old
Desired output
Variables   Excl_Count  %        A1  A2  A3  X1  X2  X3  X4  GT  XN  XP  X8  KP  KL
Total       10          100.00%  3   5   2   6   4   5   1   0   0   0   1   0   0
Blank_Var1  4           40.00%   0   0   0   1   1   1   0   0   0   0   0   0   0
Blank_Var2  3           30.00%   0   0   0   0   0   0   0   0   0   0   0   0   0
Blank_Both  3           30.00%   0   0   0   0   0   0   0   0   0   0   0   0   0
Blank_New   1           33.33%   0   0   0   0   0   0   0   0   0   0   0   0   0
Blank_Old   2           66.66%   0   0   0   0   0   0   0   0   0   0   0   0   0
Non_Blank   7           70.00%   3   5   2   6   4   5   1   0   0   0   1   0   0
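The % column is not defined in the question, so the following is only a guess at how it might be derived. The sketch assumes that Blank_New and Blank_Old are shares of Blank_Both and that every other row is a share of Total; the names `counts`, `sub_rows`, `denom` and `pct` are illustrative.

```r
# Counts copied from the Excl_Count column of the desired output.
counts <- c(Total = 10, Blank_Var1 = 4, Blank_Var2 = 3, Blank_Both = 3,
            Blank_New = 1, Blank_Old = 2, Non_Blank = 7)

# Assumption: Blank_New and Blank_Old are shares of Blank_Both;
# every other row is a share of Total.
sub_rows <- names(counts) %in% c("Blank_New", "Blank_Old")
denom <- ifelse(sub_rows, counts[["Blank_Both"]], counts[["Total"]])

# Format as percentages with two decimals, keeping the row labels.
pct <- setNames(sprintf("%.2f%%", 100 * counts / denom), names(counts))
print(pct)
```

Under this assumption Blank_Old comes out as 66.67% because `sprintf` rounds, whereas the table shows 66.66%, which suggests the original figure was truncated.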
Using DF, I want to understand how the Var1 and Var2 values are distributed across unique IDs.
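where:
- Total = unique count of ID, and the horizontal (row-wise) count of variable hits (i.e. Var1 and Var2), including ID
- Excl_Count = if a particular ID has only one value as part of Var1 or Var2
- Blank_Var1 = unique ID count where Var1 is Null/NA/Blank or 0
- Blank_Var2 = unique ID count where Var2 is Null/NA/Blank or 0
- Blank_Both = unique ID count where both Var1 and Var2 are Null/NA/Blank or 0
- Blank_New = unique ID count where both Var1 and Var2 are Null/NA/Blank or 0, and Type = New
- Blank_Old = unique ID count where both Var1 and Var2 are Null/NA/Blank or 0, and Type = Old
- Non_Blank = unique ID count where Var1 or Var2 is not Null/NA/Blank or 0
- A1 to KL are the corresponding counts for each row.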
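The row definitions above can be sketched in base R as follows. This is one possible reading, not a confirmed solution: it assumes "blank" means NA, an empty string, or the string "0", and the helpers `is_blank` and `n_ids` are hypothetical names introduced here. The data is copied from the `dput()` output in the question.

```r
# Input data, copied from the question's dput() output.
DF <- data.frame(
  ID = c("IR-1", "IR-2", "IR-3", "IR-4", "IR-4", "IR-4", "IR-5", "IR-5",
         "IR-5", "IR-6", "IR-7", "IR-8", "IR-9", "IR-10"),
  Var1 = c("A1", "", "A2", "A1", "A2", "A3", "A1", "A2", "A3", "", "A2",
           "", "A2", ""),
  Var2 = c("X1,X2,X3", "", "X1,X4", "X1,X2,X3", "X1,X2,X3", "X1,X2,X3",
           "X1,X3", "X1,X3", "X1,X3", "", "X1,X2,X3", "X1,X2,X3", "X8", ""),
  Type = c("New", "Old", "New", "New", "New", "New", "New", "New", "New",
           "New", "New", "New", "New", "Old"),
  stringsAsFactors = FALSE
)

# Assumption: "blank" means NA, an empty string, or the string "0".
is_blank <- function(x) is.na(x) | trimws(x) %in% c("", "0")

# Number of distinct IDs among the rows selected by a logical vector.
n_ids <- function(rows) length(unique(DF$ID[rows]))

b1 <- is_blank(DF$Var1)
b2 <- is_blank(DF$Var2)

counts <- c(
  Total      = length(unique(DF$ID)),
  Blank_Var1 = n_ids(b1),
  Blank_Var2 = n_ids(b2),
  Blank_Both = n_ids(b1 & b2),
  Blank_New  = n_ids(b1 & b2 & DF$Type == "New"),
  Blank_Old  = n_ids(b1 & b2 & DF$Type == "Old"),
  Non_Blank  = n_ids(!(b1 & b2))
)
print(counts)
# Total 10, Blank_Var1 4, Blank_Var2 3, Blank_Both 3,
# Blank_New 1, Blank_Old 2, Non_Blank 7
```

On the sample data these values agree with the Excl_Count column of the desired output. Note that each count is taken over distinct IDs after filtering, without grouping by ID first.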
Below is the code I tried, which does not work as expected:
library(RMySQL)
library(dplyr)
library(tidyverse)
# Count Total
Total <- DF %>%
  dplyr::group_by(ID) %>%
  dplyr::mutate(count = n())
# Excl_Count
Excl_Count <- DF %>%
  dplyr::group_by(ID) %>%
  dplyr::summarize("Count" = n_distinct(ID))
# Blank_Var1
Blank_Var1 <- DF %>%
  dplyr::filter(Var1 == '') %>%
  dplyr::group_by(ID) %>%
  dplyr::summarize("Count" = sum(count))
# Blank_Var2
Blank_Var2 <- DF %>%
  dplyr::filter(Var2 == '') %>%
  dplyr::group_by(ID) %>%
  dplyr::summarize("Count" = sum(count))
# Blank_Both
Blank_Both <- DF %>%
  dplyr::filter(Var1 == '' & Var2 == '') %>%
  dplyr::group_by(ID) %>%
  dplyr::summarize("Count" = sum(count))
# Blank_New
Blank_New <- DF %>%
  dplyr::filter(Var1 == '' & Type == 'New') %>%
  dplyr::group_by(ID) %>%
  dplyr::summarize("Count" = sum(count))
# Blank_Old
Blank_Old <- DF %>%
  dplyr::filter(Var1 == '' & Type == 'Old') %>%
  dplyr::group_by(ID) %>%
  dplyr::summarize("Count" = sum(count))
Input
structure(list(ID = c("IR-1", "IR-2", "IR-3", "IR-4", "IR-4",
"IR-4", "IR-5", "IR-5", "IR-5", "IR-6", "IR-7", "IR-8", "IR-9",
"IR-10"), Var1 = c("A1", "", "A2", "A1", "A2", "A3", "A1", "A2",
"A3", "", "A2", "", "A2", ""), Var2 = c("X1,X2,X3", "", "X1,X4",
"X1,X2,X3", "X1,X2,X3", "X1,X2,X3", "X1,X3", "X1,X3", "X1,X3",
"", "X1,X2,X3", "X1,X2,X3", "X8", ""), Type = c("New", "Old",
"New", "New", "New", "New", "New", "New", "New", "New", "New",
"New", "New", "Old")), class = "data.frame", row.names = c(NA,
-14L))
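For the value columns (A1 to KL), one way to get per-ID counts from this input is to split Var2 on commas, stack the pieces with Var1, and count each value once per ID. This is a base R sketch under assumptions: `count_values` and `cols` are hypothetical names, and the column order in `cols` is simply taken from the header of the desired output.

```r
# Input data, copied from the dput() output above.
DF <- data.frame(
  ID = c("IR-1", "IR-2", "IR-3", "IR-4", "IR-4", "IR-4", "IR-5", "IR-5",
         "IR-5", "IR-6", "IR-7", "IR-8", "IR-9", "IR-10"),
  Var1 = c("A1", "", "A2", "A1", "A2", "A3", "A1", "A2", "A3", "", "A2",
           "", "A2", ""),
  Var2 = c("X1,X2,X3", "", "X1,X4", "X1,X2,X3", "X1,X2,X3", "X1,X2,X3",
           "X1,X3", "X1,X3", "X1,X3", "", "X1,X2,X3", "X1,X2,X3", "X8", ""),
  Type = c("New", "Old", "New", "New", "New", "New", "New", "New", "New",
           "New", "New", "New", "New", "Old"),
  stringsAsFactors = FALSE
)

# Column order taken from the header of the desired output.
cols <- c("A1", "A2", "A3", "X1", "X2", "X3", "X4",
          "GT", "XN", "XP", "X8", "KP", "KL")

# Count, for each value in `cols`, the number of distinct IDs in `d`
# that carry it in Var1 or in the comma-separated Var2.
count_values <- function(d) {
  parts <- strsplit(d$Var2, ",", fixed = TRUE)
  long <- rbind(
    data.frame(ID = d$ID, value = d$Var1, stringsAsFactors = FALSE),
    data.frame(ID = rep(d$ID, lengths(parts)),
               value = as.character(unlist(parts)),
               stringsAsFactors = FALSE)
  )
  # Drop blanks, then keep each (ID, value) pair only once.
  long <- unique(long[long$value != "", ])
  vapply(cols, function(v) sum(long$value == v), integer(1))
}

print(count_values(DF))                     # value columns of the Total row
print(count_values(DF[DF$Var1 == "", ]))    # value columns of the Blank_Var1 row
```

On the sample data the first call gives 3, 5, 2, 6, 4, 5, 1 for A1 to X4 and 1 for X8, matching the Total row, and the second gives 1 each for X1, X2 and X3, matching the Blank_Var1 row.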
【Comments】:
-
Can you explain what you have tried so far and where you are stuck?
-
I'm pretty sure you can work out at least one of the conditions you want by yourself, so please show your effort before asking for help.
-
@maydin - Updated with the code I tried, which does not work.
-
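I can't understand this! Why is the Total for A1 2? Why is X1 6? It is too confusing. Your code is wrong from the very first step. What is DISTINCT?
-
What are GT, XP, KP, etc.?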