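【Posted】: 2021-02-18 19:54:44
【Problem description】:
I have data in which an id variable is supposed to identify a unique observation. However, some ids are repeated. I want to find out which measurements are driving this duplication by grouping by id and then computing, for each variable, the proportion of ids with inconsistent responses.
Here is an example of what I mean: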
library(tidyverse)
df <- tibble(
  id   = c(1, 1, 2, 3, 4, 4, 4),
  col1 = c('a', 'a', 'b', 'b', 'c', 'c', 'c'), # perfectly consistent
  col2 = c('a', 'b', 'b', 'b', 'c', 'c', 'c'), # id 1 is inconsistent - proportion inconsistent = 0.25
  col3 = c('a', 'a', 'b', 'b', 'a', 'b', 'c'), # id 4 is inconsistent - proportion inconsistent = 0.25
  col4 = c('a', 'b', 'b', 'b', 'b', 'b', 'c')  # id 1 and 4 are inconsistent - proportion inconsistent = 0.5
)
I can use group_by(), across(), and n_distinct() to test for inconsistent responses within each id, as follows:
# count the number of distinct responses for each id in each column
# if the value is equal to 1, it means that all responses were consistent
df <- df %>%
  group_by(id) %>%
  mutate(across(.cols = c(col1:col4), ~n_distinct(.), .names = '{.col}_distinct')) %>%
  ungroup()
For simplicity, I can now keep a single row per id:
# take one row for each id (so we aren't counting duplicates twice)
df <- distinct(df, across(c(id, contains('distinct'))))
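Now I want to compute, for each variable, the proportion of ids that contain inconsistent responses. I would like to do something like the following: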
consistency <- df %>%
  summarise(across(contains('distinct'), ~sum(. > 1) / n(.)))
But this produces the following error, which I cannot explain:
Error: Problem with `summarise()` input `..1`.
x unused argument (.)
ℹ Input `..1` is `across(contains("distinct"), ~sum(. > 1)/n(.))`.
I can get the answer I want as follows:
# calculate the inconsistency proportion for each column: count the ids with
# more than one distinct value, then divide by the total number of rows
# first count the ids with more than one distinct value
n_inconsistent <- df %>%
  summarise(across(.cols = contains('distinct'), ~sum(. > 1)))
# next get the number of rows (one per id after distinct())
n_total <- nrow(df)
# calculate the proportion of ids that have more than one value for each column
consistency <- n_inconsistent %>%
  mutate(across(contains('distinct'), ~. / n_total))
But this relies on intermediate variables and feels inelegant.
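For reference, here is a minimal sketch of a single-pipeline version. It rests on two assumptions that are not part of the original attempt. First, the error most likely comes from `n(.)`: dplyr's `n()` takes no arguments, which matches the "unused argument (.)" message. Second, taking `mean()` of a logical vector gives the same proportion as `sum(. > 1) / n()`. The sketch rebuilds the example `df` from scratch (the code above overwrites it) and uses `summarise()` per id in place of `mutate()` followed by `distinct()`.

```r
library(dplyr)

# Rebuild the example data from the question (the original df was overwritten above).
df <- tibble(
  id   = c(1, 1, 2, 3, 4, 4, 4),
  col1 = c('a', 'a', 'b', 'b', 'c', 'c', 'c'),
  col2 = c('a', 'b', 'b', 'b', 'c', 'c', 'c'),
  col3 = c('a', 'a', 'b', 'b', 'a', 'b', 'c'),
  col4 = c('a', 'b', 'b', 'b', 'b', 'b', 'c')
)

consistency <- df %>%
  group_by(id) %>%
  # one row per id: number of distinct responses in each column
  summarise(across(col1:col4, ~ n_distinct(.x), .names = '{.col}_distinct')) %>%
  # the result is ungrouped (id was the only grouping variable), so this
  # collapses to one row: proportion of ids with more than one distinct response
  summarise(across(contains('distinct'), ~ mean(.x > 1)))

consistency
```

If the explicit count-over-total form is preferred, `~ sum(.x > 1) / n()` (with no argument passed to `n()`) should behave the same way inside the second `summarise()`.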
【Discussion】: