How do I check the common occurrences of one column value category among the other remaining column values in R?

【Question title】: How do I check the common occurrences of one type of column value category for the other remaining column values in R?
【Posted】: 2021-11-13 14:55:56
【Question description】:

Dummy data:

set.seed(4)
name <- sample(LETTERS[1:8], 500, replace = TRUE)
id <- round(runif(500, min=1, max=200))

df <- data.frame(name, id)

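I want to check what percentage of the unique ids in B also occur under each of the other remaining names.

The expected output would look like this: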

name  count pct_common
  <chr> <int>      <dbl>
1 A        17       29.3
2 C        18       31.0
3 D        16       27.6
4 E        22       37.9
5 F        14       24.1
6 G        16       27.6
7 H        20       34.5
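
To make the definition concrete, here is a minimal base R sketch of the calculation for a single pair of names. The helper name pct_common_one and the toy data frame are assumptions made for illustration and do not appear in the question. The sketch counts rows rather than distinct ids, following the n() used in the question's own code.

```r
# Hypothetical helper: rows of `other` whose id also occurs under `ref`,
# expressed as a percentage of the number of unique ids under `ref`.
pct_common_one <- function(data, ref, other) {
  ref_ids <- unique(data$id[data$name == ref])               # unique ids seen under `ref`
  hits    <- sum(data$name == other & data$id %in% ref_ids)  # matching rows of `other`
  100 * hits / length(ref_ids)
}

# Toy data (an assumption): B has the unique ids 1 and 2
toy <- data.frame(
  name = c("B", "B", "B", "A", "A", "C"),
  id   = c(1, 2, 2, 1, 3, 2)
)

pct_common_one(toy, ref = "B", other = "A")  # one A row (id 1) out of 2 unique B ids: 50
```

On the dummy data above, the same call would be pct_common_one(df, "B", "A"), which corresponds to the row for A in the expected output.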

My approach so far:

library(dplyr)

the_name <- 'B'

#Selecting the unique name, id combination for 'B'

df %>%
  filter(name %in% the_name) %>%
  distinct(name, id) -> list_id

#Checking which of these ids are already there for other names and then count them.

df %>%
  filter( id %in% list_id$id) %>%
  filter(!name %in% the_name) %>%
  group_by(name) %>%
  summarise(count=n()) %>%
  mutate(pct_common= count/nrow(list_id)*100)

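It gets the job done, but creating a separate data frame like this doesn't seem very elegant. Also, it takes more time for a comparatively large data frame (millions of observations).

Is there a better way to approach this?

【Question comments】:

Tags: r dataframe dplyr subset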

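【Solution 1】:

We can do this in a single pipe. Create a logical vector ('i1') as a column, use 'i1' to get the number of distinct 'id' values as 'n1', then filter and count in a single step, and get the percentage by dividing 'count' by 'n1'.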

library(dplyr)
df %>%
    mutate(i1 = name %in% the_name, n1 = n_distinct(id[i1])) %>% 
    filter(id %in% id[i1], !i1) %>% 
    count(name, n1, name = 'count') %>%
    mutate(pct_common= count/n1*100, n1 = NULL)

-Output

name count pct_common
1    A    17   29.31034
2    C    18   31.03448
3    D    16   27.58621
4    E    22   37.93103
5    F    14   24.13793
6    G    16   27.58621
7    H    20   34.48276

Note: the OP wrote, "It is getting the job done but creating a separate data frame like this doesn't seem very elegant. Also, it is taking more time for a comparatively large data frame". The code above does it in 5 steps and does not perform the same computation, i.e. name %in% the_name, more than once.
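
The idea of evaluating the membership test only once can also be sketched in base R. This is an illustrative sketch, not part of the answer's code; the function name common_counts and the toy data are assumptions made for the example.

```r
# Hypothetical helper: for every name other than `the_name`, count the rows
# whose id also occurs under `the_name`, as a share of that name's unique ids.
common_counts <- function(data, the_name) {
  is_ref  <- data$name %in% the_name         # computed once, reused twice below
  ref_ids <- unique(data$id[is_ref])         # unique ids of the reference name
  keep    <- !is_ref & data$id %in% ref_ids  # rows of other names sharing an id
  counts  <- table(as.character(data$name[keep]))
  data.frame(
    name       = names(counts),
    count      = as.integer(counts),
    pct_common = 100 * as.integer(counts) / length(ref_ids)
  )
}

# Toy data (an assumption): B has the unique ids 1 and 2
toy <- data.frame(
  name = c("B", "B", "B", "A", "A", "C"),
  id   = c(1, 2, 2, 1, 3, 2)
)

res <- common_counts(toy, "B")
res
```

With the dummy data from the question, the call would be common_counts(df, the_name).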

If the data is really large, the collapse package can be used:

library(collapse)
n1 <- fndistinct(df$id[df$name %in% the_name])
ss(df, id %in% id[name %in% the_name] & !name %in% the_name) %>% 
    fnobs(g = .$name, drop = FALSE) %>% 
    tfm(pct_common = 100 *name/n1) %>%
   frename(name = count) %>%
   tfm(id = NULL)

-Output

  count pct_common
A    17   29.31034
C    18   31.03448
D    16   27.58621
E    22   37.93103
F    14   24.13793
G    16   27.58621
H    20   34.48276
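
Whether a given approach is fast enough depends on the real data, so it may be worth timing candidates on data of a similar size. Below is a minimal sketch using base R's system.time; the simulated size (one million rows, 200,000 possible ids) is an assumption, and the timed expression is a plain base R version of the calculation rather than the collapse code above.

```r
# Simulated data of an assumed size; adjust n and the id range to the real case
set.seed(1)
n   <- 1e6
big <- data.frame(
  name = sample(LETTERS[1:8], n, replace = TRUE),
  id   = sample.int(2e5, n, replace = TRUE)
)
the_name <- "B"

timing <- system.time({
  is_ref  <- big$name %in% the_name                           # membership test, done once
  ref_ids <- unique(big$id[is_ref])                           # unique ids of the reference name
  counts  <- table(big$name[!is_ref & big$id %in% ref_ids])   # row counts per other name
  pct     <- 100 * counts / length(ref_ids)
})

timing["elapsed"]  # wall-clock seconds; varies by machine
```

The same wrapper can be put around any of the pipelines in this thread to compare them on identical input.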

【Comments】:

【Solution 2】:

Here is another option:

library(dplyr)

df %>%
  mutate(temp = n_distinct(id[name %in% the_name])) %>%
  filter(id %in% unique(id[name %in% the_name]) & !name %in% the_name) %>%
  group_by(name, temp) %>%
  summarise(count = n(), .groups = 'drop') %>%
  mutate(pct_common = count/temp * 100) %>%
  select(-temp)

#  name  count pct_common
#  <chr> <int>      <dbl>
#1 A        17       29.3
#2 C        18       31.0
#3 D        16       27.6
#4 E        22       37.9
#5 F        14       24.1
#6 G        16       27.6
#7 H        20       34.5

【Comments】:

【Solution 3】:

This is not better, but I thought of another approach; maybe it will help you.

df %>% 
  left_join(
    df %>% 
      filter(name == "B") %>% 
      mutate(B = 1,N = n_distinct(id)) %>% 
      select(-name) %>% 
      distinct(),
    by = "id"  # explicit join key; id is the only column shared by both tables
  ) %>% 
  group_by(name) %>% 
  summarise(
    count = sum(B,na.rm = TRUE),
    N = mean(N,na.rm = TRUE)
    ) %>%
  ungroup() %>% 
  mutate(pct_common= 100*count/N)


  name  count     N pct_common
  <chr> <dbl> <dbl>      <dbl>
1 A        17    58       29.3
2 B        65    58      112. 
3 C        18    58       31.0
4 D        16    58       27.6
5 E        22    58       37.9
6 F        14    58       24.1
7 G        16    58       27.6
8 H        20    58       34.5
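
The output above also contains a row for B itself. There, count is the number of B rows (65) rather than a number of shared ids, which is why the percentage exceeds 100. A minimal sketch of the same join with the reference name dropped at the end, run on hypothetical toy data that is not part of the answer:

```r
library(dplyr)

# Toy data (an assumption): B has the unique ids 1 and 2
toy <- data.frame(
  name = c("B", "B", "B", "A", "A", "C"),
  id   = c(1, 2, 2, 1, 3, 2)
)

result <- toy %>%
  left_join(
    toy %>%
      filter(name == "B") %>%
      mutate(B = 1, N = n_distinct(id)) %>%  # flag B's ids, store their distinct count
      select(-name) %>%
      distinct(),
    by = "id"
  ) %>%
  group_by(name) %>%
  summarise(
    count = sum(B, na.rm = TRUE),   # rows whose id also occurs under B
    N     = mean(N, na.rm = TRUE)   # number of unique B ids
  ) %>%
  mutate(pct_common = 100 * count / N) %>%
  filter(name != "B")               # drop the reference name's own row

result
```

Filtering name != "B" before the join instead would give the same rows while joining fewer of them.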

【Comments】: