【Question title】: How do I check the common occurrences of one type of column value category for the other remaining column values in R?
【Posted】: 2021-11-13 14:55:56
【Question】:

Dummy data:

set.seed(4)
name <- sample(LETTERS[1:8], 500, replace = TRUE)
id <- round(runif(500, min=1, max=200))

df <- data.frame(name, id)

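I want to check what percentage of the unique ids in B are also present for each of the other remaining names.

The expected output would look like this: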

name  count pct_common
  <chr> <int>      <dbl>
1 A        17       29.3
2 C        18       31.0
3 D        16       27.6
4 E        22       37.9
5 F        14       24.1
6 G        16       27.6
7 H        20       34.5

My approach so far:

library(dplyr)

the_name <- 'B'

#Selecting the unique name, id combination for 'B'

df %>%
  filter(name %in% the_name) %>%
  distinct(name, id) -> list_id

#Checking which of these ids are already there for other names and then count them.

df %>%
  filter( id %in% list_id$id) %>%
  filter(!name %in% the_name) %>%
  group_by(name) %>%
  summarise(count=n()) %>%
  mutate(pct_common= count/nrow(list_id)*100)

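It gets the job done, but creating a separate data frame like this doesn't seem very elegant. It also takes more time on a comparatively large data frame (millions of observations).

Is there a better way to solve this?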

【Comments】:

    Tags: r dataframe dplyr subset


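    【Solution 1】:

    We can do this in a single pipe. Create a logical vector ('i1') as a column, get the number of distinct elements of 'id' where 'i1' is TRUE as 'n1', then do the filter and count in one go, and get the percentage by dividing 'count' by 'n1'.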

    library(dplyr)
    df %>%
        mutate(i1 = name %in% the_name, n1 = n_distinct(id[i1])) %>% 
        filter(id %in% id[i1], !i1) %>% 
        count(name, n1, name = 'count') %>%
        mutate(pct_common= count/n1*100, n1 = NULL)
    

    -output

    name count pct_common
    1    A    17   29.31034
    2    C    18   31.03448
    3    D    16   27.58621
    4    E    22   37.93103
    5    F    14   24.13793
    6    G    16   27.58621
    7    H    20   34.48276
    

    Note: the OP wrote "It is getting the job done but creating a separate data frame like this doesn't seem very elegant. Also, it is taking more time for a comparatively large data frame". The code above does this in 5 steps and without doing the same computation, i.e. name %in% the_name, multiple times.
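
To make the quantities concrete, here is a minimal base R sketch on a small hand-made data frame. The `toy` data and the names `is_target`, `target_ids` and `keep` are illustrative assumptions, not part of the original answer. It computes the `name %in% the_name` mask once and reuses it, which is the point made in the note above.

```r
# Toy data (hypothetical): under "B" the unique ids are 1 and 2
toy <- data.frame(
  name = c("B", "B", "B", "A", "A", "C", "C", "C"),
  id   = c(1, 2, 2, 1, 3, 2, 2, 4),
  stringsAsFactors = FALSE
)
the_name <- "B"

is_target  <- toy$name %in% the_name       # computed once, reused below
target_ids <- unique(toy$id[is_target])    # unique ids seen under the target name
keep       <- !is_target & toy$id %in% target_ids

counts <- table(toy$name[keep])            # number of matching rows per remaining name
res <- data.frame(
  name       = names(counts),
  count      = as.integer(counts),
  pct_common = as.integer(counts) / length(target_ids) * 100
)
res
# A: count 1, pct_common 50; C: count 2, pct_common 100
```

Like the code in the question, this counts rows rather than distinct ids, so a repeated (name, id) pair is counted more than once (C has id 2 twice here) and `pct_common` can exceed 100 when a name repeats a matching id often enough.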


    If the data is really big, collapse can be used

    library(collapse)
    n1 <- fndistinct(df$id[df$name %in% the_name])
    ss(df, id %in% id[name %in% the_name] & !name %in% the_name) %>%
        fnobs(g = .$name, drop = FALSE) %>%
        tfm(pct_common = 100 * name/n1) %>%
        frename(name = count) %>%
        tfm(id = NULL)

    -output

      count pct_common
    A    17   29.31034
    C    18   31.03448
    D    16   27.58621
    E    22   37.93103
    F    14   24.13793
    G    16   27.58621
    H    20   34.48276
    

    【Discussion】:

      【Solution 2】:

      Here is another option -

      library(dplyr)
      
      df %>%
        mutate(temp = n_distinct(id[name %in% the_name])) %>%
        filter(id %in% unique(id[name %in% the_name]) & !name %in% the_name) %>%
        group_by(name, temp) %>%
        summarise(count = n(), .groups = 'drop') %>%
        mutate(pct_common = count/temp * 100) %>%
        select(-temp)
      
      #  name  count pct_common
      #  <chr> <int>      <dbl>
      #1 A        17       29.3
      #2 C        18       31.0
      #3 D        16       27.6
      #4 E        22       37.9
      #5 F        14       24.1
      #6 G        16       27.6
      #7 H        20       34.5
      

      【Discussion】:

        【Solution 3】:

        This is not better, but I thought of another approach; maybe it will help you

        df %>% 
          left_join(
            df %>% 
              filter(name == "B") %>% 
              mutate(B = 1,N = n_distinct(id)) %>% 
              select(-name) %>% 
              distinct(),
            by = "id"   # join key made explicit; "id" is the only shared column
          ) %>% 
          group_by(name) %>% 
          summarise(
            count = sum(B,na.rm = TRUE),
            N = mean(N,na.rm = TRUE)
            ) %>%
          ungroup() %>% 
          mutate(pct_common= 100*count/N)
        
        
          name  count     N pct_common
          <chr> <dbl> <dbl>      <dbl>
        1 A        17    58       29.3
        2 B        65    58      112. 
        3 C        18    58       31.0
        4 D        16    58       27.6
        5 E        22    58       37.9
        6 F        14    58       24.1
        7 G        16    58       27.6
        8 H        20    58       34.5
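
The output above still contains a row for B itself (65 / 58 gives about 112%), which the expected output in the question excludes. A minimal sketch of the same join-based idea that drops the target name before summarising, shown on hypothetical toy data (`toy`, `target` and `in_target` are names made up for illustration):

```r
library(dplyr)

# Toy data (hypothetical): under "B" the unique ids are 1 and 2
toy <- data.frame(
  name = c("B", "B", "B", "A", "A", "C", "C", "C"),
  id   = c(1, 2, 2, 1, 3, 2, 2, 4),
  stringsAsFactors = FALSE
)
the_name <- "B"

# Distinct ids seen under the target name, flagged so the join can mark matches
target <- toy %>%
  filter(name %in% the_name) %>%
  distinct(id) %>%
  mutate(in_target = TRUE)

res <- toy %>%
  filter(!name %in% the_name) %>%            # drop the target name itself
  left_join(target, by = "id") %>%           # non-matching rows get in_target = NA
  group_by(name) %>%
  summarise(count = sum(in_target, na.rm = TRUE), .groups = "drop") %>%
  mutate(pct_common = 100 * count / nrow(target))

res
# A: count 1, pct_common 50; C: count 2, pct_common 100
```

Unlike the filter-based answers, a name that shares no id with the target would still appear here, with a count of 0.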
        

        【Discussion】:
