【Question title】: How to count matches between a vector and dataframe of sequence coordinates?
【Posted】: 2019-05-24 14:21:29
【Question】:

Given a data table containing the start and end coordinates of integer sequences:

library(data.table)
set.seed(1)

# Three stacked copies of the same 1,000,000 intervals: [1, 10], [11, 20], ...
df1 <- data.table(
  START = rep(seq(1, 10000000, 10), times = 3),
  END = rep(seq(10, 10000000, 10), times = 3)
)

And a vector of integers:

vec1 <- sample(1:100000, 10000)

How can I count the number of integers in vec1 that fall within the start and end coordinates of each sequence in df1? I'm currently using a for loop:

COUNT <- rep(NA_integer_, nrow(df1))
for (i in seq_len(nrow(df1))) {
  vec2 <- seq(from = df1$START[i], to = df1$END[i])
  # sum() gives 0 when nothing matches; table(...)[2] would give NA instead
  COUNT[i] <- sum(vec2 %in% vec1)
  print(i)
}
df1$COUNT <- COUNT

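However, the data table and vector I am applying this to are very large. Can anyone suggest a way to improve performance?

Any help would be greatly appreciated!

【Comments】:

    Tags: r performance data.table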


    【Solution 1】:

    One option is to use between:

    library(data.table)
    df1[, count := sum(between(vec1, START, END)), by = seq_len(nrow(df1))]
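
To make the row-wise `between` call concrete, here is a minimal sketch on the small example data that appears in Solution 3 (the names `toy` and `vals` are made up for illustration). It relies on `between()` being inclusive on both ends by default. Note that this approach still evaluates one group per row, so it may remain slow on tens of millions of rows.

```r
library(data.table)

# Toy data: three intervals and three values (hypothetical names)
toy <- data.table(START = c(1, 8, 11), END = c(4, 9, 30))
vals <- c(3, 2, 8)

# One group per row; between() includes both bounds by default
toy[, count := sum(between(vals, START, END)), by = seq_len(nrow(toy))]
toy$count
```

Here the first interval contains 2 and 3, the second contains 8, and the third contains nothing.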
    

    【Comments】:

      【Solution 2】:

      We can do this with a non-equi join:

      # Inclusive bounds: START <= val <= END
      df1[data.table(val = vec1),  count := .N, on = .(START <= val,
            END >= val), by = .EACHI]
      head(df1)
      

      If we want to get the output the other way round, with one row per interval, using @minem's example:

      # Treat each value as a point [val, val]; END <= END keeps the end inclusive
      data.table(START = vec1, END = vec1)[df1, .N, 
             on = .(START >= START, END <= END), by = .EACHI]
      #   START END N
      #1:     1   4 2
      #2:     8   9 1
      #3:    11  30 0
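
As a sanity check on the per-interval join, the sketch below compares it with a brute-force count on made-up data that includes a duplicated interval and an interval with no matches. The names `toy`, `vals`, `joined` and `brute` are hypothetical, and inclusive bounds on both ends are assumed.

```r
library(data.table)

# Hypothetical toy data: rows 1 and 3 are the same interval, row 4 matches nothing
toy <- data.table(START = c(1, 8, 1, 11), END = c(4, 9, 4, 30))
vals <- c(3, 2, 8, 9)

# Non-equi join: one output row per row of toy, .N = number of values inside it
joined <- data.table(START = vals, END = vals)[toy, .N,
  on = .(START >= START, END <= END), by = .EACHI]

# Brute-force reference: test every value against every interval
brute <- vapply(
  seq_len(nrow(toy)),
  function(i) sum(vals >= toy$START[i] & vals <= toy$END[i]),
  integer(1)
)

all(joined$N == brute)
```

If the two results agree on small cases like this, that gives some confidence before running the join on the full data.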
      

      【Comments】:

      • This doesn't seem to give the correct result when df1 contains multiple identical entries.
      【Solution 3】:
      ### example data:
      # df1 <- data.table(START = c(1, 8, 11), END = c(4, 9, 30))
      # vec1 <- c(3, 2, 8)
      
      library(data.table)
      df1[, ind := .I] # add a unique index to the data.table
      # represent each value as a zero-length interval [vec1, vec2]
      dt2 <- data.table(vec1 = vec1, vec2 = vec1)
      setkey(df1) # key on all columns (sorts df1; START and END come first)
      # Fast overlap join:
      ans1 = foverlaps(dt2, df1, by.x = c('vec1', 'vec2'), by.y = c('START', 'END'),
                       type = "within", nomatch = 0L)
      
      counts <- ans1[, .N, keyby = ind] # count by ind
      # merge back into the initial data
      df1[, COUNT := counts[df1, on = .(ind), x.N]]
      df1
      
      setorder(df1, ind) # reorder by ind to restore the initial order
      df1[, ind := NULL] # delete the ind column
      df1[is.na(COUNT), COUNT := 0L] # NA means a count of 0
      df1
      #    START END COUNT
      # 1:     1   4     2
      # 2:     8   9     1
      # 3:    11  30     0
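
For comparison, here is a base R sketch of a different technique that none of the answers above use: sort the values once, then use `findInterval()` to count how many values lie at or below each END and strictly below each START. The helper name `count_in_intervals` is hypothetical, and inclusive bounds on both ends are assumed.

```r
# Hypothetical helper, not taken from any of the answers
count_in_intervals <- function(start, end, values) {
  sorted <- sort(values)
  # Number of values <= end
  upto_end <- findInterval(end, sorted)
  # With left.open = TRUE: number of values strictly < start
  before_start <- findInterval(start, sorted, left.open = TRUE)
  upto_end - before_start
}

res <- count_in_intervals(c(1, 8, 11), c(4, 9, 30), c(3, 2, 8))
res
```

On a data.table this could be applied as `df1[, COUNT := count_in_intervals(START, END, vec1)]`. It avoids both the per-row grouping and the join, at the cost of one sort of the values.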
      

      【Comments】:
