Finding Specific Vector Entries in a Sliding Window: Answers

【Question title】: Finding Specific Vector Entries in a Sliding Window
【Posted】: 2016-10-17 06:29:46
【Question】:

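I am trying to write a function that returns the count of a specific pair of adjacent nucleotides (a C immediately followed by a G) within a given window of a sequence that I have stored as a vector.

I want the window to be 100 nucleotides long and to move forward 10 nucleotides at a time.

The data is set up like this (up to 10k entries):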

data <- c("a", "g", "t", "t", "g", "t", "t", "a", "g", "t", "c", "t",
          "a", "c", "g", "t", "g", "g", "a", "c", "c", "g", "a", "c")

So far I have tried:

library(zoo)
library(seqinr)
rollapply(data, width=100, by=10, FUN=count(data, wordsize=2))

But I always get the error

"Error in match.fun(FUN) : 
'count(data, 2)' is not a function, character or symbol"

I have also tried:

starts <- seq(1, length(data)-100, by = 100)
n <- length(starts)
for (i in 1:n){
    chunk <- data[starts[i]:(starts[i]+99)]
    chunkCG <- count(chunk,wordsize=2)
    print (chunkCG)
}

However, I don't know how to save the data that is returned. This approach also doesn't let me overlap the frames.

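【Comments】:

count(data, wordsize=2) is not a function; it is the result of calling one. You probably need FUN=function(x) count(x, wordsize=2), or possibly even ..., FUN=count, wordsize=2) in your rollapply call.
Do you want the number of "cg" pairs in rows 1:100, 101:200, and so on?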
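
The distinction drawn in the first comment, that FUN must be a function rather than the value returned by calling one, can be sketched in base R without zoo or seqinr. This is only an illustration: `count_cg` and `window_counts` are hypothetical helper names, not part of either package, and the toy window of 6 with a step of 3 is chosen to fit the 24-entry sample from the question.

```r
data <- c("a", "g", "t", "t", "g", "t", "t", "a", "g", "t", "c", "t",
          "a", "c", "g", "t", "g", "g", "a", "c", "c", "g", "a", "c")

# Hypothetical helper: number of positions where "c" is immediately followed by "g"
count_cg <- function(x) {
  n <- length(x)
  if (n < 2) return(0L)
  sum(x[-n] == "c" & x[-1] == "g")
}

# Hypothetical helper: apply FUN to every window of `width` entries,
# moving forward `by` entries each time
window_counts <- function(x, width, by, FUN) {
  stopifnot(length(x) >= width)
  FUN <- match.fun(FUN)  # the same check that rejected 'count(data, 2)' in the question
  starts <- seq(1, length(x) - width + 1, by = by)
  counts <- vapply(starts, function(s) FUN(x[s:(s + width - 1)]), integer(1))
  data.frame(start = starts, end = starts + width - 1, count = counts)
}

# Pass the function itself, not count_cg(data)
res <- window_counts(data, width = 6, by = 3, FUN = count_cg)
print(res)
```

For the setting in the question, the call would be `window_counts(data, width = 100, by = 10, FUN = count_cg)` on the full 10k-entry vector.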

Tags: r dna-sequence sliding-window

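【Solution 1】:

Edit: To get the desired output with a window that slides by 10 observations, you can use a for loop. Because the result is preallocated, the loop is reasonably fast. I think this is the best way to solve your problem, since I don't think many grouping approaches (if any) support a sliding window: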

library(data.table)
set.seed(1)
# Sample data: 600 random nucleotides
df <- data.frame(var = sample(c("a", "g", "t", "c"), 600, replace = TRUE))

# Number of 100-wide windows when shifting by 10 each time
n_windows <- floor((nrow(df) - 100) / 10) + 1

# Create the result up front; preallocating speeds up the loop below
res <- data.frame(window = rep(NA, n_windows), count_cg = rep(NA, n_windows))

# For each window, paste every nucleotide onto its successor (lead) and count the "cg" pairs
for (i in seq_len(n_windows)) {
  first <- (i - 1) * 10 + 1
  res$window[i] <- paste0(first, "-", first + 99)
  subs <- as.character(df[first:(first + 99), "var"])
  # Drop the last element: it has no successor inside the window and pairs with NA
  subs2 <- paste0(subs, shift(subs, 1L, type = "lead"))[-length(subs)]
  res$count_cg[i] <- sum(subs2 == "cg")
}
head(res)
  window count_cg
1  1-100       10
2 11-110       10
3 21-120        8
4 31-130        9
5 41-140        9
6 51-150        9
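
Because consecutive windows overlap heavily, the same counts can also be obtained without a loop. A minimal sketch, assuming the sequence is a character vector of single nucleotides: mark each position where "c" is immediately followed by "g", take a cumulative sum, and subtract. A window of `width` nucleotides starting at position `s` contains the `width - 1` pairs that begin at positions `s` to `s + width - 2`. The name `cg_window_counts` is made up for illustration, and the random sequence below is not guaranteed to be the one used in the answer, so the counts need not match the table above.

```r
cg_window_counts <- function(x, width = 100, by = 10) {
  stopifnot(is.character(x), width >= 2, length(x) >= width)
  n <- length(x)
  # is_cg[i] is TRUE when x[i] == "c" and x[i + 1] == "g"
  is_cg <- x[-n] == "c" & x[-1] == "g"
  # cs[k + 1] = number of "cg" pairs starting at positions 1..k, with cs[1] = 0
  cs <- c(0L, cumsum(is_cg))
  starts <- seq(1, n - width + 1, by = by)
  # Pairs starting at s .. s + width - 2 lie entirely inside the window
  counts <- cs[starts + width - 1] - cs[starts]
  data.frame(window = paste0(starts, "-", starts + width - 1),
             count_cg = counts)
}

set.seed(1)
x <- sample(c("a", "g", "t", "c"), 600, replace = TRUE)
res <- cg_window_counts(x)
head(res)
```

The cumulative sum is computed once over the whole sequence, so the cost no longer grows with the window width, which may matter for sequences much longer than 10k entries.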

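【Comments】:

I actually want to count 1:100, 11:110, 21:120, and so on.

【Solution 2】:

Your approach doesn't overlap because the window starts are generated with by = 100. Otherwise it looks fine. Just change that to 10.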

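To save the data from your last attempt, create a vector that collects the results; you can then use name indexing to extract the right count from each window.

library(seqinr)

# Window starts every 10 positions; each window covers 100 entries
starts <- seq(1, length(data) - 99, by = 10)
n <- length(starts)

# Preallocate an integer vector for the "cg" counts
counted_cg <- integer(n)

for (i in seq_len(n)) {
    chunk <- data[starts[i]:(starts[i] + 99)]
    chunkCG <- count(chunk, wordsize = 2)
    counted_cg[i] <- chunkCG[["cg"]]  # index the count table by name
}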
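
The name-indexing idea extends to keeping every dinucleotide count per window, not only "cg". The sketch below is a base-R stand-in for the kind of named counts that count(chunk, wordsize = 2) returns, written so that it runs without seqinr. `dinuc_counts` is a hypothetical helper, the alphabet is assumed to be lowercase a, c, g, t, and the window of 6 with a step of 3 is a toy setting for the 24-entry sample (the question uses 100 and 10).

```r
# All 16 dinucleotides over the assumed alphabet
alphabet <- c("a", "c", "g", "t")
words <- sort(as.vector(outer(alphabet, alphabet, paste0)))

# Hypothetical helper: counts of overlapping adjacent pairs, as a named integer vector
dinuc_counts <- function(x) {
  n <- length(x)
  pairs <- paste0(x[-n], x[-1])
  tab <- table(factor(pairs, levels = words))
  setNames(as.integer(tab), names(tab))
}

data <- c("a", "g", "t", "t", "g", "t", "t", "a", "g", "t", "c", "t",
          "a", "c", "g", "t", "g", "g", "a", "c", "c", "g", "a", "c")
width <- 6
by <- 3
starts <- seq(1, length(data) - width + 1, by = by)

# One row per window, one named column per dinucleotide
all_counts <- t(vapply(starts,
                       function(s) dinuc_counts(data[s:(s + width - 1)]),
                       integer(length(words))))
dimnames(all_counts) <- list(paste0(starts, "-", starts + width - 1), words)

# Extract a single dinucleotide by name
counted_cg <- all_counts[, "cg"]
print(counted_cg)
```

Keeping the full matrix means a different pair, for example `all_counts[, "gc"]`, can be looked up later without rerunning the loop.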

【Comments】: