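【Posted】: 2019-08-21 15:45:45
【Question】:
I have two data frames in R. The first has two columns, "chr" and "position". The second has three columns: "chr", "start", and "end". I want to select the rows of the first data frame whose chr value matches a row of the second data frame and whose "position" falls within that row's interval from start to end (both ends included).
To do this, I wrote a function in R that gives me the output I want, but it is very slow when I run it on large data frames.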
# My DataFrames are:
bed <- data.frame(Chr = c(rep("chr1", 4), rep("chr2", 3), rep("chr3", 1)),
                  x1 = c(5, 20, 44, 67, 5, 20, 44, 20),
                  x3 = c(12, 43, 64, 94, 12, 43, 64, 63))
snv <- data.frame(Chr = c(rep("chr1", 6), rep("chr3", 6)),
                  position = c(5, 18, 46, 60, 80, 90, 21, 60, 75, 80, 84, 87))
# My function is:
get_overlap <- function(df, position, chrom){
  overlap <- FALSE
  for (row in 1:nrow(df)){
    chr <- df[row, 1]
    start <- df[row, 2]
    end <- df[row, 3]
    if (chr == chrom && position %in% seq(start, end)){
      overlap <- TRUE
    }
  }
  return(overlap)
}
# The code is:
overlap_vector <- c()
for (row in 1:nrow(snv)){
  chrom <- snv[row, 1]
  position <- snv[row, 2]
  overlap <- get_overlap(bed, position, chrom)
  overlap_vector <- c(overlap_vector, overlap)
}
print(snv[overlap_vector, ])
How can I make this more efficient? I have never used hash tables; could they be the answer?
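One possible direction, offered only as a sketch: group the intervals by chromosome once, then test each position against its chromosome's intervals with vectorized comparisons instead of building `seq(start, end)` for every pair of rows. The helper name `overlaps_any` is hypothetical and not part of the original post. The sketch assumes intervals are closed at both ends (which matches `%in% seq(start, end)` for integer positions) and sets `stringsAsFactors = FALSE` so that chromosome names are compared as plain strings.

```r
# Same example data as in the question; stringsAsFactors = FALSE is an
# assumption added here so that Chr is a character column.
bed <- data.frame(Chr = c(rep("chr1", 4), rep("chr2", 3), rep("chr3", 1)),
                  x1 = c(5, 20, 44, 67, 5, 20, 44, 20),
                  x3 = c(12, 43, 64, 94, 12, 43, 64, 63),
                  stringsAsFactors = FALSE)
snv <- data.frame(Chr = c(rep("chr1", 6), rep("chr3", 6)),
                  position = c(5, 18, 46, 60, 80, 90, 21, 60, 75, 80, 84, 87),
                  stringsAsFactors = FALSE)

# Hypothetical helper: returns one logical per row of `snv`, TRUE when the
# position lies inside at least one interval on the same chromosome.
overlaps_any <- function(snv, bed) {
  hit <- logical(nrow(snv))             # defaults to FALSE
  bed_by_chr <- split(bed, bed$Chr)     # named list keyed by chromosome
  for (chrom in intersect(unique(snv$Chr), names(bed_by_chr))) {
    idx <- which(snv$Chr == chrom)      # snv rows on this chromosome
    iv <- bed_by_chr[[chrom]]           # intervals on this chromosome
    hit[idx] <- vapply(snv$position[idx],
                       function(p) any(iv$x1 <= p & p <= iv$x3),
                       logical(1))
  }
  hit
}

overlap_vector <- overlaps_any(snv, bed)
result <- snv[overlap_vector, ]
print(result)
```

The named list produced by `split()` plays roughly the role a hash table would: each chromosome's intervals are looked up by name once rather than rescanned for every position. For very large inputs, dedicated interval-overlap tools such as `data.table::foverlaps` or `GenomicRanges::findOverlaps` may be worth evaluating, though they are not benchmarked here.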
【Comments】:
Tags: r performance dataframe