【Posted】: 2020-10-08 12:34:13
【Question】:
Hello, I have a data frame df1:
scaf_name coordinates value
JZSA01000001.1 1 2
JZSA01000001.1 2 2
JZSA01000001.1 3 2
JZSA01000001.1 4 2
JZSA01000001.1 5 2
JZSA01000001.1 6 2
JZSA01000001.1 7 2
JZSA01000001.1 8 2
JZSA01000001.1 9 2
JZSA01000001.1 10 2
JZSA01000001.1 11 5
JZSA01000001.1 12 5
JZSA01000001.1 13 5
JZSA01000001.1 14 5
JZSA01000001.1 15 5
JZSA01000001.1 16 5
JZSA01000001.1 17 5
JZSA01000001.1 18 6
JZSA01000002.1 1 2
JZSA01000002.1 2 2
JZSA01000002.1 3 2
JZSA01000002.1 4 2
JZSA01000002.1 5 2
JZSA01000002.1 6 2
JZSA01000003.1 1 5
JZSA01000003.1 2 5
JZSA01000003.1 3 6
JZSA01000003.1 4 6
JZSA01000003.1 5 6
JZSA01000003.1 6 6
JZSA01000003.1 7 6
JZSA01000003.1 8 6
JZSA01000003.1 9 6
and another one, df_interval:
scaffold start end
JZSA01000001.1_0 1 14
JZSA01000001.1_1 15 18
JZSA01000002.1 1 12
JZSA01000003.1_0 1 3
JZSA01000003.1_1 4 6
JZSA01000003.1_2 7 9
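I would like to change df1$scaf_name according to df_interval$start and df_interval$end.
For example:
every row whose df1$scaf_name is contained in df_interval$scaffold and whose df1$coordinates is between 1 and 14 should be renamed JZSA01000001.1_0.
Here I should get this output: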
scaf_name coordinates value
JZSA01000001.1_0 1 2
JZSA01000001.1_0 2 2
JZSA01000001.1_0 3 2
JZSA01000001.1_0 4 2
JZSA01000001.1_0 5 2
JZSA01000001.1_0 6 2
JZSA01000001.1_0 7 2
JZSA01000001.1_0 8 2
JZSA01000001.1_0 9 2
JZSA01000001.1_0 10 2
JZSA01000001.1_0 11 5
JZSA01000001.1_0 12 5
JZSA01000001.1_0 13 5
JZSA01000001.1_0 14 5
JZSA01000001.1_1 15 5
JZSA01000001.1_1 16 5
JZSA01000001.1_1 17 5
JZSA01000001.1_1 18 6
JZSA01000002.1 1 2
JZSA01000002.1 2 2
JZSA01000002.1 3 2
JZSA01000002.1 4 2
JZSA01000002.1 5 2
JZSA01000002.1 6 2
JZSA01000003.1_0 1 5
JZSA01000003.1_0 2 5
JZSA01000003.1_0 3 6
JZSA01000003.1_1 4 6
JZSA01000003.1_1 5 6
JZSA01000003.1_1 6 6
JZSA01000003.1_2 7 6
JZSA01000003.1_2 8 6
JZSA01000003.1_2 9 6
The df1 file is very large, so the faster the solution, the better. Thanks.
Data
df1
structure(list(scaf_name = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("JZSA01000001.1",
"JZSA01000002.1", "JZSA01000003.1"), class = "factor"), coordinates = c(1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L,
16L, 17L, 18L, 1L, 2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L, 4L, 5L, 6L,
7L, 8L, 9L), value = c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 6L, 2L, 2L, 2L, 2L, 2L, 2L,
5L, 5L, 6L, 6L, 6L, 6L, 6L, 6L, 6L)), class = "data.frame", row.names = c(NA,
-33L))
df_interval
structure(list(scaffold = structure(1:6, .Label = c("JZSA01000001.1_0",
"JZSA01000001.1_1", "JZSA01000002.1", "JZSA01000003.1_0", "JZSA01000003.1_1",
"JZSA01000003.1_2"), class = "factor"), start = c(1L, 15L, 1L,
1L, 4L, 7L), end = c(14L, 18L, 12L, 3L, 6L, 9L)), class = "data.frame", row.names = c(NA,
-6L))
I was given this solution:
library(data.table)
setDT(df1)[df_interval, scaf_name := scaffold,
on = .(coordinates >= start, coordinates <= end)]
but some scaf_name values are dropped from the output...
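One possible reason for the missing names: the join above matches on coordinates only, so a row can also match intervals that belong to other scaffolds and end up with one of their names. The following is a minimal sketch of a join that also matches on the scaffold, assuming the names in df_interval are the df1 names with an optional `_<digits>` suffix. The `base_name` column and the regular expression are hypothetical helpers, not part of the original data, and the toy tables are built with character columns rather than factors.

```r
library(data.table)

# Toy data mirroring the question (character columns, not factors)
df1 <- data.table(
  scaf_name   = rep(c("JZSA01000001.1", "JZSA01000002.1", "JZSA01000003.1"),
                    c(18, 6, 9)),
  coordinates = c(1:18, 1:6, 1:9),
  value       = c(rep(2L, 10), rep(5L, 7), 6L, rep(2L, 6), 5L, 5L, rep(6L, 7))
)
df_interval <- data.table(
  scaffold = c("JZSA01000001.1_0", "JZSA01000001.1_1", "JZSA01000002.1",
               "JZSA01000003.1_0", "JZSA01000003.1_1", "JZSA01000003.1_2"),
  start = c(1L, 15L, 1L, 1L, 4L, 7L),
  end   = c(14L, 18L, 12L, 3L, 6L, 9L)
)

# Hypothetical helper column: scaffold name without a trailing "_<digits>"
df_interval[, base_name := sub("_[0-9]+$", "", scaffold)]

# Update join: match on the base name AND on the coordinate range,
# then overwrite scaf_name with the interval's scaffold name
df1[df_interval, scaf_name := i.scaffold,
    on = .(scaf_name == base_name, coordinates >= start, coordinates <= end)]

print(df1[, .N, by = scaf_name])
```

Rows that fall inside no interval keep their original name, because the update only touches matched rows.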
Edit for Ronak
Here is the head of my equivalent of df_interval (called interval_tab here) after running the code I used:
setDT(interval_tab)[, scaf_name := sub("(?
> head(interval_tab)
scaffold start end scaf_name
1: KQ759765.1 1 1417 KQ759765.1
2: KQ759766.1 1 1389 KQ759766.1
3: KQ759767.1_0 1 23930 KQ759767.1
4: KQ759767.1_1 23931 83220 KQ759767.1
5: KQ759767.1_2 83221 92117 KQ759767.1
6: KQ759767.1_3 92118 92679 KQ759767.1
And here is the head of my equivalent of df1 (called tab here):
> head(tab)
V1 V2 V3
1: KQ759765.1 1 0
2: KQ759765.1 2 0
3: KQ759765.1 3 0
4: KQ759765.1 4 0
5: KQ759765.1 5 0
6: KQ759765.1 6 0
Then I used your code:
> setDT(tab)[interval_tab, scaf_name := scaffold,on = .(scaf_name, V2 >= start, V2 <= end)]
and got this error message:
Error in colnamesInt(x, names(on), check_dups = FALSE) :
argument specifying columns specify non existing column(s): cols[1]='scaf_name'
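The error message is consistent with the printed heads: tab has the columns V1, V2 and V3, so a join on a column called scaf_name cannot find it in tab. Below is a minimal sketch of one way to map the differently named columns in `on`, assuming V1 holds the scaffold name and V2 the coordinate. The rows beyond those shown in the heads above are made up for illustration.

```r
library(data.table)

# Small stand-ins for the tables printed above; rows not shown there are made up
interval_tab <- data.table(
  scaffold  = c("KQ759765.1", "KQ759767.1_0", "KQ759767.1_1"),
  start     = c(1L, 1L, 23931L),
  end       = c(1417L, 23930L, 83220L),
  scaf_name = c("KQ759765.1", "KQ759767.1", "KQ759767.1")
)
tab <- data.table(
  V1 = c("KQ759765.1", "KQ759765.1", "KQ759767.1", "KQ759767.1"),
  V2 = c(1L, 2L, 5L, 30000L),
  V3 = 0L
)

# In each condition the left side is a column of tab (x),
# the right side a column of interval_tab (i)
tab[interval_tab, V1 := i.scaffold,
    on = .(V1 == scaf_name, V2 >= start, V2 <= end)]

print(tab)
```

An alternative with the same effect would be to rename the columns of tab first, for example with `setnames(tab, c("scaf_name", "coordinates", "value"))`, and then join on `scaf_name` directly.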
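【Comments】:
- Could you clarify "every row whose df1$scaf_name is contained in df_interval$scaffold and whose df1$coordinates is between 1 and 14 should be renamed JZSA01000001.1_0"? Do you mean whenever the part of scaffold before the underscore matches scaf_name?
Tags: r dataframe dplyr data.table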