Split overlapping intervals into non-overlapping intervals, within values of an identifier

【Question title】: Split overlapping intervals into non-overlapping intervals, within values of an identifier
【Posted】: 2019-09-14 03:04:22
【Question】:

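I would like to take a set of possibly overlapping intervals, within categories of an identifier, and create new intervals that are either exactly overlapping (i.e. same start/end values) or completely non-overlapping. These new intervals should collectively span the range of the original intervals and should not include any range that is not in the original intervals.

This needs to be a relatively fast operation because I am working with a lot of data.

Here is some sample data: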

library(data.table)
set.seed(1113)
start1 <- c(1,7,9, 17, 18,1,3,20)
end1 <- c(10,12,15, 20, 23,3,5,25)
id1 <- c(1,1,1,1,1,2,2,2)
obs <- rnorm(length(id1))
x <- data.table(start1,end1,id1,obs)

    > x
   start1 end1 id1         obs
1:      1   10   1 -0.79701638
2:      7   12   1 -0.09251333
3:      9   15   1 -0.08118742
4:     17   20   1 -2.33312797
5:     18   23   1  0.26581138
6:      1    3   2 -0.34314127
7:      3    5   2 -0.17196880
8:     20   25   2  0.11614842

The output should look like this:

    id1 start1 end1 i.start1 i.end1         obs
 1:   1      1    6        1     10 -0.79701638
 2:   1      7    8        1     10 -0.79701638
 3:   1      7    8        7     12 -0.09251333
 4:   1      9   10        1     10 -0.79701638
 5:   1      9   10        7     12 -0.09251333
 6:   1      9   10        9     15 -0.08118742
 7:   1     11   12        7     12 -0.09251333
 8:   1     11   12        9     15 -0.08118742
 9:   1     13   15        9     15 -0.08118742
10:   1     17   17       17     20 -2.33312797
11:   1     18   20       17     20 -2.33312797
12:   1     18   20       18     23  0.26581138
13:   1     21   23       18     23  0.26581138
14:   2      1    2        1      3 -0.34314127
15:   2      3    3        1      3 -0.34314127
16:   2      3    3        3      5 -0.17196880
17:   2      4    5        3      5 -0.17196880
18:   2     20   25       20     25  0.11614842

I found this algorithm, which does what I want: https://softwareengineering.stackexchange.com/questions/363091/split-overlapping-ranges-into-all-unique-ranges?newreg=93383e379afe4dd3a595480528ee1541

I tried programming it directly, but it was slow.

【Comments】:

Tags: r data.table plyr

【Solution 1】:

Here is another option.

#borrowing idea from https://stackoverflow.com/a/28938694/1989480
#group overlapping intervals together
x[, g := c(0L, cumsum(shift(start, -1L) > cummax(end))[-.N]), by=.(id)]

#cut those intervals into non-overlapping ones
itvl <- x[, {
    s <- sort(c(start - 1L, start, end, end + 1L))
    as.data.table(matrix(s[s %between% c(min(start), max(end))], ncol=2L, byrow=TRUE))
    }, by=.(id, g)]

#get OP's desired output using non-equi join
x[itvl, on=.(id, start<=V1, end>=V1),
    .(id1=id, start1=V1, end1=V2, i.start1=x.start, i.end1=x.end, obs),
    allow.cartesian=TRUE]

Output:

    id1 start1 end1 i.start1 i.end1         obs
 1:   1      1    6        1     10 -0.79701638
 2:   1      7    8        1     10 -0.79701638
 3:   1      7    8        7     12 -0.09251333
 4:   1      9   10        1     10 -0.79701638
 5:   1      9   10        7     12 -0.09251333
 6:   1      9   10        9     15 -0.08118742
 7:   1     11   12        7     12 -0.09251333
 8:   1     11   12        9     15 -0.08118742
 9:   1     13   15        9     15 -0.08118742
10:   1     17   17       17     20 -2.33312797
11:   1     18   20       17     20 -2.33312797
12:   1     18   20       18     23  0.26581138
13:   1     21   23       18     23  0.26581138
14:   2      1    2        1      3 -0.34314127
15:   2      3    3        1      3 -0.34314127
16:   2      3    3        3      5 -0.17196880
17:   2      4    5        3      5 -0.17196880
18:   2     20   25       20     25  0.11614842

Data:

library(data.table)
set.seed(1113)
id <- c(1,1,1,1,1,2,2,2)
x <- data.table(start=c(1,7,9, 17, 18,1,3,20),
    end=c(10,12,15, 20, 23,3,5,25),
    id=id,
    obs=rnorm(length(id)))
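
As a rough illustration of the grouping step used above (not part of the original answer), the g column can be inspected on its own. The sketch below reuses only the id 1 rows of the sample data, without the obs column, and it assumes the rows are already ordered by start within each id, which the shift/cummax comparison appears to rely on.

```r
library(data.table)

# Assumption: only the id 1 rows of the sample data, already sorted by start
x <- data.table(id = c(1, 1, 1, 1, 1),
                start = c(1, 7, 9, 17, 18),
                end = c(10, 12, 15, 20, 23))

# A new cluster begins whenever the next start exceeds the running maximum end
x[, g := c(0L, cumsum(shift(start, -1L) > cummax(end))[-.N]), by = .(id)]
print(x)
```

Rows 1 to 3 chain together through their overlaps and receive g = 0, while rows 4 and 5 form a second cluster with g = 1.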

Addressing the comment:

library(data.table)
set.seed(1113)
x2 <- data.table(start=c(1,5,5),end=c(5,5,10),id=c(1,1,1),obs=rnorm(3))
x2[, g := c(0L, cumsum(shift(start, -1L) > cummax(end))[-.N]), by=.(id)]
itvl <- x2[, {
    s <- sort(c(start - 1L, start, end, end + 1L))
    as.data.table(matrix(s[s %between% c(min(start), max(end))], ncol=2L, byrow=TRUE))
    }, by=.(id, g)]
ans <- x2[itvl, on=.(id, start<=V1, end>=V1),
    .(id1=id, start1=V1, end1=V2, i.start1=x.start, i.end1=x.end, obs),
    allow.cartesian=TRUE]
ans[start1 >= i.start1 & end1 <= i.end1]

Output:

   id1 start1 end1 i.start1 i.end1         obs
1:   1      1    4        1      5 -0.79701638
2:   1      4    5        1      5 -0.79701638
3:   1      5    5        1      5 -0.79701638
4:   1      5    5        5      5 -0.09251333
5:   1      5    5        5     10 -0.08118742
6:   1      5    6        5     10 -0.08118742
7:   1      6   10        5     10 -0.08118742

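【Comments】:

This dataset seems to break chinsoon12's solution a little (it creates too many rows): x
@Michael, what is your expected result for this dataset?
For this dataset, the interval [1,5] should return {[1,4], [5,5]}. The interval [5,5] should return {[5,5]}, and the interval [5,10] should return {[5,5],[6,10]}. Your algorithm seems to return the correct intervals {[1,4],[5,5],[6,10]}, but does not repeat them the correct number of times.
@Michael, I have added a possible approach in which [5,5] is duplicated 3 times. Does that work for you?

【Solution 2】: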

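Here is my solution. It is based on the algorithm here (https://softwareengineering.stackexchange.com/questions/363091/split-overlapping-ranges-into-all-unique-ranges?newreg=93383e379afe4dd3a595480528ee1541), but uses data.table, shift, and vectorized ifelse statements for efficiency. It also differs from that algorithm in that my code allows the operation to be performed separately for multiple datasets identified by id_column. My approach also skips keeping track of rows (i.e. "attributes"), since there is no need to define them when the intervals can easily be merged back onto the original data with foverlaps. foverlaps also serves to exclude gaps.

Please let me know if you spot any inefficiencies.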

remove_overlaps <- function(x, start_column, end_column, id_column=NULL){

  xd <- melt(x[,c(start_column,end_column,id_column),with=FALSE],id=id_column)

  xd[variable==start_column,end:=FALSE]
  xd[variable==end_column,end:=TRUE]
  setorderv(xd,c(id_column, "value","end"))

  xd[,end_next:=shift(end,type="lead"),by=id_column]
  xd[,value_next:=shift(value,type="lead"),by=id_column]


  #excluding end_next when missing should cause this to ignore the last row in each group
  #because this element will be NA as defined by shift
  temp <- xd[,.SD[!is.na(end_next),list(
    start=ifelse(!end,value,value+1),
    end=ifelse(!end_next,value_next-1,value_next)
  )],by=id_column]

  temp <- temp[end>=start]

  setnames(temp , c("start","end"),c(start_column,end_column))

  setkeyv(temp,c(id_column,start_column,end_column))

  out <- foverlaps(x,temp)
  setorderv(out, c(id_column,start_column,
                   paste0("i.",start_column),
                   paste0("i.",end_column)
  ))

  out
}
remove_overlaps(x, start_column="start1",end_column="end1",id_column="id1")
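
For intuition, the core of the splitting step can be sketched in base R for a single group. This is a simplified reformulation rather than the function above: split_points is a hypothetical helper name, it assumes closed, integer-valued intervals, and it keeps any gaps between clusters as pieces, which would still need to be dropped (for example by joining back to the original data, as the foverlaps call above does).

```r
# Hypothetical helper (not part of the answer): split one group's closed,
# integer-valued intervals at every boundary.
split_points <- function(start, end) {
  # A new piece opens at each start and right after each end
  cuts <- sort(unique(c(start, end + 1)))
  # Each piece runs from one cut to just before the next cut
  data.frame(start = head(cuts, -1), end = tail(cuts, -1) - 1)
}

# First cluster of id 1 from the question's sample data
pieces <- split_points(start = c(1, 7, 9), end = c(10, 12, 15))
print(pieces)
```

On this input the pieces are [1,6], [7,8], [9,10], [11,12] and [13,15], which match the distinct start1/end1 pairs for those rows in the desired output.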

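Also, I do not think the suggestion linked on that page for how to exclude gaps is correct.

This answer does not account for gaps (gaps should not appear in the output), so I refined it:
* If e=false, add a to S. If e=true, take
* Define n'=n if e=false or n'=n+1 if e=true
* Define m'=m-1 if f=false or m'=m if f=true
* If n'

Here is a second version of this algorithm implemented in R: remove_overlaps does not explicitly use silentman.it's suggestion for excluding gaps, whereas remove_overlaps1 does use it. Note that both functions exclude gaps through the subsequent call to foverlaps, which only returns intervals that at least partially match intervals in x (the original data).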

library(data.table)



remove_overlaps1 <- function(x, start_column, end_column, id_column=NULL){

  xd <- melt(x[,c(start_column,end_column,id_column),with=FALSE],id=id_column)

  xd[variable==start_column,end:=FALSE]
  xd[variable==end_column,end:=TRUE]
  setorderv(xd,c(id_column, "value","end"))

  xd[,end_next:=shift(end,type="lead"),by=id_column]
  xd[,value_next:=shift(value,type="lead"),by=id_column]

###subset to rows where (e & !f) = FALSE, as per comment suggestion on linked answer
  temp <- xd[,.SD[!is.na(end_next)&!(end & !end_next),list(
    start=ifelse(!end,value,value+1),
    end=ifelse(!end_next,value_next-1,value_next)
  )],by=id_column]

  temp <- temp[end>=start]

  setnames(temp , c("start","end"),c(start_column,end_column))

  setkeyv(temp,c(id_column,start_column,end_column))


  out <- foverlaps(x,temp) #this should exclude gaps since foverlaps by default subsets to 
  setorderv(out, c(id_column,start_column,
                   paste0("i.",start_column),
                   paste0("i.",end_column)
  ))

  out
}

Sample data:

library(data.table)
x <-
  structure(
    list(
      native_id = c(
        "1",
        "1",
        "1",
        "1",
        "1"
      ),
      n_start_date = c(14761, 14775,
                       14789, 14803, 14817),
      n_end_date = c(14776, 14790, 14804, 14818,
                     14832),
      obs = c(
        31.668140525481,
        34.8623263656539,
        35.0841466093899,
        37.2281249364127,
        36.3726151694052
      )
    ),
    row.names = c(NA,-5L),
    class = "data.frame",
    .Names = c("native_id",
               "n_start_date", "n_end_date", "obs")
  )

setDT(x)

> x
   native_id n_start_date n_end_date      obs
1:         1        14761      14776 31.66814
2:         1        14775      14790 34.86233
3:         1        14789      14804 35.08415
4:         1        14803      14818 37.22812
5:         1        14817      14832 36.37262

Result:

> remove_overlaps(x, start_column="n_start_date",end_column="n_end_date",id_column="native_id")
    native_id n_start_date n_end_date i.n_start_date i.n_end_date      obs
 1:         1        14761      14774          14761        14776 31.66814
 2:         1        14775      14776          14761        14776 31.66814
 3:         1        14775      14776          14775        14790 34.86233
 4:         1        14777      14788          14775        14790 34.86233
 5:         1        14789      14790          14775        14790 34.86233
 6:         1        14789      14790          14789        14804 35.08415
 7:         1        14791      14802          14789        14804 35.08415
 8:         1        14803      14804          14789        14804 35.08415
 9:         1        14803      14804          14803        14818 37.22812
10:         1        14805      14816          14803        14818 37.22812
11:         1        14817      14818          14803        14818 37.22812
12:         1        14817      14818          14817        14832 36.37262
13:         1        14819      14832          14817        14832 36.37262

This looks incorrect; too many intervals are excluded:

>  remove_overlaps1(x, start_column="n_start_date",end_column="n_end_date",id_column="native_id")
    native_id n_start_date n_end_date i.n_start_date i.n_end_date      obs
 1:         1        14761      14774          14761        14776 31.66814
 2:         1        14775      14776          14761        14776 31.66814
 3:         1        14775      14776          14775        14790 34.86233
 4:         1        14789      14790          14775        14790 34.86233
 5:         1        14789      14790          14789        14804 35.08415
 6:         1        14803      14804          14789        14804 35.08415
 7:         1        14803      14804          14803        14818 37.22812
 8:         1        14817      14818          14803        14818 37.22812
 9:         1        14817      14818          14817        14832 36.37262
10:         1        14819      14832          14817        14832 36.37262

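【Comments】:

Actually, it seems that the comment on the answer on the linked page softwareengineering.stackexchange.com/questions/363091/… does not work for excluding gaps, i.e. subsetting .SD to (!end | end_next): it removes more intervals than it should, so I removed it. My call to foverlaps excludes gaps anyway, so that step is not needed in this implementation.
For example: y = data.table(start=c(10, 20, 30), end=c(15, 35, 45), val1 = 1:3); setkey(y, start, end); foverlaps(x, y); Note that this result does not include the interval [10,15] from y. So in my answer, intervals in temp are not included in out unless they at least partially match an interval in the original data x.
Added a comparison of the function with and without the additional constraint to demonstrate this.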
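
The foverlaps behaviour described in the comments above can be checked with a small self-contained sketch. The comment does not define x, so the x used here is an assumption chosen for illustration; y is copied from the comment.

```r
library(data.table)

# Assumption: x is not given in the comment; these values are illustrative
x <- data.table(start = c(5, 31, 22, 16), end = c(8, 50, 25, 18), val2 = 7:10)
y <- data.table(start = c(10, 20, 30), end = c(15, 35, 45), val1 = 1:3)
setkey(y, start, end)

# One row per (x interval, overlapping y interval) pair; x rows without a
# match are kept with NA in y's columns
res <- foverlaps(x, y)
print(res)
```

No interval in x overlaps [10,15], so that interval from y does not appear in the start/end columns of the result. In remove_overlaps, temp plays the role of y, which is why split pieces that fall in a gap are not returned.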

【Solution 3】:

I wrote a package for this and some related functionality, intervalaverage:

library(data.table)
set.seed(1113)
start1 <- c(1,7,9, 17, 18,1,3,20)
end1 <- c(10,12,15, 20, 23,3,5,25)
id1 <- c(1,1,1,1,1,2,2,2)
obs <- rnorm(length(id1))
x <- data.table(start1,end1,id1,obs)

library(intervalaverage)

x[, start1:=as.integer(start1)]
x[, end1:=as.integer(end1)]
isolateoverlaps(x,interval_vars = c("start1","end1"),group_vars = "id1")

    id1 start end start1 end1         obs
 1:   1     1   6      1   10 -0.79701638
 2:   1     7   8      1   10 -0.79701638
 3:   1     9  10      1   10 -0.79701638
 4:   1     7   8      7   12 -0.09251333
 5:   1     9  10      7   12 -0.09251333
 6:   1    11  12      7   12 -0.09251333
 7:   1     9  10      9   15 -0.08118742
 8:   1    11  12      9   15 -0.08118742
 9:   1    13  15      9   15 -0.08118742
10:   1    17  17     17   20 -2.33312797
11:   1    18  20     17   20 -2.33312797
12:   1    18  20     18   23  0.26581138
13:   1    21  23     18   23  0.26581138
14:   2     1   2      1    3 -0.34314127
15:   2     3   3      1    3 -0.34314127
16:   2     3   3      3    5 -0.17196880
17:   2     4   5      3    5 -0.17196880
18:   2    20  25     20   25  0.11614842

y <- data.table(start1=c(1L,5L,5L),end1=c(5L,5L,10L),id=c(1L,1L,1L)) 
isolateoverlaps(y,interval_vars = c("start1","end1"),group_vars = "id")

   id start end start1 end1
1:  1     1   4      1    5
2:  1     5   5      1    5
3:  1     5   5      5    5
4:  1     5   5      5   10
5:  1     6  10      5   10

【Comments】: