Increasing the performance of visualising overlapping segments: answers

【Question title】: Increasing the performance of visualising overlapping segments
【Posted】: 2012-04-09 21:25:27
【Question】:

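I have a set of pairs of x points that I draw as segments along the x axis to create a custom read map in R.

Half the task of plotting these segments is working out their y positions so that no two overlapping segments end up on the same y level. For each segment, I iterate over the y levels from the first position until I reach one that does not yet contain a segment overlapping the current one. I then record the end position of the current segment and move on to the next.

The actual code is a function, as follows: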

# Dummy data
# A list of start and end positions for each segment along the X axis. Sorted by start.
# Passing the function few.reads draws a map in half a second. Passing it many.reads takes about half an hour to complete.
few.reads <- data.frame( start=c(rep(10,150), rep(16,100), rep(43,50)), end=c(rep(30,150), rep(34,100), rep(57,50)) );
many.reads <- data.frame( start=c(rep(10,15000), rep(16,10000), rep(43,5000)), end=c(rep(30,15000), rep(34,10000), rep(57,5000)) );

#---
# A function to draw a series of overlapping segments (or "reads") along
# the x axis. Where reads overlap, they are "stacked" down the y axis
#---
drawReads <- function(reads){

    # sort the reads by their start positions
    reads <- reads[order(reads$start),];

    # minimum and maximum for x axis
    minstart <- min(reads$start);
    maxend <- max(reads$end);

    # initialise yread: a list to keep track of used y levels
    yread <- c(minstart - 1);
    ypos <- c(); #holds the y position of the ith segment

    #---
    # This iteration step is the bottleneck. Worst case, when all reads are stacked on top
    # of each other, it has to iterate over many y levels to find the correct position for
    # the later reads
    #---
    # iterate over segments
    for (r in 1:nrow(reads)){
        read <- reads[r,];
        start <- read$start;
        placed <- FALSE;

        # iterate through yread to find the next available
        # y pos at this x pos (start)
        y <- 1;
        while(!placed){

            if(yread[y] < start){
                ypos[r] <- y;
                yread[y] <- read$end;
                placed <- TRUE;
            } 

            # current y pos is used by another segment, increment
            y <- y + 1;
            # initialize another y pos if we're at the end of the list
            if(y > length(yread)){
                yread[y] <- minstart-1;
            }
        }
    }

    #---
    # This is the plotting step
    # Once we are here the rest of the process is very quick
    #---
    # find the maximum y pos that is used to size up the plot
    maxy <- length(yread);
    miny <- 1;

    reads$ypos <- ypos + miny;

    print("New Plot...")
    # Now we have all the information, start the plot
    plot.new();
    plot.window(xlim=c(minstart, maxend+((maxend-minstart)/10)), ylim=c(1,maxy));

    axis(3,xaxp=c(minstart,maxend,(maxend-minstart)/10));
    axis(2, yaxp=c(miny,maxy,3),tick=FALSE,labels=FALSE);

    print("Draw the reads...");
    maxy <- max(reads$ypos);
    segments(reads$start, maxy-reads$ypos, reads$end, maxy-reads$ypos, col="blue");   
}

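My actual data set is very large and, as far as I can tell, contains regions with up to 600,000 reads. Reads naturally stack on top of each other, so the worst case, where all reads overlap one another, is easy to hit. The time it takes to plot a large number of reads is unacceptable to me, so I am looking for a way to make the process more efficient. Can I replace my loop with something faster? Is there an algorithm that arranges the reads more quickly? I really can't think of a better way of doing this at the moment.

Thanks for your help.

【Comments】:

Never mind the drawing, how could you possibly interpret a plot with 600,000 lines on it?
I am writing out these maps so that I can manually pick regions of my data that have particular features in their read layout. If a lot of reads pile up, they end up squashed into a rectangle with a wavy edge. At that point the map still shows me something, even if turning it into a histogram might be better. You make a good point though, I may be going down a fairly irrelevant path.

Tags: performance algorithm r graph segments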

【Solution 1】:

Fill each y level in a greedy fashion. Once a level is full, move down to the next level and never go back up.

Pseudocode:

 y <- 1
 while segment-list.not-empty
   i <- 1
   while (i > 0)
     current <- segment-list[i]
     current.plot(y)
     segment-list.remove(i)
     # first remaining segment whose start is greater than current.end, 0 if none
     i <- segment-list.find_first_greater(current.end)
   y <- y + 1

This will not necessarily produce the "best" plot in any sense, but at least it is O(n log n).
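The greedy idea above could be sketched in R roughly as follows. This is only an illustration under some assumptions, not the answerer's code: `assignLevels` is a made-up name, segments that merely touch (end equal to the next start) are treated as overlapping as in the question's `yread[y] < start` test, and removal from the list is replaced by a "next unplaced index" pointer vector with path compression so that nothing is physically deleted. All binary searches are done up front in one `findInterval()` call.

```r
# Hypothetical helper: returns a y level (1, 2, ...) for every segment.
assignLevels <- function(start, end) {
  n <- length(start)
  ord <- order(start)          # process segments in order of start position
  s <- start[ord]
  e <- end[ord]

  # jump[i]: index (in sorted order) of the first segment starting after e[i],
  # or n + 1 if there is none.
  jump <- findInterval(e, s) + 1L

  # nxt[i] leads towards the first still-unplaced index >= i.
  # Index n + 1 is a sentinel meaning "nothing left to the right".
  nxt <- seq_len(n + 1L)
  level <- integer(n)

  y <- 1L
  target <- 1L                 # where to start looking on the current level
  placed <- 0L
  while (placed < n) {
    # Follow the pointers from target to the first unplaced index.
    root <- target
    while (nxt[root] != root) root <- nxt[root]
    # Path compression: point everything on the way straight at root.
    j <- target
    while (nxt[j] != root) {
      tmp <- nxt[j]
      nxt[j] <- root
      j <- tmp
    }
    if (root > n) {
      # Nothing else fits on this level: start the next one from the left.
      y <- y + 1L
      target <- 1L
    } else {
      level[root] <- y
      nxt[root] <- root + 1L   # mark as placed
      placed <- placed + 1L
      target <- jump[root]     # next candidate must start after this end
    }
  }

  # Return the levels in the original (unsorted) row order.
  result <- integer(n)
  result[ord] <- level
  result
}

# Usage with the question's small data set:
few.reads <- data.frame(start = c(rep(10, 150), rep(16, 100), rep(43, 50)),
                        end   = c(rep(30, 150), rep(34, 100), rep(57, 50)))
few.reads$ypos <- assignLevels(few.reads$start, few.reads$end)
max(few.reads$ypos)  # 250 levels for this data
```

The resulting `ypos` column can then be passed to `segments()` as in the question's plotting step. The main loop is still interpreted R, so how fast it is on 600,000 reads would need to be measured.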

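【Discussion】:

Doesn't this depend on how fast segment-list.find_first_greater(current.end) is? Aren't we essentially iterating over the segments for every y level?
The list is sorted by segment start, so a binary search is possible. Removal might be a problem, but that can be solved by using a tree.
Finding a fast way to perform these low-level operations on a list in R might be interesting, but that is a different question. Thanks for showing me this approach!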

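【Solution 2】:

Can't you sort by start value? Then you walk the list from front to back. For each item, plot it, then do a binary search on the rest of the list for the first item whose start is greater than the end coordinate of the item you just plotted. If none is found, increment Y. Remove each item as you plot it.

Sorting is O(N lg N), and the binary search for each item is O(lg N), so the total is O(N lg N).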
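For the binary-search step, base R's `findInterval()` may be enough. The snippet below is a small illustration with made-up values. It assumes `starts` is sorted in non-decreasing order and that a segment only fits on the same level if it starts strictly after the previous end.

```r
# Sorted start positions and the end of the segment just plotted (made-up values).
starts <- c(10, 10, 16, 43, 43)
current_end <- 30

# findInterval() counts how many starts are <= current_end using binary search,
# so adding 1 gives the index of the first segment starting after current_end.
next_idx <- findInterval(current_end, starts) + 1L

# If next_idx is past the end of the vector, nothing more fits on this level.
fits <- next_idx <= length(starts)

# The same lookup for every segment at once, given a vector of end positions.
ends <- c(30, 30, 34, 57, 57)
jump <- findInterval(ends, starts) + 1L
```

As far as I know, `findInterval()` checks that its second argument is sorted on every call, so the vectorised form over all end positions is likely cheaper than calling it once per segment inside a loop.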

【Discussion】: