计算树上两个位置之间距离的性能？答案

【问题标题】：Performence for calculating the distance between two positions on a tree?计算树上两个位置之间距离的性能？
【发布时间】：2015-04-01 20:44:24
【问题描述】：

这是一棵树。第一列是分支的标识符，其中0 是主干，L 是左侧的第一个分支，R 是右侧的第一个分支。 LL 是第二次分叉后最左边的分支，等等。变量length 包含每个分支的长度。

> tree
  branch length
1      0     20
2      L     12
3     LL     19
4      R     19
5     RL     12
6    RLL     10
7    RLR     12
8     RR     17

tree = data.frame(branch = c("0","L", "LL", "R", "RL", "RLL", "RLR", "RR"), length=c(20,12,19,19,12,10,12,17))
tree$branch = as.character(tree$branch)

这是这棵树的图画

这是这棵树上的两个位置

posA = tree[4,]; posA$length = 12
posB = tree[6,]; posB$length = 3

位置由分支 ID 和到分支原点的距离（变量 length）给出（更多信息在编辑中）。

我编写了以下凌乱的distance 函数来计算树上任意两点之间的沿树枝的最短距离。 沿着树枝的最短距离可以理解为蚂蚁沿着树枝行走从另一个位置到达一个位置所需的最小距离。

distance = function(tree, pos1, pos2){
    if (identical(pos1$branch, pos2$branch)){Dist=pos1$length-pos2$length;return(Dist)}
    pos1path = strsplit(pos1$branch, "")[[1]]
    if (pos1path[1]!="0") {pos1path = c("0", pos1path)}
    pos2path = strsplit(pos2$branch, "")[[1]]
    if (pos2path[1]!="0") {pos2path = c("0", pos2path)}
    loop = 1:min(length(pos1path), length(pos2path))
    loop = loop[-which(loop == 1)]

    CommonTrace="included"; for (i in loop) {
        if (pos1path[i] != pos2path[i]) {
            CommonTrace = i-1; break
            }
        }

    if(CommonTrace=="included"){
        CommonTrace = min(length(pos1path), length(pos2path))
        if (length(pos1path) > length(pos2path)) {
            longerpos = pos1; shorterpos = pos2; longerpospath = pos1path
        } else {
            longerpos = pos2; shorterpos = pos1; longerpospath = pos2path
        }
        distToNode = 0
        if ((CommonTrace+1) != length(longerpospath)){
            for (i in (CommonTrace+1):(length(longerpospath)-1)){
                distToNode = distToNode + tree$length[tree$branch == paste0(longerpospath[2:i], collapse='')]
            }   
        }
        Dist = distToNode + longerpos$length + (tree[tree$branch == shorterpos$branch,]$length-shorterpos$length)
        if (identical(shorterpos, pos1)){Dist=-Dist}
        return(Dist)
    } else { # if they are sisterbranch
        Dist=0 
        if((CommonTrace+1) != length(pos1path)){
            for (i in (CommonTrace+1):(length(pos1path)-1)){
                Dist = Dist + tree$length[tree$branch == paste0(pos1path[2:i], collapse='')]
            }   
        }
        if((CommonTrace+1) != length(pos2path)){
            for (i in (CommonTrace+1):(length(pos2path)-1)){
                Dist = Dist + tree$length[tree$branch == paste(pos2path[2:i], collapse='')]
            }
        }
        Dist = Dist + pos1$length + pos2$length
        return(Dist)
    }
}

我认为该算法运行良好，但效率不高。注意重要的距离符号。仅当在“姐妹分支”上找不到两个位置时，此标志才有意义。只有当两个位置中的一个位于根和另一个位置之间的路上时，该符号才有意义。

distance(tree, posA, posB) # -22

然后我就这样遍历所有感兴趣的位置：

allpositions=rbind(tree, tree)
allpositions$length = c(1,5,8,2,2,3,5,6,7,8,2,3,1,2,5,6)
mat = matrix(-1, ncol=nrow(allpositions), nrow=nrow(allpositions))
    for (i in 1:nrow(allpositions)){
       for (j in 1:nrow(allpositions)){
          posA = allpositions[i,]
          posB = allpositions[j,]
          mat[i,j] = distance(tree, posA, posB)
       }
    }

#     1   2   3   4   5   6   7   8  9  10  11  12  13  14  15  16
# 1   0 -24 -39 -21 -40 -53 -55 -44 -6 -27 -33 -22 -39 -52 -55 -44
# 2  24   0 -15   7  26  39  41  30 18  -3  -9   8  25  38  41  30
# 3  39  15   0  22  41  54  56  45 33  12   6  23  40  53  56  45
# 4  21   7  22   0 -19 -32 -34 -23 15  10  16  -1 -18 -31 -34 -23
# 5  40  26  41  19   0 -13 -15   8 34  29  35  18   1 -12 -15   8
# 6  53  39  54  32  13   0   8  21 47  42  48  31  14   1   8  21
# 7  55  41  56  34  15   8   0  23 49  44  50  33  16   7   0  23
# 8  44  30  45  23   8  21  23   0 38  33  39  22   7  20  23   0
# 9   6 -18 -33 -15 -34 -47 -49 -38  0 -21 -27 -16 -33 -46 -49 -38
# 10 27   3 -12  10  29  42  44  33 21   0  -6  11  28  41  44  33
# 11 33   9  -6  16  35  48  50  39 27   6   0  17  34  47  50  39
# 12 22   8  23   1 -18 -31 -33 -22 16  11  17   0 -17 -30 -33 -22
# 13 39  25  40  18  -1 -14 -16   7 33  28  34  17   0 -13 -16   7
# 14 52  38  53  31  12  -1   7  20 46  41  47  30  13   0   7  20
# 15 55  41  56  34  15   8   0  23 49  44  50  33  16   7   0  23
# 16 44  30  45  23   8  21  23   0 38  33  39  22   7  20  23   0

例如，让我们考虑对象allpositions 中的第一个和第三个位置。它们之间的距离是39（和-39），因为蚂蚁需要在0 分支上行走19 个单位，然后在L 分支上行走12 个单位，最后蚂蚁需要在8 个单位上行走在分支上LL。 19 + 12 + 8 = 39

问题是我有大约 20 棵非常大的树，大约有 50000 个位置，我想计算任意两个位置之间的距离。因此需要计算 20 * 50000^2 的距离。它需要永远！你能帮我改进我的代码吗？

编辑

如果还有什么不清楚的地方请告诉我

tree 是对树的描述。这棵树有某个length 的分支。分支的名称（变量：branch）表明了分支之间的关系。分支RL 是两个分支RLL 和RLR 的“父分支”，其中R 和L 代表左右。

allpositions 是一个 data.frame，其中每一行代表树上的一个独立位置。你可以想象松鼠的位置。该位置由两个信息定义。 1）松鼠站立的树枝（变量：branch）以及树枝起点到松鼠位置的距离（变量：length）。

三个例子

假设第一只松鼠位于分支RL（长度为12）上的位置（变量：length）8，第二只松鼠位于分支的位置（变量：length）2 RLL 或 RLR。两只松鼠之间的距离是12 - 8 + 2 = 6（或-6）。

假设第一只松鼠位于分支RL 上的位置（变量：length）8，第二只松鼠位于分支RR 的位置（变量：length）2。两只松鼠之间的距离是8 + 2 = 10（或-10）。

考虑第一只松鼠位于分支R（长度为19）上的位置（变量：length）8，第二只松鼠位于分支的位置（变量：length）2 RLL。知道那个分支RL的长度是12，两只松鼠之间的距离是19 - 8 + 12 + 2 = 25（或-25）。

【问题讨论】：

发布的代码和数据给了我：Error in pos1$branch : $ operator is invalid for atomic vectors -- 请更新，以便成为可重现的示例。
您的branch 变量是字符向量吗？我认为这可能会导致问题。我添加了两行，以便清楚地了解如何构建 tree 对象。它解决了问题吗？感谢您的评论。
我刚刚复制了您发布的用于创建tree 的代码，复制了您的distance 函数，并在运行distance(tree, 1, 2) 时遇到了错误。
该函数接受参数tree、pos1 和pos2。尝试运行distance(tree, pos1, pos2)。我重新启动了 R，它仍然可以在我的计算机上运行。它对你的不起作用吗？
哦，我假设allpositions 存储了索引，因为您使用它来索引mat。如果您有类似allpositions <- list(pos1, pos2) 的东西，那么您的最后一段代码将无法工作（因为mat[i,j] 将无法工作）。这就是让我失望的原因。你能更新最后一点，让它可以运行吗？这将涉及添加allpositions 和初始化mat。

标签： r performance tree

【解决方案1】：

下面的代码使用igraph 包来计算tree 中位置之间的距离，并且似乎比您在问题中发布的代码要快得多。该方法是在分支交叉点和沿树枝的位置在allpositions 中指定的位置创建图顶点。图边是这些顶点之间的分支段。它使用igraph 为树和allpositions 构建图，然后找到与allposition 数据对应的顶点之间的距离。

t.graph <- function(tree, positions) {
  library(igraph)
  #  Assign vertex name to tree branch intersections
  n_label <- nchar(tree$branch)
  tree$high_vert <- tree$branch
  tree$low_vert <- tree$branch
  tree$brnch_type <- "tree"
  for( i in 1:nrow(tree) ) {
    tree$low_vert[i] <- if(n_label[i] > 1) substr(tree$branch[i], 1, n_label[i]-1)
    else { if(tree$branch[i] %in% c("R","L")) "0"
           else "root" }
  }
  #  combine position data with tree data    
  positions$brnch_type <- "position"
  temp <- merge(positions, tree, by = "branch")
  positions <- temp[, c("branch","length.x","high_vert","low_vert","brnch_type.x")]
  positions$high_vert <- paste(positions$branch, positions$length.x, sep="_")
  colnames(positions) <- c("branch","length","high_vert","low_vert","brnch_type")
  tree <- rbind(tree, positions)
  #   use positions to segment tree branches    
  tree_brnch <- split(tree, tree$branch)
  tree <- data.frame( branch=NA_character_, length = NA_real_, high_vert = NA_character_, 
                      low_vert = NA_character_, brnch_type =NA_character_, seg_len= NA_real_)
  for( ib in 1: length(tree_brnch)) {
    brnch_seg <- tree_brnch[[ib]][order(tree_brnch[[ib]]$length, decreasing=TRUE), ]
    n_seg <- nrow(brnch_seg)
    brnch_seg$seg_len <- brnch_seg$length
    for( is in 1:(n_seg-1) ) {
      brnch_seg$seg_len[is] <- brnch_seg$length[is] - brnch_seg$length[is+1]
      brnch_seg$low_vert[is] <- brnch_seg$high_vert[is+1]
    }
    tree  <- rbind(tree, brnch_seg)
  }
  tree <- tree[-1,]
  #  Create graph of tree and positions
  tree_graph <- graph.data.frame(tree[,c("low_vert","high_vert")])  
  E(tree_graph)$label <- tree$high_vert
  E(tree_graph)$brnch_type <- tree$brnch_type
  E(tree_graph)$weight <- tree$seg_len
  #  calculate shortest distances between position vertices
  position_verts <- V(tree_graph)[grep("_", V(tree_graph)$name)] 
  vert_dist <- shortest.paths(tree_graph, v=position_verts, to=position_verts, mode="all")  
  return(dist_mat= vert_dist )
}

我已经通过使用您的 distance 函数为您的代码在 allposition 数据上创建一个名为 Remi 的函数，针对您问题中发布的代码对 igraph 代码（t.graph 函数）进行了基准测试。示例树是作为树的扩展创建的，allpositions 数据为 64、256 和 2048 个分支的树，allpositions 等于这些大小的两倍。执行时间的比较如下所示。请注意，时间以毫秒为单位。

 microbenchmark(matR16 <- Remi(tree, allpositions), matG16 <- t.graph(tree, allpositions),
                matR256 <- Remi(tree256, allpositions256), matG256 <- t.graph(tree256, allpositions256), times=2)
Unit: milliseconds
                                         expr          min           lq         mean       median           uq          max neval
           matR8 <- Remi(tree, allpositions)     58.82173     58.82173     59.92444     59.92444     61.02714     61.02714     2
        matG8 <- t.graph(tree, allpositions)     11.82064     11.82064     13.15275     13.15275     14.48486     14.48486     2
    matR256 <- Remi(tree256, allpositions256) 114795.50865 114795.50865 114838.99490 114838.99490 114882.48114 114882.48114     2
 matG256 <- t.graph(tree256, allpositions256)    379.54559    379.54559    379.76673    379.76673    379.98787    379.98787     2

与您发布的代码相比，igraph 的结果对于 8 个分支的情况仅快 5 倍，但对于 256 个分支的结果却快了 300 倍以上，因此 igraph 似乎可以更好地扩展大小。我还对 2048 分支案例的 igraph 代码进行了基准测试，结果如下。再次以毫秒为单位。

microbenchmark(matG8 <- t.graph(tree, allpositions), matG64 <- t.graph(tree64, allpositions64),
               matG256 <- t.graph(tree256, allpositions256),  matG2k <- t.graph(tree2k, allpositions2k), times=2)
Unit: milliseconds
                                         expr         min          lq        mean      median          uq         max neval
         matG8 <- t.graph(tree, allpositions)    11.78072    11.78072    12.00599    12.00599    12.23126    12.23126     2
    matG64 <- t.graph(tree64, allpositions64)    73.29006    73.29006    73.49409    73.49409    73.69812    73.69812     2
 matG256 <- t.graph(tree256, allpositions256)   377.21756   377.21756   410.01268   410.01268   442.80780   442.80780     2
    matG2k <- t.graph(tree2k, allpositions2k) 11311.05758 11311.05758 11362.93701 11362.93701 11414.81645 11414.81645     2

因此，大约 4000 个位置的距离矩阵可以在 12 秒内计算出来。 t.graph 返回距离矩阵，其中矩阵的行和列由branch names - position on the branch 标记，例如

      0_7 0_1 L_8 L_5 LL_8 LL_2 R_3 R_2 RL_2 RL_1 RLL_3 RLL_2 RLR_5 RR_6
L_5    18  24   3   0   15    9   8   7   26   25    39    38    41   30

显示从L-5（沿 L 分支 5 个单位的位置）到其他位置的距离。我不知道这会处理您最大的案件，但它可能对某些人有所帮助。您的最大箱子的存储要求也存在问题。

【讨论】：

实际上我不必计算姐妹分支的距离。这意味着要计算的距离数 50000^2 可以减少到大约 (5000/4)^2 （第一个除以 2 用于姐妹分支上的所有位置对，第二个用于计算事实上，矩阵的下三角只包含与上三角相同的信息，但符号相反）。我也可以减去对角线，所以到最后 $n$ 个位置有 $(n/4 - n)^2$ 距离要计算（但所有值都必须存储）。
矩阵被输入到我正在使用的软件中，我已经通过了这个软件的代码并询问了作者；没有简单的方法不指示所有成对距离。但是，我将按顺序创建 20 个矩阵，因此这不会导致任何内存问题，并且我可以在具有大量 RAM（我认为是 14gb）的机器上运行这些矩阵的构建和软件，所以它不应该是问题太大了。另外，我没有确切的职位数量，我只想拥有尽可能多的职位。所以我真正需要的是一个高效的算法。谢谢