R create edge list: answers

【Question title】: R create edge list
【Posted】: 2015-03-30 17:42:40
【Question description】:

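New to Stack Overflow, although I have been reading what you all post and answer for quite a while (I just never had time to join while working on my master's degree).

TL;DR: I use the script given below to process a dataset structured as follows and obtain the edges of a network. It works, but it takes far too long (24 hours for 2k rows). Any tips for a beginner R user on speeding it up?

In my last research project I ended up with a data.frame similar to the following: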

ID | Trait 1 | Trait 2 | Trait 3 | Trait 4 |  Trait 5
01 |  TRUE   |   TRUE  |  Photo  |   City  |  Portrait
02 |  FALSE  |   TRUE  |  Draw   |  Child  |  Portrait
03 |  TRUE   |  FALSE  |  Photo  |   Misc  |  Landscape
.
.
.

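This goes on for about 2k rows. The aim is to build a network in which each ID is a node and the number of traits two IDs have in common is the weight of the edge between them. For example, ID 01 would have an edge of weight 2 to both ID 02 and ID 03, while ID 02 would have no edge to ID 03.

To do this I used the following script, which goes row by row, comparing the value in each column and incrementing the weight (each match = +1), and skips pairs that have already been compared (the network is undirected, so there is no need to match both ways):

Key: Source = the ID being compared; Target = the ID it is compared against; Weight = the number of matching cells/traits.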

findEdges <- function(){
    input <- read.csv("nodes.csv",header=TRUE,stringsAsFactors=FALSE,sep=";")
    edges <- read.csv("edges.csv",header=TRUE,stringsAsFactors=FALSE,skip=1,colClasses=c("integer","integer","integer"),col.names=c("Source","Target","Weight"))
    for(i in 1:nrow(input)){ #row to be compared: Source
        for(j in 1:nrow(input)){ #rows that will compare to: Target
            weight <- 0
            if( i >= j ){
            } else {
                for(k in 1:ncol(input)){ #column by column comparison
                    col <- k
                    if(input[i,k] == input[j,k]){ #edge weight modifier
                        weight <- weight+1
                        }
                }
                print(c("source= ",i,"target= ",j,"weight= ",weight)) #visual feedback of running script
                newRow <- data.frame(Source=i,Target=j,Weight=weight) #create row for compared pair
                edges <- rbind(edges,newRow) # add edge row to data frame
            }
        }
    }
    write.csv(edges,"edges.csv") #write data frame to csv file
}
findEdges()

It works well and gives me the weighted edge list I need. Each row of the edge list looks like:

Source | Target | Weight
  01   |   02   |   2
  01   |   03   |   2

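and so on...

However, the script took almost 24 hours to process the whole dataset (2k rows, 5 columns besides the ID). That was not a problem at the time, but I thought it would be good to ask for tips on better/faster ways to get the same result.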

【Comments】:

To read about the (negative) efficiency impact of the line edges <- rbind(edges, newRow), I recommend Circle 2 of The R Inferno (burns-stat.com/pages/Tutor/R_inferno.pdf).
Thanks, I'll go have a look right away :)
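
The rbind point in the comment above can be illustrated with a small sketch in base R. Growing a data frame one row at a time makes R copy the accumulated rows on each iteration, whereas filling vectors that were allocated once and building the data frame at the end avoids that. The function names grow_rows and prealloc_rows are made up for this illustration and do not come from the thread.

```r
# Hypothetical helpers, for illustration only.
grow_rows <- function(n) {
  edges <- data.frame(Source = integer(0), Target = integer(0), Weight = integer(0))
  for (i in seq_len(n)) {
    # rbind builds a new, longer data frame on every iteration
    edges <- rbind(edges, data.frame(Source = i, Target = i + 1L, Weight = 0L))
  }
  edges
}

prealloc_rows <- function(n) {
  # allocate the full-length columns once, then fill them in place
  src <- integer(n)
  tgt <- integer(n)
  wt  <- integer(n)
  for (i in seq_len(n)) {
    src[i] <- i
    tgt[i] <- i + 1L
    wt[i]  <- 0L
  }
  data.frame(Source = src, Target = tgt, Weight = wt)
}

a <- grow_rows(500)
b <- prealloc_rows(500)
# Both versions hold the same values; only the way they are built differs
all(a$Source == b$Source, a$Target == b$Target, a$Weight == b$Weight)
```

Wrapping each call in system.time() on your own machine shows how the gap widens as n grows.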

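Tags: r algorithm networking

【Solution 1】:

One approach is to handle each column separately, generating a pairwise similarity matrix between all rows. For example, suppose we are operating on a single column: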

col <- c(1, 1, 2, 3, 2, 4)
outer(col, col, "==") * 1
#      [,1] [,2] [,3] [,4] [,5] [,6]
# [1,]    1    1    0    0    0    0
# [2,]    1    1    0    0    0    0
# [3,]    0    0    1    0    1    0
# [4,]    0    0    0    1    0    0
# [5,]    0    0    1    0    1    0
# [6,]    0    0    0    0    0    1

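The outer function applies our operator (==) to every pair of elements and returns a matrix (the *1 just converts TRUE/FALSE to 1/0). A nice aspect is that this is a vectorized operation, so it runs very quickly compared with an approach built on for loops.

Now it is clear that all we need to do is compute a similarity matrix for each column and add them all up.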
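
To make the "every pair of elements" part concrete, here is a minimal sketch comparing outer with the explicit double loop it replaces, this time on a character column like the trait columns in the question. The variable names x, vec and loop are chosen for this illustration only.

```r
x <- c("Photo", "Draw", "Photo")

# Vectorized: one call compares every pair of elements
vec <- outer(x, x, "==")

# Explicit double loop computing the same matrix, shown only for comparison
n <- length(x)
loop <- matrix(FALSE, n, n)
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    loop[i, j] <- x[i] == x[j]
  }
}

# The two matrices hold the same values
all(vec == loop)
```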

(dat <- data.frame(ID=c(1, 2, 3), T1=c(F, F, T), T2=c(T, T, F), T3=c("Photo", "Draw", "Photo"), T4=c("City", "Child", "Misc"), T5=c("Portrait", "Portrait", "Landscape")))
#   ID    T1    T2    T3    T4        T5
# 1  1 FALSE  TRUE Photo  City  Portrait
# 2  2 FALSE  TRUE  Draw Child  Portrait
# 3  3  TRUE FALSE Photo  Misc Landscape
(res <- Reduce("+", lapply(2:ncol(dat), function(x) outer(dat[,x], dat[,x], "=="))))
#      [,1] [,2] [,3]
# [1,]    5    3    1
# [2,]    3    5    0
# [3,]    1    0    5

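This has determined that every row shares all 5 columns with itself. In addition, rows 1 and 2 have 3 elements in common, rows 1 and 3 have 1 element in common, and rows 2 and 3 have none in common.

You can easily convert the graph from a wide representation to a long one at the end (here I filter out self-links and edges where source id > target id):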

subset(cbind(expand.grid(Source=dat$ID, Target=dat$ID), Weight=as.vector(res)),
       Source < Target)
#   Source Target Weight
# 4      1      2      3
# 7      1      3      1
# 8      2      3      0
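
As a variation on the wide-to-long step, the same edge list can be read directly from the cells above the diagonal of the matrix. This is a sketch rather than part of the original answer, and it assumes the rows are already sorted by ID, so that "above the diagonal" means Source < Target. The toy data is repeated so the block runs on its own.

```r
# Same toy data as above, repeated so this block is self-contained
dat <- data.frame(ID = c(1, 2, 3),
                  T1 = c(FALSE, FALSE, TRUE),
                  T2 = c(TRUE, TRUE, FALSE),
                  T3 = c("Photo", "Draw", "Photo"),
                  T4 = c("City", "Child", "Misc"),
                  T5 = c("Portrait", "Portrait", "Landscape"),
                  stringsAsFactors = FALSE)
res <- Reduce("+", lapply(2:ncol(dat), function(x) outer(dat[, x], dat[, x], "==")))

# Row/column positions of the cells strictly above the diagonal
idx <- which(upper.tri(res), arr.ind = TRUE)

edges <- data.frame(Source = dat$ID[idx[, "row"]],
                    Target = dat$ID[idx[, "col"]],
                    Weight = res[idx])   # a two-column index matrix picks one cell per row
edges <- edges[order(edges$Source, edges$Target), ]
edges
```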

Benchmarking shows that the vectorized outer function gives us a huge advantage over the for loops:

set.seed(144)
big.dat <- data.frame(ID=1:100, A=sample(letters, 100, replace=T), B=sample(letters, 100, replace=T), C=sample(1:10, 100, replace=T))
OP <- function(dat) {
  edges <- data.frame(Source=c(), Target=c(), Weight=c())
  for (i in 1:nrow(dat)) {
    for (j in 1:nrow(dat)) {
      if (i < j) {
        weight <- 0
        for (k in 2:ncol(dat)) {
          if (dat[i,k] == dat[j,k]) {
            weight <- weight + 1
          }
        }
        edges <- rbind(edges, data.frame(Source=i, Target=j, Weight=weight))
      }
    }
  }
  edges
}
josilber <- function(dat) {
  res <- Reduce("+", lapply(2:ncol(dat), function(x) outer(dat[,x], dat[,x], "==")))
  ret <- subset(cbind(expand.grid(Source=dat$ID, Target=dat$ID), Weight=as.vector(res)), Source < Target)
  # Changes to exactly match OP's output
  ret <- ret[order(ret$Source, ret$Target),]
  row.names(ret) <- NULL
  ret
}
all.equal(OP(big.dat), josilber(big.dat))
# [1] TRUE
library(microbenchmark)
microbenchmark(OP(big.dat), josilber(big.dat), times=10)
# Unit: milliseconds
#               expr         min          lq        mean      median          uq         max neval
#        OP(big.dat) 5931.354448 6062.872595 6137.497152 6076.736039 6175.002149 6519.977217    10
#  josilber(big.dat)    5.882283    5.914646    6.316981    5.978082    6.368297    8.801991    10

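For this 100-row example, the vectorized approach achieves a speedup of roughly 1000x.

【Comments】:

Oh... that's cool. I didn't know about the outer function, so I'll hit its documentation as soon as I can. Thanks a lot (I'll try the code and see how it goes).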
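
If building a full n-by-n matrix per column is a concern for larger inputs, one alternative (a different technique from the outer approach benchmarked above, and not benchmarked here) is to enumerate only the pairs with i < j using combn and compare the columns at those positions. The sketch below uses the 3-row toy data from earlier; the variable names are chosen for this illustration.

```r
# Toy data from the answer, repeated so this block is self-contained
dat <- data.frame(ID = c(1, 2, 3),
                  T1 = c(FALSE, FALSE, TRUE),
                  T2 = c(TRUE, TRUE, FALSE),
                  T3 = c("Photo", "Draw", "Photo"),
                  T4 = c("City", "Child", "Misc"),
                  T5 = c("Portrait", "Portrait", "Landscape"),
                  stringsAsFactors = FALSE)

# All row-index pairs with i < j, one pair per column of the result
pairs <- combn(nrow(dat), 2)
i <- pairs[1, ]
j <- pairs[2, ]

# For each trait column, compare the values at positions i and j, then sum the matches
weight <- as.integer(Reduce("+", lapply(dat[-1], function(col) col[i] == col[j])))

edges <- data.frame(Source = dat$ID[i], Target = dat$ID[j], Weight = weight)
edges
```

This produces only the choose(n, 2) rows that end up in the edge list, so no filtering step is needed afterwards.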