Convert a comma-separated adjacency list into a 2-column edge list (to build an igraph object): answers

【Question title】: convert comma separated adjacency list into 2-column edgelist (to build igraph object)
【Posted】: 2016-04-14 09:34:17
【Question description】:

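I have searched through a lot of information on Stack Overflow looking for a solution, but I am stuck! I am learning R and igraph by reading and practicing, so please bear with me if the question is too simple :)

I have been using the code below to extract co-author text data (an adjacency list) from a Google Scholar profile page, and I want to turn it into a co-author network. I had no success with graph_from_adjlist in igraph; it did not build the network correctly. So I changed my approach and tried to convert the data into an edge list first and then use the more common graph_from_edgelist function. I found a solution here; it works fine when the number of rows (publications, in my case) is below 300, but beyond that it gives this error in R: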

Error in rep(x[1], length(x) - 1) : invalid 'times' argument
Called from: FUN(X[[i]], ...)
Browse[1]> Q

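Honestly, I do not understand the logic of the code that converts the columns of the adjacency list into a 2-column edge list, so I cannot work out what is going wrong.

Here is a short piece of my code (I have described each step in the inline comments):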

library(scholar)
library(igraph) 
# one scholar profile link (works fine with small number of authors)
scurl <- "https://scholar.google.com/citations?user=nG42BMAAAAAJ&hl=en"
# prof Welman google scholar link as an example that gives the above error
# scurl <- "https://scholar.google.com/citations?user=_q2NODAAAAAJ&hl=en"
citid <- strsplit((strsplit(scurl,"&",fixed = TRUE)[[1]][1]),"=",fixed = TRUE)[[1]][2]
# authors <- as.data.frame(cSplit(subset(get_publications(citid,flush = TRUE),select = "author"),splitCols = "author",sep = ",")) ## this I put to check if authors are extracting in a right way
pub <- get_publications(citid,flush = TRUE)
coauthors <- as.character(tolower(pub$author)) ##to make text differences less effective in result
adjlist=strsplit(coauthors,",") # splits the character strings into list with different vector for each line
col1 <- unlist(lapply(adjlist,function(x) rep(x[1],length(x)-1))) # establish first column of edgelist by replicating the 1st element (=ID number) by the length of the line minus 1 (itself)
col2 <- unlist(lapply(adjlist,"[",-1)) # the second line I actually don't fully understand this command, but it takes the rest of the ID numbers in the character string and transposes it to list vertically
edgelist <- cbind(col1,col2) # creates the edgelist by combining column 1 and 2.
coauthorgraph <- graph_from_edgelist(edgelist,directed = FALSE)
set.seed(333)
coauthorgraph$layout <- layout.circle
tkplot(coauthorgraph)

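I tried adding a (times=400) condition to the col2 line, but it did not help. I would be very grateful for any suggestions.

【Question comments】:

Tags: r igraph

【Solution 1】:

One column is every element of each vector except the first, and the other column is the first element of each vector repeated (length of the vector - 1) times. You can get the latter with rep(..., times=lengths(adjlist) - 1L). So, once you have pub, continue with: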

## tolower does character conversion, and remove the trailing "..."
coauthors <- sub('[ ,.]+$', '', tolower(pub$author))

## Make edgelist by repeating 1st elements each length(vector)-1L
adjlist <- strsplit(coauthors, '\\s*,\\s*')
edgelist <- cbind(
    unlist(lapply(adjlist, tail, -1L)),                        # col1
    rep(sapply(adjlist, `[`, 1L), times=lengths(adjlist)-1L)   # col2
)

## make graph
g <- graph_from_edgelist(edgelist, directed=FALSE)

## Offset labels a bit: nodes printed from +x-axis counter-clockwise
ord <- V(g)                                               # node order
theta <- seq(0, 2*pi-2*pi/length(ord), 2*pi/length(ord))  # angle
theta[theta>pi] <- -(2*pi - theta[theta>pi])              # map angles above pi to (-pi, 0)
dists <- rep(c(1, 0.7), length.out=length(ord))           # alternate distance

## Plot
plot(g, layout=layout.circle, vertex.label.degree=-theta, 
     vertex.label.dist=dists, vertex.label.cex=1.1,
     vertex.size=14, vertex.color='#FFFFCC', edge.color='#E25822')

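Update

The error comes from trying to pass a negative times argument (i.e., when one of the elements in the list contains no authors).

To add self-loops and drop the zero-character entries that appear in the new query, first filter the co-author list so that it only contains non-empty entries, then duplicate the elements of adjlist whose length == 1. The rest should stay the same.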
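As a minimal sketch of that failure mode, the snippet below uses made-up author strings (an assumption, not data from any real profile) in place of pub$author. An empty string splits into a zero-length vector, so length - 1 becomes -1 and rep() refuses it. A single-author entry gives times = 0, which raises no error but silently produces no edge.

```r
# Toy stand-in for tolower(pub$author); the names are invented for illustration.
# The third entry is an empty author string, like the ones in large profiles.
coauthors <- c("a smith,b jones", "c lee", "")

adjlist <- strsplit(coauthors, ",")
lengths(adjlist)                    # 2 1 0: the empty string gives character(0)

times <- lengths(adjlist) - 1L      # 1 0 -1: the last value is negative

# Single author: times = 0, so nothing is returned and the paper is dropped.
rep(adjlist[[2]][1], times[2])

# No authors: times = -1, which rep() rejects. Capture the message
# instead of stopping the script.
err <- tryCatch(
  rep(adjlist[[3]][1], times[3]),
  error = function(e) conditionMessage(e)
)
print(err)
```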

scurl <- "scholar.google.com/citations?user=xqefLxQAAAAJ&hl=en"
citid <- regmatches(scurl, gregexpr('(?<=user=)[[:alnum:]]+', scurl, perl=TRUE))
pub <- get_publications(citid, flush=TRUE)

## tolower does character conversion, and remove the trailing "..."
coauthors <- sub('[ ,.]+$', '', tolower(pub$author))
coauthors <- coauthors[nzchar(coauthors)]  # only keep entries that aren't blank

## Add self-loops for single-author entries
adjlist <- strsplit(coauthors, '\\s*,\\s*')
lens <- lengths(adjlist)
adjlist[lens==1L] <- lapply(adjlist[lens==1L], rep, times=2)  # repeat single author entries

Then continue as before.

【Comments】:
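For reference, here is a rough end-to-end run of the steps above on a small made-up author vector (an assumption standing in for pub$author, so it runs without the scholar package). The graph step is only attempted if igraph is installed.

```r
# Invented author strings: one multi-author paper, one single-author paper,
# one blank entry, and one entry with a trailing "..."
authors <- c("A Smith, B Jones, C Lee", "B Jones", "", "C Lee, A Smith...")

# Lower-case, strip trailing spaces/commas/dots, drop blank entries
coauthors <- sub("[ ,.]+$", "", tolower(authors))
coauthors <- coauthors[nzchar(coauthors)]

# Split on commas; duplicate single authors so they become self-loops
adjlist <- strsplit(coauthors, "\\s*,\\s*")
lens <- lengths(adjlist)
adjlist[lens == 1L] <- lapply(adjlist[lens == 1L], rep, times = 2)

# Pair every non-first author with the first author of the same paper
edgelist <- cbind(
  unlist(lapply(adjlist, tail, -1L)),
  rep(sapply(adjlist, `[`, 1L), times = lengths(adjlist) - 1L)
)
print(edgelist)

# Build the graph only when igraph is available
if (requireNamespace("igraph", quietly = TRUE)) {
  g <- igraph::graph_from_edgelist(edgelist, directed = FALSE)
  print(igraph::which_loop(g))  # flags the self-loop edge(s)
}
```

Note that this construction only links the first author of each paper to the other authors on it; it does not add edges between the remaining co-authors.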

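Thanks for your answer, but the same problem happens again. It works for a small number of publications (like the Google Scholar profile you used), but if you use either of these two profiles with many publications (# prof snijders: scurl "scholar.google.com/citations?user=xqefLxQAAAAJ&hl=en"; # prof Welman, the Google Scholar link given as the example that produces the error above: scurl "scholar.google.com/citations?user=_q2NODAAAAAJ&hl=en"), the error occurs again: Error in rep(sapply(adjlist, `[`, 1L), times = lengths(adjlist) - 1L) : invalid 'times' argument
Also, is it possible not to drop single-author papers, but to duplicate the author in both columns so that the graph has a loop? That way they could be compared with the number of multi-author papers and give some insight... By the way, I will publish this as a Shiny app soon, but I cannot install the dplyr and scholar packages on my Amazon instance because of its small RAM (1 GB).
I forgot to mention your name in the two comments above. I would appreciate hearing your thoughts on how to solve it @slickrickulicious
Thanks for the update, it now works like a charm, but there is a small problem with the way you extract the person's Google Scholar ID from the URL: you should remove the part after the & (including the & itself). I am using this line to do that: citid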
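The commenter's own line is cut off in the source, so it is not reproduced here. Purely as an illustration, and not the commenter's code, one way to pull the ID out of such a URL in base R might look like the sketch below. The helper name extract_scholar_id is hypothetical. It takes everything after "user=" up to the next "&", so it also handles IDs containing characters such as "_".

```r
# Hypothetical helper (not from the original thread): return the value of
# the "user" query parameter, or NA if the URL has none.
extract_scholar_id <- function(url) {
  m <- regmatches(url, regexpr("(?<=user=)[^&]+", url, perl = TRUE))
  if (length(m) == 0L) NA_character_ else m
}

citid <- extract_scholar_id(
  "https://scholar.google.com/citations?user=_q2NODAAAAAJ&hl=en"
)
print(citid)
```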
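Using the same data as in this question (which I asked earlier), I have a problem with author names whose initials differ. That is, some authors in Google Scholar have more than one spelling for the same person, e.g. "sm iacus" and "s iacus", who are the same person. Any idea how I can detect that they are the same and replace one with the other (perhaps with the most common spelling)? It leaves the resulting network with 2 nodes for what is effectively one person :(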