R: Update adjacency matrix/data frame using pairwise combinations

【Question title】: R: Update adjacency matrix/data frame using pairwise combinations
【Posted】: 2017-11-22 09:22:11
【Question description】:

Problem

Suppose I have this data frame:

# mock data set
df.size = 10
cluster.id<- sample(c(1:5), df.size, replace = TRUE)
letters <- sample(LETTERS[1:5], df.size, replace = TRUE)
test.set <- data.frame(cluster.id, letters)

which will look similar to:

     cluster.id letters
        <int>  <fctr>
 1          5       A
 2          4       B
 3          4       B
 4          3       A
 5          3       E
 6          3       D
 7          3       C
 8          2       A
 9          2       E
10          1       A

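Now I want to group the rows by cluster.id and see which letters occur within each cluster; e.g. cluster 3 contains the letters A, E, D, C. Then I want to get all unique pairwise combinations (but not combinations of a letter with itself, so no A,A for example): A,E; A,D; A,C; etc. Finally I want to update the pairwise distance (a co-occurrence count) for these combinations in an adjacency matrix/data frame.

Idea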

# group by cluster.id
# per group get all (unique) pairwise combinations for the letters (excluding pairwise combinations with itself, e.g. A,A)
# update adjacency for each pairwise combinations

What I tried

# empty adjacency df
possible <- LETTERS
adj.df <- data.frame(matrix(0, ncol = length(possible), nrow = length(possible)))
colnames(adj.df) <- rownames(adj.df) <- possible


# what I tried
update.adj <- function( data ) {
  for (comb in combn(data$letters,2)) {
    # stuck here
  }
}

test.set %>% group_by(cluster.id) %>% update.adj(.)

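There is probably an easy way to do this, since I see adjacency matrices all the time, but I cannot figure it out. Please let me know if anything is unclear.

Reply to comments
In reply to @Manuel Bickel: for the data I gave as an example (the table under "will look similar to"), the result would be the matrix below. Keep in mind that for the full data set this matrix would run from A to Z.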

  A B C D E
A 0 0 1 1 2
B 0 0 0 0 0
C 1 0 0 1 1
D 1 0 1 0 1
E 2 0 1 1 0

Let me explain what I did:

    cluster.id letters
        <int>  <fctr>
 1          5       A
 2          4       B
 3          4       B
 4          3       A
 5          3       E
 6          3       D
 7          3       C
 8          2       A
 9          2       E
10          1       A

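Only clusters that contain more than one unique letter are relevant (we do not want combinations of a letter with itself; e.g. cluster 4 contains only the letter B, which would give the combination B,B and is therefore not relevant):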

 4          3       A
 5          3       E
 6          3       D
 7          3       C
 8          2       A
 9          2       E

Now, for each cluster, I look at which pairwise combinations I can make:

Cluster 3:

A,E
A,D
A,C
E,D
E,C
D,C

Update these combinations in the adjacency matrix:

  A B C D E
A 0 0 1 1 1
B 0 0 0 0 0
C 1 0 0 1 1
D 1 0 1 0 1
E 1 0 1 1 0

Then move on to the next cluster.

Cluster 2:

A,E

Update the adjacency matrix again:

  A B C D E
A 0 0 1 1 2 <-- note the 2 now
B 0 0 0 0 0
C 1 0 0 1 1
D 1 0 1 0 1
E 2 0 1 1 0
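
The cluster-by-cluster walkthrough above can be sketched directly in base R. This is only a minimal sketch, not the method the answer below ends up using: the helper name `update_adj` is hypothetical, the example data are typed in by hand, and the loop is likely too slow for very large inputs.

```r
# Example data from the question, entered by hand
test.set <- data.frame(
  cluster.id = c(5, 4, 4, 3, 3, 3, 3, 2, 2, 1),
  letters    = c("A", "B", "B", "A", "E", "D", "C", "A", "E", "A"),
  stringsAsFactors = FALSE
)

possible <- LETTERS[1:5]   # use LETTERS for the full A-Z case
adj <- matrix(0, nrow = length(possible), ncol = length(possible),
              dimnames = list(possible, possible))

# Hypothetical helper: add one count for every unordered pair of
# distinct letters found in a single cluster
update_adj <- function(adj, lets) {
  lets <- unique(lets)
  if (length(lets) < 2) return(adj)   # nothing to pair up
  pairs <- combn(lets, 2)             # 2 x n matrix, one pair per column
  for (k in seq_len(ncol(pairs))) {
    a <- pairs[1, k]
    b <- pairs[2, k]
    adj[a, b] <- adj[a, b] + 1
    adj[b, a] <- adj[b, a] + 1        # keep the matrix symmetric
  }
  adj
}

# Process the letters of each cluster in turn
for (lets in split(test.set$letters, test.set$cluster.id)) {
  adj <- update_adj(adj, lets)
}
adj
```

Note that `combn()` returns a matrix with one combination per column, which is why the loop runs over column indices rather than over the result itself.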

Follow-up regarding the huge data set

library(reshape2)

test.set <- read.table(text = "
                            cluster.id   letters
                       1          5       A
                       2          4       B
                       3          4       B
                       4          3       A
                       5          3       E
                       6          3       D
                       7          3       C
                       8          2       A
                       9          2       E
                       10          1       A", header = T, stringsAsFactors = F)

x1 <- reshape2::dcast(test.set, cluster.id ~ letters)

x1
#cluster.id A B C D E
#1          1 1 0 0 0 0
#2          2 1 0 0 0 1
#3          3 1 0 1 1 1
#4          4 0 2 0 0 0
#5          5 1 0 0 0 0

x2 <- table(test.set)

x2
#          letters
#cluster.id A B C D E
#         1 1 0 0 0 0
#         2 1 0 0 0 1
#         3 1 0 1 1 1
#         4 0 2 0 0 0
#         5 1 0 0 0 0


x1.c <- crossprod(x1)
#Error in crossprod(x, y) : 
#  requires numeric/complex matrix/vector arguments

x2.c <- crossprod(x2)
#works fine
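
The error reported above suggests that `crossprod()` wants a numeric matrix, whereas `dcast()` returns a data frame whose first column is `cluster.id`. A minimal sketch of one workaround, assuming the wide counts printed for `x1` (typed in by hand here so the snippet does not depend on reshape2):

```r
# Wide counts as printed for x1 in the question (entered by hand)
x1 <- data.frame(
  cluster.id = 1:5,
  A = c(1, 1, 1, 0, 1),
  B = c(0, 0, 0, 2, 0),
  C = c(0, 0, 1, 0, 0),
  D = c(0, 0, 1, 0, 0),
  E = c(0, 1, 1, 0, 0)
)

m <- as.matrix(x1[, -1])   # drop cluster.id, keep only the letter counts
x1.c <- crossprod(m)       # letters x letters co-occurrence counts
diag(x1.c) <- 0            # ignore a letter paired with itself
x1.c
```

Dropping the identifier column matters for the result as well as for the error: if it stayed in, the cluster numbers would be treated as counts.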

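【Comments on the question】:

I do not fully understand what your expected output should look like. Could you give an example? Thanks.
It is adj.df filled with counts of how often each combination is found within a cluster. Does that make sense? @ManuelBickel
I get the part about the combinations within a single cluster, but I do not fully understand what the output of update.adj should be. Could you provide a short example output (it can be small, e.g. 2x2 or so)?
@ManuelBickel I have updated my question; I hope it is clear now. If not, please let me know.
Thanks for the update, I think it is more or less clear now. Depending on my schedule I will have a look later or tomorrow...

Tags: r combinations adjacency-matrix

【Solution 1】:

Following the comments above, here is Tyler Rinker's code applied to your data. I hope this is what you want.

Update: following the comments below, I have added a solution using the package reshape2 so that large amounts of data can be handled.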

test.set <- read.table(text = "
                            cluster.id   letters
                       1          5       A
                       2          4       B
                       3          4       B
                       4          3       A
                       5          3       E
                       6          3       D
                       7          3       C
                       8          2       A
                       9          2       E
                       10          1       A", header = TRUE, stringsAsFactors = FALSE)

x <- table(test.set)
x
#          letters
#cluster.id A B C D E
#         1 1 0 0 0 0
#         2 1 0 0 0 1
#         3 1 0 1 1 1
#         4 0 2 0 0 0
#         5 1 0 0 0 0

# base approach, based on an answer by Tyler Rinker
x <- crossprod(x)
diag(x) <- 0  # set self-matches such as A,A and B,B to zero
x
#        letters
# letters A B C D E
#       A 0 0 1 1 2
#       B 0 0 0 0 0
#       C 1 0 0 1 1
#       D 1 0 1 0 1
#       E 2 0 1 1 0

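# reshape2 approach
library(reshape2)
# count the rows per cluster/letter cell explicitly instead of relying on the default
x <- acast(test.set, cluster.id ~ letters, fun.aggregate = length, value.var = "letters")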
x <- crossprod(x)
diag(x) <- 0
x
#   A B C D E
# A 0 0 1 1 2
# B 0 0 0 0 0
# C 1 0 0 1 1
# D 1 0 1 0 1
# E 2 0 1 1 0
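
To see why the cross product yields the co-occurrence counts, it may help to check that `crossprod(x)` equals `t(x) %*% x` for the clusters-by-letters table. The sketch below also covers one caveat that the example data do not trigger: if a cluster lists the same letter more than once alongside other letters, raw counts are multiplied together, so a pair can be counted more than once per cluster. The extra cluster 6 is a hypothetical addition made only to show this, and reducing the table to 0/1 is one possible way (an assumption, not part of the answer) to count each pair once per cluster.

```r
# Example data from the question plus a hypothetical cluster 6 (A, A, E)
test.set <- data.frame(
  cluster.id = c(5, 4, 4, 3, 3, 3, 3, 2, 2, 1, 6, 6, 6),
  letters    = c("A", "B", "B", "A", "E", "D", "C", "A", "E", "A", "A", "A", "E"),
  stringsAsFactors = FALSE
)

x <- unclass(table(test.set))   # plain integer matrix: clusters x letters

# crossprod(x) is the same computation as t(x) %*% x
same <- all(crossprod(x) == t(x) %*% x)

# Raw counts: the pair A,E in cluster 6 contributes 2 * 1 = 2
adj_counts <- crossprod(x)
diag(adj_counts) <- 0

# Presence/absence: each pair contributes at most 1 per cluster
b <- (x > 0) * 1
adj_binary <- crossprod(b)
diag(adj_binary) <- 0

adj_counts["A", "E"]
adj_binary["A", "E"]
```

Which of the two variants is wanted depends on whether repeated letters within a cluster should weigh more; the question's wording ("unique pairwise combinations") points towards the 0/1 version.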

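【Discussion】:

Thanks. How does this check whether the letters are in the same cluster?
I have added the output of the table() call to my answer. It gives you the count of each letter in each cluster. The cross product then, so to speak, combines the counts for all possible pairs of letters, which gives the adjacency counts you are looking for (another way to write the cross product of a matrix m is t(m) %*% m). Does this help?
It looks so simple, haha. The only problem is that I get the error "Error in table(labels): attempt to make a table with >= 2^31 elements", because my original data set is huge ;( @Manuel Bickel
I have about 2 million rows in the format of the example I gave (I probably should have mentioned that).
Thank you, it works fine now. P.S. you can change dcast to acast so that it returns a matrix directly; then "as.matrix(x[, -1])" can simply be replaced with x.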
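
For data sets of the size mentioned in this thread, a sparse clusters-by-letters matrix is another option worth trying. The following is a rough sketch that assumes the Matrix package (normally shipped with R) is available; it was not part of the original answer and has not been tried on the 2 million row data, so memory use there is an open question. Duplicate rows are dropped first, on the assumption that each letter should count once per cluster.

```r
library(Matrix)  # recommended package, normally installed alongside R

# Example data from the question, entered by hand
test.set <- data.frame(
  cluster.id = c(5, 4, 4, 3, 3, 3, 3, 2, 2, 1),
  letters    = c("A", "B", "B", "A", "E", "D", "C", "A", "E", "A"),
  stringsAsFactors = FALSE
)

pairs <- unique(test.set)          # one row per (cluster, letter)
rows  <- factor(pairs$cluster.id)
cols  <- factor(pairs$letters)

# Sparse 0/1 incidence matrix: clusters in rows, letters in columns
m <- sparseMatrix(i = as.integer(rows), j = as.integer(cols), x = 1,
                  dims = c(nlevels(rows), nlevels(cols)),
                  dimnames = list(levels(rows), levels(cols)))

# The result is only letters x letters, so a dense copy is cheap
adj <- as.matrix(crossprod(m))
diag(adj) <- 0
adj
```

Only the non-zero cluster/letter cells are stored, so the full clusters-by-letters table that triggered the error above is never built.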