【Question title】: R summaries/count combinations in a data frame and display as new data frame/count matrix for column
【Posted】: 2019-07-14 14:04:32
【Question】:

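I have a large data set and need to compare all combinations between columns. The desired output is one matrix for each combination of columns.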

The starting data frame might look like Data:

set.seed(1)
Data <- data.frame(
  ID = (1:100),
  A = sample(1:10,10),
  B = sample(1:20,100,replace = T),
  C = sample(1:5,100,replace = T),
  D = sample(1:20,100,replace = T)
  )     
Data

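I would like to know how often the same combination occurs in two columns (e.g., how often does 1 in A occur together with 4 in B?), for all combinations of columns A to D.

I am using: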

library(plyr)   # provides ddply() and .()
library(dplyr)  # provides transmute(); loaded after plyr

X1 <- ddply(Data,.(A,B),transmute, count=length(ID))

to get an object like this:

     A  B count
1    1  3     1
2    1  7     1
3    1  9     2
4    1  9     2
5    1 12     1
6    1 13     1
7    1 14     1
8    1 16     1
9    1 18     1
10   1 20     1
11   2  2     1
12   2  6     1
13   2 10     1
14   2 11     1

But how can I get the count results in matrix format?

The output for column A versus column B would look like this:

    B1  B2  B3  B4  B5  B6
A1  1   1   2   1   1   ...
A2  1   1   2   1   1   
A3  2   1   1   1   1   
A4  2   1   1   1   1   
A5  1   1   2   1   2   
A6  1   1   2   1   2   
A7  1   3   1   1   1   
A8  1   3   1   1   2   
A9  1   3   2   1   2   
A10 1   1   2   1   1


Ideally, the result would be a `list` containing the objects `AB`, `AC`, ... `CD`, each as a matrix.

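【Comments】:

What is this transmute function? Did you write it yourself, or is it from a package? Could you add the packages you load to your question?
transmute comes from the dplyr package.
ddply comes from the plyr library.

Tags: r dataframe dplyr summary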

【Solution 1】:

You can do it like this:

library(tidyverse)
X2 <-X1 %>% group_by(A,B) %>% 
  summarise(count=max(count)) %>% #use max instead of sum
  ungroup() %>%
  mutate(A=paste0("A",A),B=paste0("B",B)) %>% 
  spread(B,count,fill=0)

X3 <- as.matrix(X2[,2:ncol(X2)])
rownames(X3) <- as.character(X2$A)

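【Discussion】:

Thanks, this looks like a good fit for the example; I just need to get it running on my data.
I have added the last two lines to convert the result into the matrix you asked for in the question.
Thanks for adding the matrix. I am just having trouble with my own data, because I get impossible counts (higher than the number of IDs in the data set). Could it be that summarise(count=sum(count)) takes the sum over all identical combinations, when I only need the count itself?
For example, in X1 the combination A1B9 has count 2 but is also listed twice, so in X2 it becomes A1B9 = 4. Do we need something like unique so that it is counted only once in X2?
I think the problem arises earlier, in X1: your rows 3 and 4, for example, are duplicates. Try max instead of sum when summarising; that should work.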
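
The double counting discussed in these comments can be reproduced on a toy data set. The following is a minimal base R sketch, not code from the thread: the `toy` data frame and its values are made up for illustration, and `ave()` is only used to mimic the shape of `X1` (one row per original observation, each carrying its group size).

```r
# Toy data (hypothetical): the combination A = 1, B = 9 occurs twice
toy <- data.frame(ID = 1:3, A = c(1, 1, 2), B = c(9, 9, 5))

# Mimic X1: keep every original row and attach the size of its (A, B) group
toy$count <- ave(toy$ID, toy$A, toy$B, FUN = length)

# Summing the per-row counts inflates duplicated combinations (2 + 2)
by_sum <- aggregate(count ~ A + B, toy, FUN = sum)

# Taking the maximum keeps the group size as it is
by_max <- aggregate(count ~ A + B, toy, FUN = max)

by_sum
by_max
```

Because every duplicated row already carries the full group size, any reducer that returns one of the existing values (max, min, or the first element) recovers the frequency, whereas sum multiplies it by the number of duplicates.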

【Solution 2】:

Consider base R, using aggregate and reshape:

agg <- aggregate(cbind(count=ID) ~ B + A, Data, FUN=length)

rdf <- reshape(agg, timevar = "B", idvar = "A",
               drop = c("ID", "C", "D"),
               direction = "wide")

# CLEAN-UP
rdf <- with(rdf, rdf[order(A), c("A", paste0("count.", 1:20))])  # RE-ORDER ROWS AND COLS
rownames(rdf) <- NULL                                            # RESET ROW NAMES
colnames(rdf) <- gsub("count.", "B", names(rdf))                 # RENAME COL NAMES
rdf[is.na(rdf)] <- 0                                             # CONVERT NAs TO 0

rdf

#     A B1 B2 B3 B4 B5 B6 B7 B8 B9 B10 B11 B12 B13 B14 B15 B16 B17 B18 B19 B20
# 1   1  1  1  0  1  0  1  0  0  0   1   0   1   0   1   0   0   0   0   2   1
# 2   2  0  0  0  0  0  0  2  0  1   3   0   0   2   0   1   0   0   0   1   0
# 3   3  1  1  0  0  0  1  0  0  1   0   0   0   0   2   1   1   1   0   0   1
# 4   4  1  0  0  0  1  2  0  0  0   2   1   1   0   1   0   1   0   0   0   0
# 5   5  0  0  0  0  1  2  1  3  0   0   0   1   0   0   0   0   0   0   1   1
# 6   6  1  0  2  0  1  2  1  0  0   0   0   0   1   0   0   0   0   0   0   2
# 7   7  1  0  0  0  1  1  1  0  0   2   1   0   0   0   0   2   1   0   0   0
# 8   8  0  0  2  0  0  0  1  2  1   0   2   0   0   0   0   0   0   1   1   0
# 9   9  1  0  0  0  0  1  1  0  0   0   0   1   1   2   1   0   0   1   1   0
# 10 10  1  1  0  0  0  0  0  0  1   1   0   2   0   1   1   0   1   0   1   0
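
The question also asks for a list holding one count matrix per column pair (AB, AC, ... CD), which the code above does not produce. One possible approach, sketched here in base R and not taken from either answer, is to enumerate the column pairs with combn() and cross-tabulate each pair with table(). The helper name `pair_counts` is hypothetical, and the data frame is rebuilt from the question so the block runs on its own.

```r
# Rebuild the example data from the question
set.seed(1)
Data <- data.frame(
  ID = 1:100,
  A = sample(1:10, 10),
  B = sample(1:20, 100, replace = TRUE),
  C = sample(1:5, 100, replace = TRUE),
  D = sample(1:20, 100, replace = TRUE)
)

# Hypothetical helper: one count matrix for every pair of the given columns
pair_counts <- function(df, cols) {
  pairs <- combn(cols, 2, simplify = FALSE)
  out <- lapply(pairs, function(p) {
    tab <- table(df[[p[1]]], df[[p[2]]])
    # Convert the table to a plain matrix with labels such as A1, B4
    matrix(tab, nrow = nrow(tab),
           dimnames = list(paste0(p[1], rownames(tab)),
                           paste0(p[2], colnames(tab))))
  })
  # Name each element after its column pair, e.g. "AB"
  names(out) <- vapply(pairs, paste0, character(1), collapse = "")
  out
}

res <- pair_counts(Data, c("A", "B", "C", "D"))
names(res)
res$AB
```

Note that table() only creates rows and columns for values that actually occur, so a value missing from a column yields a narrower matrix; converting the columns to factors with fixed levels beforehand would keep the dimensions constant.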

【Discussion】: