【Question title】: How can I count how many items have been in one session together?
【Posted】: 2021-08-21 18:11:00
【Question】:

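I searched Stack Overflow as thoroughly as I could for a solution, but unfortunately I could not find a matching question, so I am asking one myself.

I am working with a dataset that contains a sessionID and a topic. Imagine it looks like this: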

sessionID <- c(1, 2, 2, 3, 4, 4, 5, 6, 6, 6)
topic <- c("rock", "house", "country", "rock", "r'n'b", "pop", "classic", "house", "rock", "country")
transactions <- cbind(sessionID, topic)
transactions

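Now I would like to know how often items of one topic appear together with items of another topic in the same session. In the end, I want a matrix that gives, for each topic, the number of sessions it shares with every other topic. The final result should look like this: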

topics <- sort(unique(topic))
topicPairs <- matrix(NA, nrow = length(topics), ncol = length(topics))
colnames(topicPairs) <- topics
rownames(topicPairs) <- topics
topicPairs["house", "country"] <- 2
topicPairs["country", "house"] <- 2
topicPairs["r'n'b", "pop"] <- 1
topicPairs["pop", "r'n'b"] <- 1
topicPairs["rock", "house"] <- 1
topicPairs["house", "rock"] <- 1
topicPairs["rock", "country"] <- 1
topicPairs["country", "rock"] <- 1
topicPairs["house", "house"] <- 2
topicPairs

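For example, in row "house", the column "country" should equal 2, because "house" appears together with "country" in sessions 2 and 6.

On the main diagonal, I would like the number of sessions in which a topic appears. Here, row "house", column "house" equals 2, because it occurs in two sessions... but I am not sure about this.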
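To make the target definition concrete, here is a minimal base R sketch that checks two individual cells of the desired matrix by hand. The helper `sessions_with` is a hypothetical name introduced only for this illustration; it is not part of the question's code.

```r
sessionID <- c(1, 2, 2, 3, 4, 4, 5, 6, 6, 6)
topic <- c("rock", "house", "country", "rock", "r'n'b", "pop",
           "classic", "house", "rock", "country")

# Hypothetical helper: the distinct sessions in which a topic appears.
sessions_with <- function(t) unique(sessionID[topic == t])

# Off-diagonal cell: sessions that contain both "house" and "country".
house_country <- length(intersect(sessions_with("house"),
                                  sessions_with("country")))

# Diagonal cell: sessions that contain "house" at all.
house_house <- length(sessions_with("house"))

house_country  # 2 (sessions 2 and 6)
house_house    # 2 (sessions 2 and 6)
```

Checking cells one at a time like this does not scale to a large dataset, but it pins down what each entry of the matrix is supposed to mean.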

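It would be great if your solution contained no loops, because my dataset is very large. I would therefore prefer functions from the tidyverse (dplyr, tidyr, etc.), perhaps a combination of group_by and the spread function from the tidyr package.

I am really looking forward to your answers. Thank you very much!

Kind regards!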

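【Comments】:

Try something like: crossprod(table(as.data.frame(transactions)))?
Hey Ben! Works perfectly! Thank you very much for the quick answer! :)

Tags: r dplyr tidyverse tidyr data-wrangling

【Solution 1】:

If you don't mind performing a self-join of transactions via the dplyr package, the following should work: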

library(dplyr)
library(tibble)
library(tidyr)

# ...
# Your existing code that created `transactions`.
# ...

# Convert transactions to a dataframe for transformation.
transactions <- as.data.frame(transactions)

result <- transactions %>%
  # Create pairings of topics by session.
  inner_join(transactions, by = "sessionID", suffix = c(".r", ".c")) %>%
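  # "Pivot" the pairings, such that each topic within `topic.c` gets its own
  # column; and then aggregate the pairings by count.
  # `id_cols` holds only `topic.r` (one output row per topic); `sessionID` is
  # already consumed by `values_from`, and recent tidyr versions reject a
  # column that is selected by both.
  pivot_wider(id_cols = topic.r,
              names_from = topic.c,
              values_from = sessionID,
              values_fn = length,
              names_sort = TRUE) %>%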
  # Sort appropriately, to align the main diagonal.
  arrange(topic.r) %>%
  # Convert to matrix form, with topics as row names.
  column_to_rownames(var = "topic.r") %>% as.matrix()

# View result.
result

Here is the printout of my result:

        classic country house pop r'n'b rock
classic       1      NA    NA  NA    NA   NA
country      NA       2     2  NA    NA    1
house        NA       2     2  NA    NA    1
pop          NA      NA    NA   1     1   NA
r'n'b        NA      NA    NA   1     1   NA
rock         NA       1     1  NA    NA    3
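The self-join idea can also be sketched in base R alone, which may help to see what the join produces before any pivoting. This is an illustrative alternative rather than the answer's own code: it swaps `inner_join` and `pivot_wider` for `merge` and `table`, and the names `pairs` and `counts` are assumptions introduced here.

```r
sessionID <- c(1, 2, 2, 3, 4, 4, 5, 6, 6, 6)
topic <- c("rock", "house", "country", "rock", "r'n'b", "pop",
           "classic", "house", "rock", "country")
transactions <- data.frame(sessionID, topic)

# Join the table to itself on sessionID: every topic in a session is paired
# with every topic in the same session (including itself).
pairs <- merge(transactions, transactions,
               by = "sessionID", suffixes = c(".r", ".c"))

# Cross-tabulate the pairs: rows are topic.r, columns are topic.c.
counts <- table(pairs$topic.r, pairs$topic.c)

counts["house", "country"]  # 2
counts["rock", "rock"]      # 3
```

Unlike the printout above, `table` fills pairs that never co-occur with 0 rather than NA. Note also that the join creates one row per ordered pair within a session, so its size grows with the square of the session length.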

Update

Ben's suggestion is more elegant (not to mention cleverer), and requires only the following:

# ...
# Your existing code that created `transactions`.
# ...

# Compute the results.
result <- crossprod(table(as.data.frame(transactions)))
# Substitute NAs for 0s, if you so desire.
result <- ifelse(result == 0, NA, result)

This achieves the same result. I cannot vouch for the relative performance of the two solutions on large datasets.
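For readers wondering why the one-liner works, the sketch below splits it into steps. `table` builds a session-by-topic matrix X whose entries count occurrences, and `crossprod(X)` computes t(X) %*% X, so entry (i, j) sums X[s, i] * X[s, j] over sessions s. When every entry is 0 or 1, that sum is the number of sessions containing both topics. The names `incidence` and `cooc` are assumptions introduced here, and the capping step is an extra guard for data in which a topic repeats within a session; the example data does not need it.

```r
sessionID <- c(1, 2, 2, 3, 4, 4, 5, 6, 6, 6)
topic <- c("rock", "house", "country", "rock", "r'n'b", "pop",
           "classic", "house", "rock", "country")

# Session-by-topic matrix: how often each topic occurs in each session.
incidence <- table(sessionID, topic)

# Guard (assumption about messier data): cap repeats at 1 so that each
# session contributes at most once to every cell.
incidence[incidence > 1] <- 1

# t(incidence) %*% incidence: topic-by-topic co-occurrence counts.
cooc <- crossprod(incidence)

cooc["house", "country"]  # 2
cooc["rock", "rock"]      # 3
```

The diagonal falls out of the same formula: X[s, i] * X[s, i] is 1 exactly when topic i occurs in session s, so cell (i, i) counts the sessions that contain the topic.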

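【Discussion】:

Hey Greg! Thank you very much for the quick answer! It works great :)
My pleasure, @RKF! If you want something more elegant, and you don't mind the twin topic labels that appear on the row and column names, then @Ben's solution is probably better. If you like, you can simply replace the 0s with NAs at the end (as shown in the update to my answer).
Also, thank you very much for the update. Your approach is clever too :) Thanks for your support, I really appreciate it!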