Find all pairs in a dataset in R: answers

【Question title】: Find all pairs in dataset R
【Posted】: 2021-08-18 14:58:57
【Question】:

I have a dataset with 3 columns like this.

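id_evt = c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3)
id_participant = c(1,2,3,4,5,1,3,5,6,8,2,3,4,9,10)
# sex values must be quoted strings, otherwise R looks for objects named W and M
sex = c("W", "M", "W", "M", "W", "W", "W", "W", "M", "M", "M", "W", "M", "W", "M")

# note: cbind() builds a character matrix first, so the ids are no longer numeric in df
df <- data.frame(cbind(id_evt, id_participant, sex))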




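id_evt = id of a specific event
id_participant = id of one participant
sex = sex of the participant

I want to find all man/woman pairs who took part in the same event.

Here is what I tried. It works, but I would like to get the list of all events for each pair.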

library(dplyr)

# create one data set for females
females <- df %>%
 filter(sex == "W") %>%
 select(f_id = id_participant, f_group = id_evt)

# create one data set for males
males <- df %>%
 filter(sex == "M") %>%
 select(m_id = id_participant, m_group = id_evt)

# All possible pairings of males and females
pairs <- expand.grid(f_id = females %>% pull(f_id),
                    m_id = males %>% pull(m_id),
                    stringsAsFactors = FALSE) 

# Merge in information about each individual
pairs <- pairs %>%
 left_join(females, by = "f_id") %>%
 left_join(males, by = "m_id") %>%
 # eliminate any pairings that are in different groups
 filter(f_group == m_group)

Many thanks,

【Comments】:

Tags: r pairing

【Solution 1】:

Something like this, perhaps?

library(data.table)
ans <- lapply( split(setDT(df), by = "id_evt"), function(x) {
  CJ(M = x[sex == "M", id_participant], W = x[sex == "W", id_participant])
})

# $`1`
#    M W
# 1: 2 1
# 2: 2 3
# 3: 2 5
# 4: 4 1
# 5: 4 3
# 6: 4 5
# 
# $`2`
#    M W
# 1: 6 1
# 2: 6 3
# 3: 6 5
# 4: 8 1
# 5: 8 3
# 6: 8 5
# 
# $`3`
#     M W
# 1: 10 3
# 2: 10 9
# 3:  2 3
# 4:  2 9
# 5:  4 3
# 6:  4 9

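That is your base information. If you want to know how often each pair occurs (and at which events), you can do something like this:

# How often does the same pair occur?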
rbindlist(ans, idcol = "id_evt")[, .(.N, events = paste0(id_evt, collapse = ";")), by = .(M, W)]
#     M W N events
# 1:  2 1 1      1
# 2:  2 3 2    1;3
# 3:  2 5 1      1
# 4:  4 1 1      1
# 5:  4 3 2    1;3
# 6:  4 5 1      1
# 7:  6 1 1      2
# 8:  6 3 1      2
# 9:  6 5 1      2
#10:  8 1 1      2
#11:  8 3 1      2
#12:  8 5 1      2
#13: 10 3 1      3
#14: 10 9 1      3
#15:  2 9 1      3
#16:  4 9 1      3

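【Comments】:

Hi, thank you, this is exactly the output I was looking for. However, I have a data frame with almost 4 million observations, and if I try to do everything in one go I get an error because of the size. Do you know how I could get around this, given that I only want to keep pairs that appear 2 times? Thanks
R / data.table should not have much trouble with (only) a few million rows of data. Are you sure the problem is not somewhere else in your data?
Honestly, that is possible, since I do not use data.table much. I can get the first part of your code to work, but when I run the second part I get this vector size error. If you tell me it should not be a size problem, I will look for where I went wrong.
I just noticed that you need to assign the output of lapply(...) to ans... see the edit.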
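
For the follow-up about keeping only pairs that occur at least twice, one possible variant is to replace the per-event list with a single join on id_evt and then filter on the count. This is only a sketch on the sample data from the question, with the ids kept numeric (an assumption; the names dt, men, women, pairs and repeated are made up for this example). The join still creates every within-event pair, so whether it fits in memory depends on how many participants each event has.

```r
library(data.table)

# Sample data from the question (ids kept numeric here, which is an assumption)
dt <- data.table(
  id_evt = rep(1:3, each = 5),
  id_participant = c(1, 2, 3, 4, 5, 1, 3, 5, 6, 8, 2, 3, 4, 9, 10),
  sex = c("W", "M", "W", "M", "W", "W", "W", "W", "M", "M", "M", "W", "M", "W", "M")
)

# Split by sex, then join on the event id: each row of the result is one
# man/woman pair that attended the same event.
men   <- dt[sex == "M", .(id_evt, M = id_participant)]
women <- dt[sex == "W", .(id_evt, W = id_participant)]
pairs <- merge(men, women, by = "id_evt", allow.cartesian = TRUE)

# Count the events per pair and keep only pairs seen at least twice
repeated <- pairs[, .(N = .N, events = paste(id_evt, collapse = ";")),
                  by = .(M, W)][N >= 2]
print(repeated)
```

On the sample data this keeps the two pairs that Solution 1 reports with N = 2 (man 2 with woman 3, and man 4 with woman 3).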

【Solution 2】:

Maybe you can try this:

library(dplyr)

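df %>%
  group_by(id_evt) %>%
  # outer() builds every "M-W" label within one event; c() flattens the matrix.
  # Note: in dplyr >= 1.1.0, summarise() returning several rows per group is
  # deprecated in favour of reframe().
  summarise(pair = c(outer(sort(id_participant[sex == 'M']), 
                      sort(id_participant[sex == 'W']), paste, sep = '-'))) %>%
  ungroup %>%
  # count how many events each pair shares, most frequent pairs first
  count(pair, sort = TRUE, name = 'number_of_events')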

#   pair  number_of_events
#   <chr>            <int>
# 1 2-3                  2
# 2 4-3                  2
# 3 10-3                 1
# 4 10-9                 1
# 5 2-1                  1
# 6 2-5                  1
# 7 2-9                  1
# 8 4-1                  1
# 9 4-5                  1
#10 4-9                  1
#11 6-1                  1
#12 6-3                  1
#13 6-5                  1
#14 8-1                  1
#15 8-3                  1
#16 8-5                  1

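【Comments】:

Hi, thank you, it works very well. However, because I have a large amount of data (4 million rows), I get a vector size error. Do you know how I could work around this, given that I only want to keep pairs that appear at least 2 times?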
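
One idea that may help with the size problem when only repeated pairs matter: a pair can only show up in two events if both of its members attended at least two events, so participants seen only once can be dropped before any pairs are built. The following base R sketch illustrates this on the sample data (ids kept numeric, which is an assumption; the names n_events, frequent, men, women, pairs, counts and result are invented for the example). It is not taken from either answer, and how much memory it saves depends on the real data.

```r
# Sample data from the question (ids kept numeric here, which is an assumption)
df <- data.frame(
  id_evt = rep(1:3, each = 5),
  id_participant = c(1, 2, 3, 4, 5, 1, 3, 5, 6, 8, 2, 3, 4, 9, 10),
  sex = c("W", "M", "W", "M", "W", "W", "W", "W", "M", "M", "M", "W", "M", "W", "M")
)

# Number of distinct events per participant
n_events <- tapply(df$id_evt, df$id_participant, function(e) length(unique(e)))

# Keep only participants who attended at least two events
keep <- names(n_events)[n_events >= 2]
frequent <- df[as.character(df$id_participant) %in% keep, ]

# Build man/woman pairs within each event from the reduced data
men   <- frequent[frequent$sex == "M", c("id_evt", "id_participant")]
women <- frequent[frequent$sex == "W", c("id_evt", "id_participant")]
names(men)[2]   <- "M"
names(women)[2] <- "W"
pairs <- merge(men, women, by = "id_evt")

# Count the events per pair and keep pairs that occur at least twice
counts <- aggregate(id_evt ~ M + W, data = pairs, FUN = length)
names(counts)[3] <- "number_of_events"
result <- counts[counts$number_of_events >= 2, ]
print(result)
```

On the sample data the pre-filter removes participants 6, 8, 9 and 10, and the result contains the same two repeated pairs as the answer above (2-3 and 4-3).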