Creating a particular cross-tab in R? (Answers)

【Question title】: Creating a particular cross-tab in R?
【Posted】: 2016-11-26 21:40:07
【Question description】:

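I have a data frame containing information on financial contributions to political candidates (denoted "cand" in the data) and political organizations (denoted "comm" in the data). The data frame also includes a unique ID for each contributor, and each row represents one contribution. What I want is a cross-tab showing, for each political (non-candidate) organization, how many of its donors also contributed to each political candidate in the data frame. The data frame looks like this: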

 contributor ID        organization
 1                     cand1
 2                     cand2
 3                     comm1
 3                     cand1
 4                     cand1
 5                     cand2
 5                     cand1
 5                     comm2

What I would like to be able to create is something like this:

  Comm                Cand
               Cand1       Cand2
  Comm1        1           0
  Comm2        1           1

(Because 1 person, ID #3, contributed to both comm1 and cand1, while 1 person, ID #5, contributed to comm2, cand1 and cand2.)

I have thought about ways to do this with aggregate or dplyr, but I'm not sure how. Does anyone have any suggestions?

【Question discussion】:

If memory allows, you could start with crossprod(table(dat)), as here, and subset accordingly, like crossprod(table(dat))[startsWith(levels(dat$org), "comm"), startsWith(levels(dat$org), "cand")]
Thanks. This code gives the following error: Error in table(dat) : attempt to make a table with >= 2^31 elements. Do you have any suggestions?
Along the same lines, you could try a sparse alternative: library(Matrix); tab = xtabs( ~ contributorID + organization, dat, sparse = TRUE); crossprod(tab[, startsWith(colnames(tab), "comm")], tab[, startsWith(colnames(tab), "cand")])
Has the problem been solved? If so, could you accept one of the answers?
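The sparse suggestion in the comments can be written out as a small runnable sketch. The data frame `dat` below is made-up illustration data (the question's eight rows plus one repeated contribution), and the names `is_comm`, `is_cand` and `shared` are introduced here only for readability. The `unique()` step is an added assumption: since each row is one contribution, a donor may appear several times for the same organization, and the cross product would then count that donor more than once.

```r
library(Matrix)  # sparse matrix classes used by xtabs(sparse = TRUE)

# Illustration data: the question's rows plus a repeated (5, comm2) contribution
dat <- data.frame(
  contributorID = c(1, 2, 3, 3, 4, 5, 5, 5, 5),
  organization  = c("cand1", "cand2", "comm1", "cand1", "cand1",
                    "cand2", "cand1", "comm2", "comm2"),
  stringsAsFactors = FALSE
)

# Keep one row per (contributor, organization) pair so every cell is 0 or 1
dat <- unique(dat)

# Sparse contributor-by-organization incidence matrix
tab <- xtabs(~ contributorID + organization, data = dat, sparse = TRUE)

is_comm <- startsWith(colnames(tab), "comm")
is_cand <- startsWith(colnames(tab), "cand")

# t(comm columns) %*% (cand columns): number of shared contributors per pair
shared <- crossprod(tab[, is_comm, drop = FALSE], tab[, is_cand, drop = FALSE])
shared
```

Only the non-zero cells of `tab` are stored, so this avoids allocating the dense contributor-by-organization table that triggered the 2^31 error.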

Tags: r crosstab summary

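【Solution 1】:

You need to use something like tidyr. You need to create one variable for each committee and one for each candidate. Your data is already in long format, but what you need to do now is create a wide data frame, using organization and donor ID as the unique identifier. Then you can do the cross-tab.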
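As a rough illustration of the wide-format idea, here is a minimal sketch in base R rather than tidyr (a stand-in, not the answerer's code). It builds one 0/1 column per organization and then counts, for each committee and candidate, the contributors who have a 1 in both columns. The names `wide`, `comm_cols`, `cand_cols` and `crosstab` are made up for this example, and the approach is dense, so it is only suitable for small data.

```r
dfs <- data.frame(
  contributor  = c(1, 2, 3, 3, 4, 5, 5, 5),
  organization = c("cand1", "cand2", "comm1", "cand1",
                   "cand1", "cand2", "cand1", "comm2"),
  stringsAsFactors = FALSE
)

# Wide format: one row per contributor, one 0/1 column per organization.
# unique() drops repeated contributions to the same organization.
wide <- as.data.frame.matrix(table(unique(dfs)))

comm_cols <- grep("^comm", names(wide), value = TRUE)
cand_cols <- grep("^cand", names(wide), value = TRUE)

# For each candidate column, count contributors who also gave to each committee
crosstab <- sapply(wide[cand_cols], function(cand) colSums(wide[comm_cols] * cand))
crosstab
```

The result is a matrix with committees as rows and candidates as columns.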

【Discussion】:

【Solution 2】:

dfs = read.table(text = "contributor organization
1 cand1
2 cand2
3 comm1
3 cand1
4 cand1
5 cand2
5 cand1
5 comm2", sep = " ", stringsAsFactors = FALSE, header = TRUE)

# select only comms with their contributor
comms = dfs[grep("^comm", dfs$organization), ]
colnames(comms)[2] = "comms"
# select only cands
cands = dfs[grep("^cand", dfs$organization), ]
colnames(cands)[2] = "cands"

# combine comms and candidates
new_dfs = merge(comms, cands, all = TRUE)
with(new_dfs, table(comms, cands))

Update. Trying to avoid creating a large matrix with `table`:

library(tidyr)
library(dplyr)
dfs = read.table(text = "contributor organization
1 cand1
2 cand2
3 comm1
3 cand1
4 cand1
5 cand2
5 cand1
5 comm2", sep = " ", stringsAsFactors = FALSE, header = TRUE)

# select only comms with their contributor
comms = dfs %>% filter(grepl("^comm", organization))

# select only cands
cands = dfs %>% 
    filter(grepl("^cand", organization)) %>% 
    mutate(
        value = 1
    ) %>% 
    spread(key  = organization, value = value, fill = 0)

left_join(comms, cands)

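【Discussion】:

I understand the intuition behind this code, but I ran into something similar to what happened with the other suggestions: Error in table(comms, cands) : attempt to make a table with >= 2^31 elements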
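The error above arises because `table()` allocates one cell for every committee-candidate combination, including all the empty ones. One way around it, sketched here in base R as an alternative and not as the answerer's method, is to stay in long format and store only the pairs that actually share a contributor. The names `pairs`, `counts` and `n_shared` are made up for this example.

```r
dfs <- data.frame(
  contributor  = c(1, 2, 3, 3, 4, 5, 5, 5),
  organization = c("cand1", "cand2", "comm1", "cand1",
                   "cand1", "cand2", "cand1", "comm2"),
  stringsAsFactors = FALSE
)

comms <- dfs[grepl("^comm", dfs$organization), ]
cands <- dfs[grepl("^cand", dfs$organization), ]
names(comms)[2] <- "comm"
names(cands)[2] <- "cand"

# One row per (contributor, comm, cand) combination; unique() guards against
# contributors who gave to the same organization more than once
pairs <- unique(merge(comms, cands, by = "contributor"))

# Count contributors per (comm, cand) pair; pairs with zero overlap are absent
counts <- aggregate(contributor ~ comm + cand, data = pairs, FUN = length)
names(counts)[3] <- "n_shared"
counts
```

Pairs that do not appear in `counts` have no shared contributors, so the full grid is never materialized.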

【Solution 3】:

Here is one possible solution using tidyr, dplyr and table(). First we flag, for each pairing of one cand and one comm, the contributors who gave to both.

library(tidyr)
library(dplyr)

df_summary <- 
df %>% mutate(ct = 1) %>% spread(organization, ct) %>% 
transmute(
  comm1_cand1 = ifelse(cand1 + comm1 > 0, 1, 0),
  comm2_cand1 = ifelse(cand1 + comm2 > 0, 1, 0),
  comm1_cand2 = ifelse(cand2 + comm1 > 0, 1, 0),
  comm2_cand2 = ifelse(cand2 + comm2 > 0, 1, 0)) %>%
gather() %>%
separate(key, into = c("comm", "cand"), sep = "_")

This gives a two-way classified data frame that looks like this:

#    comm  cand value
#1  comm1 cand1    NA
#2  comm1 cand1    NA
#3  comm1 cand1     1
#4  comm1 cand1    NA
#5  comm1 cand1    NA
#6  comm2 cand1    NA
#7  comm2 cand1    NA
# etc

Now we make a two-way table from the data.

table(df_summary)

#   cand
#comm    cand1 cand2
#  comm1     1     0
#  comm2     1     1

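【Discussion】:

This solution looks good, but I do get the following error: "Error: Duplicate identifiers for rows..." Do you have any thoughts on this?
OK. Could you provide a larger data sample that causes the problem? I only used your initial 8 observations, assuming it would scale, but apparently it doesn't.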
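A likely cause of the "Duplicate identifiers" error is that `spread()` requires each (contributor, organization) combination to appear at most once, whereas the question states that every row is a single contribution, so repeat donors produce repeated rows. The following base R sketch, on made-up data, shows how such rows can be found and dropped before reshaping. The names `dup` and `df_unique` are introduced only for this example.

```r
# Made-up data in which contributor 3 gave to cand1 twice
df <- data.frame(
  contributor  = c(3, 3, 3, 5),
  organization = c("comm1", "cand1", "cand1", "comm2"),
  stringsAsFactors = FALSE
)

# TRUE for every row that repeats an earlier (contributor, organization) pair
dup <- duplicated(df)
df[dup, ]

# One row per pair; this is the shape the reshaping step expects
df_unique <- df[!dup, ]
df_unique
```

If the number of contributions per pair matters, the duplicates could be counted instead of dropped, but for a count of shared donors one row per pair is enough.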
