I'm assuming that you're starting with something like this:
mydf <- structure(list(
  Preferences = c("8a 8b 9a", "10b 11c 9b", "11a 11c 8e"),
  Status = c("Employed", "Unemployed", "Student"),
  Gender = c("Female", "Male", "Female")),
  .Names = c("Preferences", "Status", "Gender"),
  class = c("data.frame"), row.names = c(NA, -3L))
mydf
#   Preferences     Status Gender
# 1    8a 8b 9a   Employed Female
# 2  10b 11c 9b Unemployed   Male
# 3  11a 11c 8e    Student Female
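If that's the case, you need to "split" the "Preferences" column (by spaces), transform the data into a "long" form, and then reshape it to a wide form, tabulating while you do so.
With the right tools, this is pretty straightforward.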
library(devtools)
library(data.table)
library(reshape2)
source_gist(11380733) # for `cSplit` (also available in the "splitstackshape" package)
dcast.data.table(                                # Step 3--aggregate to wide form
  melt(                                          # Step 2--convert to long form
    cSplit(mydf, "Preferences", " ", "long"),    # Step 1--split "Preferences"
    id.vars = "Preferences"),
  Preferences ~ value, fun.aggregate = length)
#    Preferences Employed Female Male Student Unemployed
# 1:         10b        0      0    1       0          1
# 2:         11a        0      1    0       1          0
# 3:         11c        0      1    1       1          1
# 4:          8a        1      1    0       0          0
# 5:          8b        1      1    0       0          0
# 6:          8e        0      1    0       1          0
# 7:          9a        1      1    0       0          0
# 8:          9b        0      0    1       0          1
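If you would rather not pull in extra packages, the same three steps (split, go long, tabulate) can be sketched in base R. This is only an illustration of the idea on the small example data, not the approach used above; the intermediate names `prefs`, `n`, `long`, and `tab` are my own, and `lengths()` assumes R 3.2.0 or later.

```r
# Small example data from the top of the answer, rebuilt so the block runs on its own
mydf <- data.frame(
  Preferences = c("8a 8b 9a", "10b 11c 9b", "11a 11c 8e"),
  Status = c("Employed", "Unemployed", "Student"),
  Gender = c("Female", "Male", "Female"),
  stringsAsFactors = FALSE
)

# Step 1: split "Preferences" on spaces (one character vector per row)
prefs <- strsplit(mydf$Preferences, " ", fixed = TRUE)
n <- lengths(prefs)  # number of preferences in each row

# Step 2: long form, one row per (preference, Status or Gender value) pair
long <- data.frame(
  Preferences = rep(unlist(prefs), times = 2),
  value = c(rep(mydf$Status, n), rep(mydf$Gender, n)),
  stringsAsFactors = FALSE
)

# Step 3: tabulate into a wide preference-by-value table of counts
tab <- table(long$Preferences, long$value)
tab
```

The result is a `table` rather than a `data.table`; `as.data.frame.matrix(tab)` would convert it to a plain data frame if you need one.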
I also tried a dplyr + tidyr approach, which looks like the following:
library(dplyr)
library(tidyr)
mydf %>%
  separate(Preferences, c("P_1", "P_2", "P_3")) %>%  ## splitting things
  gather(Pref, Pvals, P_1:P_3) %>%                   # stack the preference columns
  gather(Var, Val, Status:Gender) %>%                # stack the status/gender columns
  group_by(Pvals, Val) %>%                           # group by these new columns
  summarise(count = n()) %>%                         # aggregate the numbers of each
  spread(Val, count)                                 # spread the values out
# Source: local data table [8 x 6]
# Groups:
#
#   Pvals Employed Female Male Student Unemployed
# 1   10b       NA     NA    1      NA          1
# 2   11a       NA      1   NA       1         NA
# 3   11c       NA      1    1       1          1
# 4    8a        1      1   NA      NA         NA
# 5    8b        1      1   NA      NA         NA
# 6    8e       NA      1   NA       1         NA
# 7    9a        1      1   NA      NA         NA
# 8    9b       NA     NA    1      NA          1
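Note that the dplyr + tidyr result has `NA` where the data.table result has `0`. If you want zeros there too, one option is the `fill` argument of `spread()`. Here is a minimal sketch on the small example data; the name `res` is my own, and depending on your dplyr version `summarise()` may print a message about the remaining grouping.

```r
library(dplyr)
library(tidyr)

# Small example data from the top of the answer, rebuilt so the block runs on its own
mydf <- data.frame(
  Preferences = c("8a 8b 9a", "10b 11c 9b", "11a 11c 8e"),
  Status = c("Employed", "Unemployed", "Student"),
  Gender = c("Female", "Male", "Female"),
  stringsAsFactors = FALSE
)

res <- mydf %>%
  separate(Preferences, c("P_1", "P_2", "P_3")) %>%  # split the preferences
  gather(Pref, Pvals, P_1:P_3) %>%                   # stack the preference columns
  gather(Var, Val, Status:Gender) %>%                # stack the status/gender columns
  group_by(Pvals, Val) %>%                           # group by the stacked columns
  summarise(count = n()) %>%                         # count each combination
  spread(Val, count, fill = 0)                       # missing combinations become 0
res
```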
Both approaches are actually pretty quick. Test with some better sample data than what you shared, like this:
preferences <- c(paste0(8, letters[1:3]),
                 paste0(9, letters[1:11]),
                 paste0(10, letters[1:4]),
                 paste0(11, letters[1:3]),
                 paste0(12, letters[1:9]))
set.seed(1)
nrow <- 10000
mydf <- data.frame(
  Preferences = vapply(replicate(nrow,
                                 sample(preferences, 3, FALSE),
                                 FALSE),
                       function(x) paste(x, collapse = " "),
                       character(1L)),
  Status = sample(c("Employed", "Unemployed", "Student"), nrow, TRUE),
  Gender = sample(c("Male", "Female"), nrow, TRUE)
)