[Posted]: 2021-10-29 14:32:27
[Problem description]:
I have a Spark data frame that I manipulate with sparklyr. It looks like this:
input_data <- data.frame(id = c(10,10,10,20,20,30,30,40,40,40,50,60,70, 80,80,80,100,100,110,110,120,120,120,130,140,150,160,170),
date = c("2021-01-01","2021-01-02","2021-01-03","2021-01-01","2021-01-02","2021-01-01","2021-01-02","2021-01-02","2021-01-01","2021-01-02","2021-01-01","2021-01-02","2021-01-05","2021-01-01","2021-01-02","2021-01-03","2021-01-01","2021-01-02","2021-01-01","2021-01-02","2021-01-02","2021-01-01","2021-01-02","2021-01-01","2021-01-02","2021-01-05","2021-01-01","2021-01-05"),
group = c("A", "B", "C", "B", "C", "A", "C", "A", "A", "A", "C", "A","B","A", "B", "C", "B", "C", "A", "C", "A", "A", "A", "C", "A", "A", "B","A"),
event = c(1,1,1,0,1,0,1,0,0,1,1,1,0,1,1,1,0,1,0,1,0,0,1,1,1,1,1,0))
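I want to aggregate the data so that, for each combination of groups, I get the number of "events" (rows where event == 1) and "non-events" (rows where event == 0). The final output should look like this: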
data.frame(group_a = c(1,0,0,1,0,1),
group_b = c(0,1,0,1,1,0),
group_c = c(0,0,1,0,1,1),
event_occured = c(3,1,2,0,2,2),
event_not_occured = c(4,2,2,0,2,2))
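So, for example, there is no ID whose groups are the combination of A and B, so both the event and non_event counts for that combination are 0. There are 4 IDs in which group A participates, of which 3 led to an event and 1 led to a non_event, and so on.
Which approach in sparklyr (or dplyr, or pyspark) would produce the aggregation described above? I tried the following, but the number of events I get is exactly the same as event_not_occured, so I must be doing something wrong that I cannot pinpoint: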
combo_path_sdf <- input_data %>%
  group_by(id) %>%
  arrange(date) %>%
  mutate(order_seq = ifelse(event > 0, 1, NA)) %>%
  mutate(order_seq = lag(cumsum(ifelse(is.na(order_seq), 0, order_seq)))) %>%
  mutate(order_seq = ifelse((row_number() == 1) & (event > 0), -1, ifelse(row_number() == 1, 0, order_seq))) %>%
  ungroup()
combo_path_sdf %>%
  group_by(id, order_seq) %>%
  summarize(group_a = max(ifelse(group_a == "A", 1, 0)),
            group_b = max(ifelse(group_b == "B", 1, 0)),
            group_c = max(ifelse(group_c == "C", 1, 0)),
            events = sum(event)) %>%
  group_by(order_seq, group_a, group_b, group_c) %>%
  summarize(event = sum(events),
            total_sequences = n()) %>%
  mutate(event_not_occured = total_sequences - event)
A final output in the following format would also be fine:
data.frame(group_a = c("A", "B", "C", "A,B", "B,C", "A,C"),
event_occured = c(3,1,2,1,2,2),
event_not_occured = c(4,2,2,1,2,2))
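[Discussion]:
- Do the data you show and the expected output match? Why is the group_a value "A, B" 0 for both event types? Your data has ID 10 with events in both A and B.
- Oh, that is a mistake, you are right.
Tags: r apache-spark pyspark dplyr sparklyr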