我确信有一个很好的data.table 方法可以做到这一点。如果我想出一个,我会编辑我的答案。现在我有一个简单的方法来使用order 和dplyr。注意Mode函数会取第一个Ethn,所以我在hq2中重新排序来演示。
Dplyr
library(dplyr)
hq2 <- data.table(H_Code = c("AS-01", "AS-01", "AS-02", "AS-02", "AS-02", "AS-02", "AS-03", "AS-03"), Rela_HH = c(10,1,1,2,4,4,2,1), Ethn = c("SEN", "SEN", "FA", "MA", "MA", "FA", "NZ", "MA"))
hq2[order(H_Code, Rela_HH), ] %>%
group_by(H_Code) %>%
summarize(Ethn_Mode = Mode(Ethn))
# A tibble: 3 x 2
H_Code Ethn_Mode
<chr> <chr>
1 AS-01 SEN
2 AS-02 MA
3 AS-03 MA
数据表
如果您有一张大桌子,您可以将您的键设置为H_Code 和Rela_HH 以进行更快的排序。
library(data.table)
hq1 <- as.data.table(hq2)
hq1[order(H_Code, Rela_HH), ][, Mode(Ethn), by = list(H_Code)]
H_Code V1
1: AS-01 SEN
2: AS-02 MA
3: AS-03 MA
编辑
这是修改后的dplyr 代码。
hq3 %>%
group_by(H_Code, Ethn) %>%
mutate(eth_count = sum(n())) %>%
mutate(priority = all(1 %in% Rela_HH & !1 %in% eth_count)) %>%
arrange(H_Code, desc(priority), desc(eth_count), Rela_HH) %>%
ungroup() %>%
group_by(H_Code) %>%
filter(row_number() %in% 1)
这里是修改后的data.table 代码,它没有使用任何mode 函数。它所做的是创建一个priority 列,如果任何种族出现不止一次,则该列设置为TRUE,并且该组中的一个条目在Rela_HH 中设置了1。然后,您设置排序顺序以计算出顺序和任何关系 (H_Code, -priority, -count, Rela_HH)。
具有不同场景的三个数据集,hq3(HH 种族发生 >1,但 hq4(HH 种族发生 1)和 hq5(HH 种族发生 >1,并且 >非HH)。
hq3 <- data.table(H_Code = c("AS-01", "AS-02", "AS-02", "AS-02", "AS-02", "AS-02", "AS-03", "AS-03"), Rela_HH = c(10,4,3,2,1,5,2,1), Ethn = c("SEN", "FA", "FA", "FA", "MA", "MA", "NZ", "MA"))
hq4 <- data.table(H_Code = c("AS-01", "AS-02", "AS-02", "AS-02", "AS-02", "AS-02", "AS-03", "AS-03"), Rela_HH = c(10,4,3,2,1,5,2,1), Ethn = c("SEN", "MA", "FA", "FA", "ZA", "MA", "NZ", "MA"))
hq5 <- data.table(H_Code = c("AS-01", "AS-02", "AS-02", "AS-02", "AS-02", "AS-02", "AS-03", "AS-03"), Rela_HH = c(10,4,3,2,1,5,2,1), Ethn = c("SEN", "MA", "FA", "FA", "MA", "MA", "NZ", "MA"))
hq3[, `:=` (count = .N), by = list(H_Code, Ethn)][, priority := all(1 %in% Rela_HH & !1 %in% count), by = list(H_Code, Ethn)][order(H_Code, Rela_HH, -count), ] # leave off the last [] section, it's here to show this output in order
H_Code Rela_HH Ethn count priority
1: AS-01 10 SEN 1 FALSE
2: AS-02 1 MA 2 TRUE
3: AS-02 2 FA 3 FALSE
4: AS-02 3 FA 3 FALSE
5: AS-02 4 FA 3 FALSE
6: AS-02 5 MA 2 TRUE
7: AS-03 1 MA 1 FALSE
8: AS-03 2 NZ 1 FALSE
hq3[order(H_Code, -priority, -count, Rela_HH), ][hq3[, .I[1], by = list(H_Code)]$V1]
H_Code Rela_HH Ethn count priority
1: AS-01 10 SEN 1 FALSE
2: AS-02 1 MA 2 TRUE
3: AS-03 1 MA 1 FALSE
基准测试
哪个更快? Data.table,大约是 2 倍,并且在大表上设置了键,可能更多。
> microbenchmark(DTmeth(hq1), dplyrmeth(hq2), neval = 1000)
Unit: nanoseconds
expr min lq mean median uq max neval cld
DTmeth(hq1) 624865 641316 665616.17 654145.5 677087 847038 100 b
dplyrmeth(hq2) 1144980 1161583 1202274.35 1180147.5 1223617 1651814 100 c
neval 0 0 3.56 1.0 1 303 100 a
修改后的代码测试(我很好奇data.table 是否有更好的方法来做到这一点,因为它现在与dplyr 差不多):
Unit: nanoseconds
expr min lq mean median uq max neval cld
DTmeth(hq5) 653542 683727.5 773249.70 706670.5 747120.5 1791880 100 b
DT2meth(hq5) 1670831 1816633.0 1947881.22 1874742.5 2040164.5 2892786 100 c
dplyrmeth(hq3) 1169733 1232672.5 1587836.22 1281876.5 1367757.5 24089844 100 c
dplyr2meth(hq3) 1414848 1506917.5 1903192.98 1541481.5 1631438.0 26743551 100 c
neval 0 1.0 18.88 1.0 1.0 303 100 a