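Answers: data frame subsetting under multiple conditions

【Question title】: Data frame subsetting under multiple conditions
【Posted】: 2020-01-11 20:47:48
【Question】:

I have a data frame with information on the top two candidates in Brazilian local elections, as follows: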

Name <- c('Andressa', 'Marcos', 'Anderson', 'Cibelle', 'Ivy', 'Eliana')
Municipality <- c('A', 'A', 'B', 'B', 'C', 'C')
Gender <- c('F', 'M', 'M', 'F', 'F', 'F')
Vote_Share <- c(51, 49, 55, 45, 70, 30)
data <- data.frame(Name, Municipality, Gender, Vote_Share)

Name       Municipality   Gender   Vote_Share 
Andressa         A           F         51
Marcos           A           M         49
Anderson         B           M         55
Cibelle          B           F         45
Ivy              C           F         70
Eliana           C           F         30

I want to keep only the municipalities where the race was between one man and one woman.

So I am looking for output like this:

Name       Municipality   Gender   Vote_Share 
Andressa         A           F         51
Marcos           A           M         49
Anderson         B           M         55
Cibelle          B           F         45

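In addition, I want to create another object containing the woman's win margin in each municipality's election (the woman's vote share minus the man's vote share):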

Municipality     Win Margin
A                    2
B                    10

Regards,

【Comments】:

Tags: r dataframe dplyr

【Solution 1】:

Another way in base R, using ave and subset:

# Keep only municipalities whose candidates include both 'F' and 'M'
temp <- subset(data, as.logical(ave(Gender, Municipality, FUN = function(x)
                all(c('F', 'M') %in% x))))

#      Name Municipality Gender Vote_Share
#1 Andressa            A      F         51
#2   Marcos            A      M         49
#3 Anderson            B      M         55
#4  Cibelle            B      F         45

Then use aggregate to compute the vote margin:

aggregate(Vote_Share~Municipality, temp, function(x) diff(range(x)))

#  Municipality Vote_Share
#1            A          2
#2            B         10
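
Note that diff(range(x)) returns the absolute gap between the two shares, whereas the question defines the margin as the woman's share minus the man's. Below is a minimal base R sketch of a signed version, assuming exactly one woman and one man in each retained municipality. The names has_both, mixed, shares and margin are illustrative and not part of the original answer.

```r
data <- data.frame(
  Name = c("Andressa", "Marcos", "Anderson", "Cibelle", "Ivy", "Eliana"),
  Municipality = c("A", "A", "B", "B", "C", "C"),
  Gender = c("F", "M", "M", "F", "F", "F"),
  Vote_Share = c(51, 49, 55, 45, 70, 30),
  stringsAsFactors = FALSE
)

# TRUE for municipalities whose candidates include both genders
has_both <- tapply(data$Gender, data$Municipality,
                   function(g) all(c("F", "M") %in% g))
mixed <- data[data$Municipality %in% names(which(has_both)), ]

# Municipality x Gender matrix of vote shares
# (assumes one candidate per cell, so sum() just picks that value)
shares <- with(mixed, tapply(Vote_Share, list(Municipality, Gender), sum))

# Signed margin: positive when the woman won, negative when she lost
margin <- data.frame(
  Municipality = rownames(shares),
  Win_Margin = unname(shares[, "F"] - shares[, "M"]),
  row.names = NULL
)
margin
```

On the sample data this gives 2 for A and -10 for B, because the woman lost in B. Wrapping Win_Margin in abs() reproduces the output shown in the question.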

【Comments】:

【Solution 2】:

Here is a base R solution using subset() + ave():

dfout <- subset(data, as.logical(ave(Gender, Municipality, FUN = function(x) length(unique(x)) == 2)))

or

dfout <- subset(data, as.logical(ave(Gender, Municipality, FUN = function(x) !any(duplicated(x)))))

which gives

> dfout
      Name Municipality Gender Vote_Share
1 Andressa            A      F         51
2   Marcos            A      M         49
3 Anderson            B      M         55
4  Cibelle            B      F         45
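
Both variants rely on ave() returning a vector of the same type as its first argument. With a character Gender column the logical result comes back as the strings "TRUE" and "FALSE", which is why as.logical() is needed. Here is a hedged alternative sketch that hands ave() an integer vector instead, so there is no round trip through character. The name n_genders is illustrative, and the data is rebuilt so the block runs on its own.

```r
data <- data.frame(
  Name = c("Andressa", "Marcos", "Anderson", "Cibelle", "Ivy", "Eliana"),
  Municipality = c("A", "A", "B", "B", "C", "C"),
  Gender = c("F", "M", "M", "F", "F", "F"),
  Vote_Share = c(51, 49, 55, 45, 70, 30),
  stringsAsFactors = FALSE
)

# Number of distinct genders in each row's municipality, repeated per row
n_genders <- ave(as.integer(factor(data$Gender)), data$Municipality,
                 FUN = function(x) length(unique(x)))

# Keep rows from municipalities where both genders appear
dfout <- data[n_genders == 2, ]
dfout
```

Because the grouping result stays numeric, the comparison n_genders == 2 can be used directly as a row index.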

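【Comments】:

【Solution 3】:

You can convert Gender to a factor and then to numeric (F = 1, M = 2), take the mean per municipality, and keep the rows where that mean is not 1. Note that this only rules out all-female municipalities: an all-male one would have a mean of 2 and would still be kept.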

dat[with(dat, ave(as.numeric(as.factor(Gender)), Municipality)) != 1, ]
#       Name Municipality Gender Vote_Share
# 1 Andressa            A      F         51
# 2   Marcos            A      M         49
# 3 Anderson            B      M         55
# 4  Cibelle            B      F         45

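Oddly, this also works: as.logical() reads "F" as FALSE, while "M" has no logical meaning and becomes NA. Any municipality with a male candidate therefore has an NA group mean, and those are the rows we keep.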

dat[is.na(with(dat, ave(as.logical(Gender), Municipality))), ]
#       Name Municipality Gender Vote_Share
# 1 Andressa            A      F         51
# 2   Marcos            A      M         49
# 3 Anderson            B      M         55
# 4  Cibelle            B      F         45
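
Both tricks work on the sample data because its only single-gender municipality is all-female. Here is a small sketch of the edge case, using a hypothetical data frame dat2 with an all-male municipality D (not part of the original data), together with a stricter test that keeps only group means strictly between 1 and 2.

```r
# Hypothetical data: A is mixed, C is all-female, D is all-male
dat2 <- data.frame(
  Municipality = c("A", "A", "C", "C", "D", "D"),
  Gender = c("F", "M", "F", "F", "M", "M"),
  stringsAsFactors = FALSE
)

# F -> 1, M -> 2 (factor levels are sorted alphabetically)
code <- as.numeric(as.factor(dat2$Gender))

# Per-municipality mean of the codes, repeated for every row
grp_mean <- ave(code, dat2$Municipality)

# Original test: drops all-female C but keeps all-male D
naive <- dat2[grp_mean != 1, ]

# Stricter test: a mixed group has a mean strictly between 1 and 2
robust <- dat2[grp_mean > 1 & grp_mean < 2, ]
```

With this input, naive still contains municipality D, while robust contains only A.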

Data:

dat <- structure(list(Name = structure(c(2L, 6L, 1L, 3L, 5L, 4L), .Label = c("Anderson", 
"Andressa", "Cibelle", "Eliana", "Ivy", "Marcos"), class = "factor"), 
    Municipality = structure(c(1L, 1L, 2L, 2L, 3L, 3L), .Label = c("A", 
    "B", "C"), class = "factor"), Gender = structure(c(1L, 2L, 
    2L, 1L, 1L, 1L), .Label = c("F", "M"), class = "factor"), 
    Vote_Share = c(51, 49, 55, 45, 70, 30)), class = "data.frame", row.names = c(NA, 
-6L))

【Comments】:

【Solution 4】:

We can group by 'Municipality' and filter for the groups that have two distinct 'Gender' values:

library(dplyr)
out1 <- data %>%
           group_by(Municipality) %>%
           filter(n_distinct(Gender) == 2)
out1
# A tibble: 4 x 4
# Groups:   Municipality [2]
#  Name     Municipality Gender Vote_Share
#  <fct>    <fct>        <fct>       <dbl>
#1 Andressa A            F              51
#2 Marcos   A            M              49
#3 Anderson B            M              55
#4 Cibelle  B            F              45

Or require that both levels are present in 'Gender':

data %>%
    group_by(Municipality) %>%
    filter(all(c("M", "F") %in% Gender))

Once we have the first output, summarise it:

out1 %>%
   summarise(WinMargin = abs(diff(Vote_Share)))
# A tibble: 2 x 2
#  Municipality WinMargin
#  <fct>            <dbl>
#1 A                    2
#2 B                   10

With data.table, we can do

library(data.table)
# setDT() converts data to a data.table by reference
setDT(data)[, .SD[uniqueN(Gender) == 2], .(Municipality)
     ][, .(WinMargin = abs(diff(Vote_Share))), by = Municipality]

Or in base R, we can use subset and table:

subset(data, Municipality %in%  names(which(rowSums(table(Municipality,
         Gender) > 0) > 1)))
#      Name Municipality Gender Vote_Share
#1 Andressa            A      F         51
#2   Marcos            A      M         49
#3 Anderson            B      M         55
#4  Cibelle            B      F         45
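
The one-liner above may be easier to follow when broken into steps. This is a sketch with intermediate names (tab, present, n_genders, keep, result) chosen here for illustration, and with the data rebuilt so the block runs on its own.

```r
data <- data.frame(
  Name = c("Andressa", "Marcos", "Anderson", "Cibelle", "Ivy", "Eliana"),
  Municipality = c("A", "A", "B", "B", "C", "C"),
  Gender = c("F", "M", "M", "F", "F", "F"),
  Vote_Share = c(51, 49, 55, 45, 70, 30),
  stringsAsFactors = FALSE
)

# Candidate counts per municipality (rows) and gender (columns)
tab <- table(data$Municipality, data$Gender)

# TRUE where a gender appears at least once in a municipality
present <- tab > 0

# Number of distinct genders per municipality
n_genders <- rowSums(present)

# Names of municipalities with more than one gender
keep <- names(which(n_genders > 1))

result <- subset(data, Municipality %in% keep)
result
```

Counting columns with a positive entry, rather than checking for "F" and "M" by name, means the same code works however the gender labels are spelled.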

【Comments】: