dplyr::filter function generates wrong results

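【Question title】: dplyr::filter function generates wrong results
【Posted】: 2021-12-28 16:49:53
【Question】:

I am using the dplyr::filter function to filter data on three variables: Sex, Patient.Age, and Country.where.Event.occurred. The first code snippet below produces the correct result and the second produces a wrong one. As far as I can tell, both snippets express the same conditions, so I am confused about why the results differ.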

> data
# A tibble: 1,360 × 3
   Sex    Patient.Age Country.where.Event.occurred
   <chr>  <chr>       <chr>                       
 1 Female 12 YR       US                          
 2 Female 16 YR       KW                          
 3 Female 16 YR       US                          
 4 Female 16 YR       US                          
 5 Female 16 YR       US                          
 6 Female 16 YR       US                          
 7 Female 17 YR       ES                          
 8 Female 17 YR       ES                          
 9 Female 17 YR       GB                          
10 Female 19 YR       CA                          
# … with 1,350 more rows

# unique combination of 3 variables
> key <- data %>% 
+   distinct(Sex, Patient.Age,Country.where.Event.occurred)
> key
# A tibble: 399 × 3
   Sex    Patient.Age Country.where.Event.occurred
   <chr>  <chr>       <chr>                       
 1 Female 12 YR       US                          
 2 Female 16 YR       KW                          
 3 Female 16 YR       US                          
 4 Female 17 YR       ES                          
 5 Female 17 YR       GB                          
 6 Female 19 YR       CA                          
 7 Female 19 YR       US                          
 8 Female 2 YR        US                          
 9 Female 26 YR       US                          
10 Female 28 YR       US                          
# … with 389 more rows

> data %>%
+   filter(Sex == key[3,]$Sex,
+          Patient.Age == key[3,]$Patient.Age,
+          Country.where.Event.occurred == key[3,]$Country.where.Event.occurred)
# A tibble: 4 × 3
  Sex    Patient.Age Country.where.Event.occurred
  <chr>  <chr>       <chr>                       
1 Female 16 YR       US                          
2 Female 16 YR       US                          
3 Female 16 YR       US                          
4 Female 16 YR       US

> Sex <- key[3,]$Sex
> Sex
[1] "Female"
> Age <- key[3,]$Patient.Age
> Age
[1] "16 YR"
> Country <- key[3,]$Country.where.Event.occurred
> Country
[1] "US"
> data %>%
+   filter(Sex == Sex,
+          Patient.Age == Age,
+          Country.where.Event.occurred == Country)
# A tibble: 7 × 3
  Sex    Patient.Age Country.where.Event.occurred
  <chr>  <chr>       <chr>                       
1 Female 16 YR       US                          
2 Female 16 YR       US                          
3 Female 16 YR       US                          
4 Female 16 YR       US                          
5 Male   16 YR       US                          
6 Male   16 YR       US                          
7 Male   16 YR       US

【Question comments】:

Tags: r dplyr environment-variables

【Solution 1】:

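The problem in the second example is most likely the line filter(Sex == Sex, ...).

Both occurrences of Sex, on the left and on the right, are interpreted as the Sex column of the data set. A column always matches itself, so that part of the condition is always true and filters nothing out. That is why the Male rows show up in the second result.

I think you intended the right-hand side to be "Female" (judging from the pattern you use for the other two variables, where the names Age and Country do not collide with any column name).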
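A minimal sketch of the collision and one way around it. The toy data frame and the name target_sex are made up for illustration and are not from the question; the only assumption is that dplyr is installed.

```r
library(dplyr)

# Toy data standing in for the asker's tibble (values invented for illustration)
data <- data.frame(
  Sex = c("Female", "Female", "Male", "Male"),
  Patient.Age = c("16 YR", "17 YR", "16 YR", "16 YR"),
  Country.where.Event.occurred = c("US", "ES", "US", "US")
)

# An environment variable with the same name as a column
Sex <- "Female"

# Inside filter(), both sides of Sex == Sex resolve to the column,
# so this condition keeps every row and only the age condition filters
masked <- data %>% filter(Sex == Sex, Patient.Age == "16 YR")

# A name that is not a column is looked up in the calling environment instead
target_sex <- "Female"
fixed <- data %>% filter(Sex == target_sex, Patient.Age == "16 YR")

nrow(masked)  # 3: the two Male rows slip through
nrow(fixed)   # 1
```

Renaming the local variable is the least intrusive fix here, since Age and Country in the question already work for exactly this reason.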

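To understand this in more depth, I recommend reading the Programming with dplyr vignette a few times. At least for me, each pass teaches (or re-teaches) me a nugget or two. For your specific problem, the "Data masking" section is the relevant one.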

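The key idea behind data masking is that it blurs the line between the two different meanings of the word "variable":

env-variables are "programming" variables that live in an environment. They are usually created with <-.

data-variables are "statistical" variables that live in a data frame. They usually come from data files (e.g. .csv, .xls), or are created by manipulating existing variables.

...

I think this blurring of the meaning of "variable" is a really nice feature...

Unfortunately, this benefit does not come for free...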
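When the same name exists both as a column and as an environment variable, the distinction quoted above can be spelled out explicitly. The sketch below uses the .data and .env pronouns and the !! injection operator, which are available inside dplyr verbs. The toy data frame is invented for illustration.

```r
library(dplyr)

# Toy data (values invented for illustration)
data <- data.frame(
  Sex = c("Female", "Male", "Male"),
  Patient.Age = c("16 YR", "16 YR", "16 YR")
)

# Same name as the column on purpose
Sex <- "Female"

# .data$Sex is the data-variable (the column);
# .env$Sex is the env-variable defined just above
by_pronoun <- data %>% filter(.data$Sex == .env$Sex)

# !! evaluates Sex in the calling environment first and injects its value,
# so the right-hand side becomes the string "Female"
by_inject <- data %>% filter(Sex == !!Sex)

by_pronoun$Sex
by_inject$Sex
```

Either form keeps the original variable name, which can be handy when the filter lives inside a function whose argument names happen to match column names.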

【Discussion】:

Take a look at the reference I just added. It explains things better.