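【Question title】: Dplyr filter dataset based on values referenced from another dataset, returns all or no rows
【Posted】: 2021-09-04 15:33:32
【Question description】:

I am having trouble filtering a dataset based on values referenced from another dataset.

I have two datasets. The first, comparison_dt, lists every comparison I need to make, one per row, as a location1/location2 pair. The second, rain_values_dt, contains values collected from those locations at different times. My goal is, for each row of comparison_dt, to pull the rain_values_dt rows collected at location1, pull the rain_values_dt rows collected at location2, inner join the two sets of rows, run a paired t-test, and write the test statistic into an additional column of comparison_dt.

The problem is that I cannot filter the rain_values_dt rows by a location name referenced from comparison_dt. Filtering on the name stored in the first row of the comparison table returns every row of rain_values_dt. Filtering on a name stored in a later row returns nothing. I only want the rows from the site I reference in the filter.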


library(data.table)
library(dplyr)

comparison_dt <- data.table(
  location1= c('austin_tx','austin_tx','austin_tx','boston_ma','boston_ma','boston_ma','chicago_il','chicago_il','chicago_il'),
  location2= c('austin_tx','boston_ma','chicago_il','austin_tx','boston_ma','chicago_il','austin_tx','boston_ma','chicago_il'),
  test_statistic= NA_real_  # placeholder, to be filled with the t-test statistic (c() is NULL)
)

rain_values_dt <- data.table(
  location=c('austin_tx','austin_tx','austin_tx','boston_ma','boston_ma','boston_ma','chicago_il','chicago_il','chicago_il'),
  month=c('march','april','may','march','april','may','march','april','may'),
  rainfall=c(1:9)
)

row_n=1

#my intended result, works as expected v
dplyr::filter(rain_values_dt, location == 'austin_tx')

#is pulling the correct name from the comparison table to filter on
comparison_dt[row_n,'location1']

#these are equivalent to each other, so I should be able to substitute, right?
'austin_tx' == comparison_dt[row_n,'location1']

#does not work, returns all values instead of filtering
dplyr::filter(rain_values_dt, location == comparison_dt[row_n,'location1'])

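This is a simplification of a larger dataset in which not all site comparisons are valid, trials must be matched on many different conditions, and the number of trials per site is uneven.

This worked as expected before. I restarted my R session, and it no longer does.

Thinking I might have imported the datasets differently, I tried changing the location names in either dataset to character or factor type. I tried referencing the location column as a vector and in quotes. I tried unloading and reloading dplyr, and checking whether R was using the base stats version of filter or the dplyr version. This seems like a simple problem, but I searched this site and the filter() documentation and found no explanation of why the function might behave this way.

【Comments】:

Tags: r dplyr filtering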

【Solution 1】:

The object on the right-hand side of == is a data.table.

class(comparison_dt[row_n,'location1'])
[1] "data.table" "data.frame"

We need to extract the column as a vector, using $ or [[

dplyr::filter(rain_values_dt, location == 
            comparison_dt[row_n,'location1']$location1)
     location month rainfall
1: austin_tx march        1
2: austin_tx april        2
3: austin_tx   may        3

Or even unlist, which also creates a vector

dplyr::filter(rain_values_dt, location == 
            unlist(comparison_dt[row_n,'location1']))
    location month rainfall
1: austin_tx march        1
2: austin_tx april        2
3: austin_tx   may        3
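
The [[ form mentioned above is not shown in the examples, so here is a minimal sketch of it next to two equivalent extractions. The shortened comparison_dt, the choice of row_n = 2, and the names via_dollar, via_brackets, via_dt_j and target are assumptions made for illustration only.

```r
library(data.table)
library(dplyr)

# Shortened, hypothetical versions of the tables from the question
comparison_dt <- data.table(
  location1 = c("austin_tx", "boston_ma", "chicago_il"),
  location2 = c("boston_ma", "chicago_il", "austin_tx")
)
rain_values_dt <- data.table(
  location = rep(c("austin_tx", "boston_ma", "chicago_il"), each = 3),
  month    = rep(c("march", "april", "may"), times = 3),
  rainfall = 1:9
)
row_n <- 2

# Three ways to get a plain character vector of length 1
via_dollar   <- comparison_dt$location1[row_n]
via_brackets <- comparison_dt[["location1"]][row_n]
via_dt_j     <- comparison_dt[row_n, location1]  # unquoted column name in j

# Filter with the extracted value rather than with a one-cell data.table
target   <- via_brackets
filtered <- dplyr::filter(rain_values_dt, location == target)
print(filtered)
```

Storing the value in a variable first also makes it easy to check class(target) before filtering.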

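As for why we get all rows of the dataset: the first element of 'location1' is 'austin_tx', which is also the first element of 'location' in 'rain_values_dt'. So == returns a single TRUE, and that value is recycled across every row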

comparison_dt[row_n,'location1']
location1
1: austin_tx

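Hypothetically, if the column had 'boston_ma' as its first element, the filter would return 0 rows, because the comparison against the first element of 'location' returns FALSE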

dplyr::filter(rain_values_dt, location == data.table(location1 = 'boston_ma'))
Empty data.table (0 rows and 3 cols): location,month,rainfall
dplyr::filter(rain_values_dt, location == comparison_dt[row_n,'location1'])
     location month rainfall
1:  austin_tx march        1
2:  austin_tx april        2
3:  austin_tx   may        3
4:  boston_ma march        4
5:  boston_ma april        5
6:  boston_ma   may        6
7: chicago_il march        7
8: chicago_il april        8
9: chicago_il   may        9

That is, if we take the expression out of filter, it becomes clearer: the output is a single TRUE/FALSE, which then gets recycled

rain_values_dt$location == data.table(location1 = 'boston_ma')
     location1
[1,]     FALSE
rain_values_dt$location == comparison_dt[row_n,'location1']
     location1
[1,]      TRUE
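
To see the contrast with a plain vector on the right-hand side, the sketch below compares the shape of both results. The names one_cell, location, as_table and as_vector are hypothetical, and the location vector is assumed to mirror rain_values_dt$location from the question.

```r
library(data.table)

# Hypothetical stand-ins for the objects in the question
one_cell <- data.table(location1 = "boston_ma")
location <- rep(c("austin_tx", "boston_ma", "chicago_il"), each = 3)

# Comparing against the one-cell table gives a small matrix, not one value per row
as_table <- location == one_cell
print(dim(as_table))

# Comparing against the extracted column gives one logical value per element
as_vector <- location == one_cell$location1
print(length(as_vector))
print(which(as_vector))
```

Only the second form has one entry per row of the data being filtered, which is what filter needs.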

For a data.frame/data.table/tibble, the unit is a column. So the length of comparison_dt[, 'location1'] is 1. If we select more rows from 'comparison_dt', the elementwise comparison behaviour becomes more apparent

rain_values_dt$location == comparison_dt[3:5,'location1']
     location1
[1,]      TRUE
[2,]     FALSE
[3,]     FALSE

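That is, the first element is TRUE because the first element of 'location' in 'rain_values_dt' ('austin_tx') is compared with the third element of comparison_dt$location1, which is also 'austin_tx'. The next element is FALSE because 'boston_ma' is compared with the second element of rain_values_dt$location, which is again 'austin_tx'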
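
With the filter fixed, the workflow described in the question (filter each location, inner join, paired t-test, store the statistic) could look roughly like the sketch below. This is one possible arrangement, not the asker's actual code: the rainfall numbers are made up so that the paired differences are not constant, joining on month is an assumption about how trials match, same-location pairs are left out because their differences are all zero, and paired_t_statistic is a hypothetical helper.

```r
library(data.table)
library(dplyr)

# Made-up rainfall values (assumption), four months per location
rain_values_dt <- data.table(
  location = rep(c("austin_tx", "boston_ma", "chicago_il"), each = 4),
  month    = rep(c("march", "april", "may", "june"), times = 3),
  rainfall = c(1.2, 2.5, 3.1, 0.8,  4.0, 3.6, 5.2, 4.4,  2.2, 2.9, 1.7, 3.5)
)

# Only distinct-location pairs are compared in this sketch
comparison_dt <- data.table(
  location1 = c("austin_tx", "austin_tx", "boston_ma"),
  location2 = c("boston_ma", "chicago_il", "chicago_il")
)

# Hypothetical helper: loc1 and loc2 are plain character values, not tables
paired_t_statistic <- function(loc1, loc2, values) {
  first  <- dplyr::filter(values, location == loc1)
  second <- dplyr::filter(values, location == loc2)
  # Match observations by month; suffixes keep the two rainfall columns apart
  joined <- dplyr::inner_join(first, second, by = "month", suffix = c("_1", "_2"))
  unname(t.test(joined$rainfall_1, joined$rainfall_2, paired = TRUE)$statistic)
}

# Columns are passed as vectors, so each call receives a single location name
comparison_dt[, test_statistic := mapply(
  paired_t_statistic, location1, location2,
  MoreArgs = list(values = rain_values_dt),
  USE.NAMES = FALSE
)]
print(comparison_dt)
```

Because mapply iterates over the column vectors, the helper never sees a one-cell data.table, so the recycling problem described above does not come up.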

【Discussion】: