【Question title】: Filter a data frame by two conditions in R
【Posted】: 2018-09-06 20:58:15
【Question description】:

I have a data frame containing the maximum and minimum temperatures recorded at climate stations on given dates, the All.Stations dataset:

Station.Name    Year    Month   Day TMAX    TMIN
GRAND MARAIS    1942    7       28    82      60
GRAND MARAIS    1962    3       17    42      22
LEECH LAKE      1956    7       3     72      50
ALBERT LEA 3 SE 1998    1       25    25      15
TWO HARBORS     1933    5       20    77      42
ARGYLE          1922    9       13    NA      NA

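I also have a data frame listing the station-years that are complete (i.e., years for which I have data for every day of the year), the complete.years dataset: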

Station.Name    Year
DULUTH          1904
AGASSIZ REFUGE  1995
LEECH LAKE      1956
GRAND MARAIS    1942
LEECH LAKE      1994

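I want to filter the first data frame down to only the rows whose Station.Name and Year together match a row in the second data frame.

The correct result would be: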

Station.Name    Year TMAX
GRAND MARAIS    1942   82
LEECH LAKE      1956   72

Here is what I have so far using dplyr:

Max.Tempurature <- All_Stations %>% 
  group_by(Station.Name, Year) %>%
  select(Station.Name, Year, TMAX) %>%
  filter(min_rank(desc(TMAX)) <= 1) %>%
  filter((Year %in% complete.years$Year & Station.Name %in% complete.years$Station.Name))

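I can filter by both Year and Station.Name, but each condition searches the entire data frame for a match independently of the other.

How do I filter on Station.Name and Year occurring together in the same observation?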
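
To make the problem concrete, here is a minimal base R sketch on hypothetical toy data (the stations "A" and "B", the `obs` and `complete` data frames, and the `key` helper are illustrative, not from the question). It shows how two independent `%in%` tests let through a station/year pair that never occurs together in the lookup table, while a combined key does not.

```r
# Hypothetical toy data (not from the question): station A is complete only
# in 2000, station B only in 2001.
obs <- data.frame(
  Station.Name = c("A", "A", "B"),
  Year = c(2000L, 2001L, 2001L),
  stringsAsFactors = FALSE
)
complete <- data.frame(
  Station.Name = c("A", "B"),
  Year = c(2000L, 2001L),
  stringsAsFactors = FALSE
)

# Two independent %in% tests: each column is checked against the whole
# lookup column, so the pair (A, 2001) passes even though it is not complete.
independent <- obs$Station.Name %in% complete$Station.Name &
  obs$Year %in% complete$Year

# Pairwise test: build one key per row so station and year must match together.
# The "|" separator assumes that no station name contains "|".
key <- function(df) paste(df$Station.Name, df$Year, sep = "|")
paired <- key(obs) %in% key(complete)

print(independent)  # TRUE TRUE TRUE
print(paired)       # TRUE FALSE TRUE
```

The pasted key is only meant to illustrate the idea of matching on both columns at once; a join on the two columns expresses the same thing without string manipulation.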

【Comments on the question】:

  • inner_join is definitely your best option. See @akrun's answer.

Tags: r dplyr


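【Solution 1】:

We can do an inner_join, which by default joins on all columns the two data frames have in common (here Station.Name and Year):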

library(dplyr)
inner_join(All.Stations[c(1, 2, 5)], complete.years)
#   Station.Name Year TMAX
#1 GRAND MARAIS 1942   82
#2   LEECH LAKE 1956   72

Data

All.Stations <- structure(list(Station.Name = c("GRAND MARAIS", "GRAND MARAIS", 
"LEECH LAKE", "ALBERT LEA 3 SE", "TWO HARBORS", "ARGYLE"), Year = c(1942L, 
1962L, 1956L, 1998L, 1933L, 1922L), Month = c(7L, 3L, 7L, 1L, 
5L, 9L), Day = c(28L, 17L, 3L, 25L, 20L, 13L), TMAX = c(82L, 
42L, 72L, 25L, 77L, NA), TMIN = c(60L, 22L, 50L, 15L, 42L, NA
)), class = "data.frame", row.names = c(NA, -6L))

complete.years <- structure(list(Station.Name = c("DULUTH", 
    "AGASSIZ REFUGE", "LEECH LAKE", 
"GRAND MARAIS", "LEECH LAKE"), Year = c(1904L, 1995L, 1956L, 
1942L, 1994L)), class = "data.frame", row.names = c(NA, -5L))
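
The pipeline in the question also kept the highest TMAX per station and year. A possible sketch that combines the join with that step is shown below; it assumes dplyr 1.0 or later (for the `.groups` argument), rebuilds only the columns needed from the sample data, and the name `max_temperature` is illustrative.

```r
library(dplyr)

# Sample data from the question, reduced to the columns used here.
All.Stations <- data.frame(
  Station.Name = c("GRAND MARAIS", "GRAND MARAIS", "LEECH LAKE",
                   "ALBERT LEA 3 SE", "TWO HARBORS", "ARGYLE"),
  Year = c(1942L, 1962L, 1956L, 1998L, 1933L, 1922L),
  TMAX = c(82L, 42L, 72L, 25L, 77L, NA),
  stringsAsFactors = FALSE
)
complete.years <- data.frame(
  Station.Name = c("DULUTH", "AGASSIZ REFUGE", "LEECH LAKE",
                   "GRAND MARAIS", "LEECH LAKE"),
  Year = c(1904L, 1995L, 1956L, 1942L, 1994L),
  stringsAsFactors = FALSE
)

max_temperature <- All.Stations %>%
  # Keep only station-years listed in complete.years; naming the keys
  # explicitly avoids relying on the "joining by" default.
  inner_join(complete.years, by = c("Station.Name", "Year")) %>%
  group_by(Station.Name, Year) %>%
  # One row per station-year with its highest recorded TMAX.
  summarise(TMAX = max(TMAX, na.rm = TRUE), .groups = "drop")

print(max_temperature)
```

Joining before summarising means the maximum is only computed for station-years that are actually complete.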

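【Comments】:

  • You could also use semi_join, since all the information of interest is in All.Stations and the other data frame is only used to decide which cases to include: All.Stations %>% semi_join(complete.years, by = c("Station.Name", "Year")) %>% select(Station.Name, Year, TMAX)
  • Both work, thanks! I ended up using semi_join because I only want the columns from X, and for simplicity I left the other columns in Y out of the example. Thanks again!!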
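
One practical difference between the two joins discussed above can be sketched as follows: `semi_join` only filters the rows of the first data frame, while `inner_join` returns one row per match, so a key that appears twice in the lookup table duplicates the result. The `stations` and `dup_years` data frames below are hypothetical; in particular, the repeated (LEECH LAKE, 1956) row is an assumption made for illustration and does not occur in the question's data.

```r
library(dplyr)

stations <- data.frame(
  Station.Name = c("GRAND MARAIS", "GRAND MARAIS", "LEECH LAKE"),
  Year = c(1942L, 1962L, 1956L),
  TMAX = c(82L, 42L, 72L),
  stringsAsFactors = FALSE
)

# Hypothetical lookup table in which (LEECH LAKE, 1956) is listed twice.
dup_years <- data.frame(
  Station.Name = c("GRAND MARAIS", "LEECH LAKE", "LEECH LAKE"),
  Year = c(1942L, 1956L, 1956L),
  stringsAsFactors = FALSE
)

keys <- c("Station.Name", "Year")

# semi_join: keeps each matching row of `stations` once, with its own columns.
semi <- semi_join(stations, dup_years, by = keys)

# inner_join: one output row per match, so LEECH LAKE 1956 appears twice.
inner <- inner_join(stations, dup_years, by = keys)

print(nrow(semi))   # 2
print(nrow(inner))  # 3
```

If the lookup table is guaranteed to have unique station-year pairs, both joins return the same rows here.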
【Solution 2】:

Or use merge from base R:

cols <- c('Station.Name', 'Year', 'TMAX')
merge(All.Stations[cols], complete.years, all.x = FALSE)
#  Station.Name Year TMAX
#1 GRAND MARAIS 1942   82
#2   LEECH LAKE 1956   72

Data

All.Stations <- structure(list(Station.Name = c("GRAND MARAIS", "GRAND MARAIS", 
"LEECH LAKE", "ALBERT LEA 3 SE", "TWO HARBORS", "ARGYLE"), Year = c(1942L, 
1962L, 1956L, 1998L, 1933L, 1922L), Month = c(7L, 3L, 7L, 1L, 
5L, 9L), Day = c(28L, 17L, 3L, 25L, 20L, 13L), TMAX = c(82L, 
42L, 72L, 25L, 77L, NA), TMIN = c(60L, 22L, 50L, 15L, 42L, NA
)), .Names = c("Station.Name", "Year", "Month", "Day", "TMAX", 
"TMIN"), class = "data.frame", row.names = c(NA, -6L))

complete.years <- structure(list(Station.Name = c("DULUTH", "AGASSIZ REFUGE", "LEECH LAKE", 
"GRAND MARAIS", "LEECH LAKE"), Year = c(1904L, 1995L, 1956L, 
1942L, 1994L)), .Names = c("Station.Name", "Year"), class = "data.frame", row.names = c(NA, 
-5L))
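
As a small variation on the merge call above, the key columns can be named explicitly with `by`, which guards against other shared column names being picked up as keys by accident. This is a sketch on a reduced copy of the sample data; the `stations`, `complete` and `matched` names are illustrative.

```r
# Reduced copy of the sample data (only the columns used here).
stations <- data.frame(
  Station.Name = c("GRAND MARAIS", "GRAND MARAIS", "LEECH LAKE",
                   "ALBERT LEA 3 SE", "TWO HARBORS", "ARGYLE"),
  Year = c(1942L, 1962L, 1956L, 1998L, 1933L, 1922L),
  TMAX = c(82L, 42L, 72L, 25L, 77L, NA),
  stringsAsFactors = FALSE
)
complete <- data.frame(
  Station.Name = c("DULUTH", "AGASSIZ REFUGE", "LEECH LAKE",
                   "GRAND MARAIS", "LEECH LAKE"),
  Year = c(1904L, 1995L, 1956L, 1942L, 1994L),
  stringsAsFactors = FALSE
)

# merge() keeps only rows with a match in both data frames by default
# (all = FALSE) and sorts the result by the key columns.
matched <- merge(stations, complete, by = c("Station.Name", "Year"))

print(matched)
```

Because `all = FALSE` is the default, the `all.x = FALSE` argument in the call above is optional.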
