Drop ID with NA in a conditional group

【Question title】: Drop ID with NA in a conditional group
【Posted】: 2018-10-30 08:38:47
【Question】:

This is a follow-up to this question:

I prepared some data with the following code:

# # Data Preparation ----------------------
library(lubridate)
start_date <- "2018-10-30 00:00:00"
start_date <- as.POSIXct(start_date, origin="1970-01-01")
dates <- c(start_date)
for(i in 1:287) {
    dates <- c(dates, start_date + minutes(i * 10))
}
dates <- as.POSIXct(dates, origin="1970-01-01")
date_val <- format(dates, '%d-%m-%Y')

weather.forecast.data <- data.frame(dateTime = dates, date = date_val)
weather.forecast.data <- rbind(weather.forecast.data, weather.forecast.data, weather.forecast.data, weather.forecast.data)
weather.forecast.data$id <- c(rep('GH1', 288), rep('GH2', 288), rep('GH3', 288), rep('GH4', 288))
weather.forecast.data$radiation <- round(runif(nrow(weather.forecast.data)), 2)

weather.forecast.data$hour <- as.integer(format(weather.forecast.data$dateTime, '%H'))
weather.forecast.data$day_night <- ifelse(weather.forecast.data$hour < 6, 'night', ifelse(weather.forecast.data$hour < 19, 'day', 'night'))

# # GH2: Total Morning missing # #
weather.forecast.data$radiation[(weather.forecast.data$id == 'GH2') & (weather.forecast.data$date == '30-10-2018') & (weather.forecast.data$day_night == 'day')] = NA
weather.forecast.data$hour <- NULL
weather.forecast.data$day_night <- NULL

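My task is to drop ids from weather.forecast.data, using dplyr in R, when for a given id and date the radiation values are missing (NA) for the morning half of the day (hours 06 to 18).

I want to eliminate the rows for a given id and date when the radiation values are missing for the entire morning. That is, if an id has no morning radiation on a date, I remove all rows with that id and date. All 144 records are dropped, because the morning radiation is missing.

We can see that GH2 is missing radiation for the whole morning on date 30-10-2018. So we delete all 144 records with id == 'GH2' and date == '30-10-2018'.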

library(data.table)
setDT(weather.forecast.data)
weather.forecast.data[, sum(is.na(radiation)), .(id, date)]
    id       date V1
1: GH1 30-10-2018  0
2: GH1 31-10-2018  0
3: GH2 30-10-2018 78
4: GH2 31-10-2018  0
5: GH3 30-10-2018  0
6: GH3 31-10-2018  0
7: GH4 30-10-2018  0
8: GH4 31-10-2018  0

I have code that does this with data.table:

setDT(weather.forecast.data)
weather.forecast.data[, hour:= hour(dateTime)]
weather.forecast.data[, day_night:=c("night", "day")[(6 <= hour & hour < 19) + 1L]]
weather.forecast.data[, date_id := paste(date, id, sep = "__")]
weather.forecast.data[, all_is_na := all(is.na(radiation)), .(date_id, day_night)]
weather.forecast.data[!(date_id %in% unique(weather.forecast.data[(all_is_na == TRUE) & (day_night == 'day'), date_id]))]

I need code that uses dplyr, and I tried the following. It removes more rows than required:

library(dplyr)
weather.forecast.data <- weather.forecast.data %>%
    mutate(hour = as.integer(format(dateTime, '%H'))) %>%
    mutate(day_night = ifelse(hour < 6, 'night', ifelse(hour < 19, 'day', 'night'))) %>%
    group_by(date, day_night, id) %>%
    filter((!all(is.na(radiation))) & (day_night == 'day')) %>%
    select (-c(hour, day_night)) %>%
    as.data.frame

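Note: the output should return the data with the rows where id = 'GH2' and date = '30-10-2018' removed.

【Question comments】:

Tags: r dplyr data.table

【Solution 1】:

I believe you are overcomplicating this a little. The following code does what you describe in the question.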

library(lubridate)
library(dplyr)

weather.forecast.data %>%
  mutate(hour = hour(dateTime),
         day_night = c("night", "day")[(6 <= hour & hour < 19) + 1L]) %>%
  group_by(date, id) %>%
  # Despite its name, `delete` is TRUE for the groups to keep:
  # it is TRUE only when no daytime row of the (date, id) group is NA.
  mutate(delete = all(!(is.na(radiation) & day_night == "day"))) %>%
  ungroup() %>%
  filter(delete) %>%
  select(-hour, -day_night, -delete) %>%
  as.data.frame() -> df1

Check that it gives the expected 144 deleted rows.

nrow(weather.forecast.data) - nrow(df1)
#[1] 144
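One point worth illustrating: the condition in the pipeline above drops a (date, id) group as soon as any daytime value is NA, while the question asks for groups whose daytime values are all NA. On the question's data the two rules agree, because GH2 is missing the whole morning. The sketch below uses a small hypothetical data frame (`toy`, with made-up ids `A` and `B`, not part of the original post) to show where they differ.

```r
library(dplyr)

# Hypothetical toy data: one date, two ids, four readings each
# (two "day" rows and two "night" rows per id).
toy <- data.frame(
  id        = rep(c("A", "B"), each = 4),
  date      = "30-10-2018",
  day_night = rep(c("day", "day", "night", "night"), times = 2),
  radiation = c(NA, NA, 0.3, 0.4,    # A: every daytime value is missing
                NA, 0.2, 0.5, 0.6)   # B: only one daytime value is missing
)

# Rule used in the answer: drop the group if ANY daytime value is NA.
drop_if_any <- toy %>%
  group_by(date, id) %>%
  filter(all(!(is.na(radiation) & day_night == "day"))) %>%
  ungroup()

# Stricter reading of the question: drop the group only if ALL daytime values are NA.
drop_if_all <- toy %>%
  group_by(date, id) %>%
  filter(!all(is.na(radiation[day_night == "day"]))) %>%
  ungroup()

nrow(drop_if_any)  # both groups are dropped
nrow(drop_if_all)  # only group A is dropped
```

Note that with the second rule a group that has no daytime rows at all would also be dropped, since `all()` of an empty vector is `TRUE`; whether that is desirable depends on the data.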

Data.

I am reposting the data generation code, simplified in two places and with a call to set.seed.

set.seed(4192)

start_date <- "2018-10-30 00:00:00"
start_date <- as.POSIXct(start_date, origin="1970-01-01")
dates <- start_date + minutes(0:287 * 10)
dates <- as.POSIXct(dates, origin="1970-01-01")
date_val <- format(dates, '%d-%m-%Y')

weather.forecast.data <- data.frame(dateTime = dates, date = date_val)
weather.forecast.data <- rbind(weather.forecast.data, weather.forecast.data, weather.forecast.data, weather.forecast.data)
weather.forecast.data$id <- c(rep('GH1', 288), rep('GH2', 288), rep('GH3', 288), rep('GH4', 288))
weather.forecast.data$radiation <- round(runif(nrow(weather.forecast.data)), 2)

weather.forecast.data$hour <- hour(weather.forecast.data$dateTime)
weather.forecast.data$day_night <- ifelse(weather.forecast.data$hour < 6, 'night', ifelse(weather.forecast.data$hour < 19, 'day', 'night'))

# # GH2: Total Morning missing # #
weather.forecast.data$radiation[(weather.forecast.data$id == 'GH2') & (weather.forecast.data$date == '30-10-2018') & (weather.forecast.data$day_night == 'day')] = NA
weather.forecast.data$hour <- NULL
weather.forecast.data$day_night <- NULL

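【Comments】:

Your code does not delete any records. Ideally, it should delete the records with id = GH2 and date = 30-10-2018, because the radiation is missing for that whole morning.
@KartheekPalepu Sorry, you are right. I had c("night", "day") reversed. The mistake has been corrected.
The code still does not delete any records. Also, I am concerned that you are grouping by date and id and checking whether the whole group is NA. I only need the daytime part to be NA (in case that helps you update your code). I need to delete all rows with id = GH2 and date = 30-10-2018, because the whole morning's radiation data is missing for that group.
@KartheekPalepu I believe you are wrong this time; my code drops 78 records. If you assign the pipe to df1 and compare the dim of weather.forecast.data and df1, the difference equals sum(weather.forecast.data$id == 'GH2' & weather.forecast.data$date == '30-10-2018' & is.na(weather.forecast.data$radiation)).
Sorry, it is deleting 78 records. It needs to eliminate all 144 records for that id and date (each id has 144 records per date).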
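The two figures in this exchange follow directly from the 10-minute sampling. A quick sketch to confirm them is given below; the use of UTC is an assumption made here to keep the example independent of the local time zone, which the original code relies on.

```r
# One day of 10-minute timestamps. UTC is an assumption of this sketch;
# the original data generation code uses the session's local time zone.
one_day <- seq(as.POSIXct("2018-10-30 00:00:00", tz = "UTC"),
               by = "10 min", length.out = 144)

hour_of_day <- as.integer(format(one_day, "%H"))
is_day <- hour_of_day >= 6 & hour_of_day < 19   # same day/night rule as the question

length(one_day)  # 144 rows per id and date (24 hours x 6 readings per hour)
sum(is_day)      # 78 daytime rows (13 hours x 6 readings per hour)
```

So a filter that removes only the NA rows drops 78 records, while removing the whole (date, id) group drops 144.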

【Solution 2】:

You are filtering for only the rows that have "day" in the day_night column. If I understand correctly, you need the following:

    library(dplyr)
    weather.forecast.data <- weather.forecast.data %>%
      mutate(hour = as.integer(format(dateTime, '%H'))) %>%
      mutate(day_night = ifelse(hour < 6, 'night', ifelse(hour < 19, 'day', 
                                                         'night'))) %>%
      group_by(date, day_night, id) %>%
      filter((!(all(is.na(radiation))) & (day_night == 'day'))) %>%
      select (-c(hour, day_night)) %>%
      as.data.frame

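This will remove all ids that have all NA during the day.

【Comments】:

The code you provided is deleting all the night data for all ids. Ideally, it should only delete the data points with id = GH2 and date = 30-10-2018.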
You are right, as Rui Barradas shows in his answer.
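The night rows disappear because `day_night == 'day'` inside `filter()` is a row-level condition, so every night row fails it regardless of the group test. One way around this, sketched below, mirrors the data.table approach from the question: first collect the (date, id) keys whose daytime radiation is entirely NA, then remove all rows for those keys with `anti_join()`. The data frame `wfd` is a small hypothetical stand-in for weather.forecast.data (hourly readings, two ids, UTC timestamps), not the original data.

```r
library(dplyr)

# Hypothetical stand-in for weather.forecast.data: hourly readings, two ids, one date.
set.seed(1)
times <- seq(as.POSIXct("2018-10-30 00:00:00", tz = "UTC"),
             by = "1 hour", length.out = 24)
wfd <- data.frame(
  dateTime  = rep(times, times = 2),
  date      = format(rep(times, times = 2), "%d-%m-%Y"),
  id        = rep(c("GH1", "GH2"), each = 24),
  radiation = round(runif(48), 2)
)

# Blank out the whole daytime window (06 to 18 h) for GH2, as in the question.
hrs <- as.integer(format(wfd$dateTime, "%H"))
wfd$radiation[wfd$id == "GH2" & hrs >= 6 & hrs < 19] <- NA

# Step 1: find the (date, id) keys whose daytime radiation is entirely NA.
bad_keys <- wfd %>%
  mutate(hour = as.integer(format(dateTime, "%H")),
         day_night = ifelse(hour >= 6 & hour < 19, "day", "night")) %>%
  filter(day_night == "day") %>%
  group_by(date, id) %>%
  filter(all(is.na(radiation))) %>%
  ungroup() %>%
  distinct(date, id)

# Step 2: drop every row (day and night) that belongs to those keys.
result <- anti_join(wfd, bad_keys, by = c("date", "id"))

nrow(wfd) - nrow(result)  # number of rows removed
unique(result$id)         # ids that remain
```

Splitting the work into "find the keys" and "remove the keys" keeps the daytime test separate from the row removal, so night rows of unaffected groups are never touched.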