【Question title】: Drop ID with NA in a conditional group
【Posted】: 2018-10-30 08:38:47
【Question】:

Extending this question:

I prepared some data with the following code:

# # Data Preparation ----------------------
library(lubridate)
start_date <- "2018-10-30 00:00:00"
start_date <- as.POSIXct(start_date, origin="1970-01-01")
dates <- c(start_date)
for(i in 1:287) {
    dates <- c(dates, start_date + minutes(i * 10))
}
dates <- as.POSIXct(dates, origin="1970-01-01")
date_val <- format(dates, '%d-%m-%Y')

weather.forecast.data <- data.frame(dateTime = dates, date = date_val)
weather.forecast.data <- rbind(weather.forecast.data, weather.forecast.data, weather.forecast.data, weather.forecast.data)
weather.forecast.data$id <- c(rep('GH1', 288), rep('GH2', 288), rep('GH3', 288), rep('GH4', 288))
weather.forecast.data$radiation <- round(runif(nrow(weather.forecast.data)), 2)

weather.forecast.data$hour <- as.integer(format(weather.forecast.data$dateTime, '%H'))
weather.forecast.data$day_night <- ifelse(weather.forecast.data$hour < 6, 'night', ifelse(weather.forecast.data$hour < 19, 'day', 'night'))

# # GH2: Total Morning missing # #
weather.forecast.data$radiation[(weather.forecast.data$id == 'GH2') & (weather.forecast.data$date == '30-10-2018') & (weather.forecast.data$day_night == 'day')] = NA
weather.forecast.data$hour <- NULL
weather.forecast.data$day_night <- NULL

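My task is to use dplyr in R to drop id/date combinations from weather.forecast.data whose radiation values are missing (NA) for the whole morning half of the day (06 hours to 18 hours).

I want to eliminate the rows for a given id and date when the radiation values are missing for the entire morning. That is, if an id is missing its morning radiation on a date, I remove all rows with that id and date, which means all 144 records for that pair.

We can see that GH2 is missing radiation for the entire morning on 30-10-2018. So all 144 records with id == 'GH2' and date == '30-10-2018' should be removed.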

library(data.table)
setDT(weather.forecast.data)
weather.forecast.data[, sum(is.na(radiation)), .(id, date)]
    id       date V1
1: GH1 30-10-2018  0
2: GH1 31-10-2018  0
3: GH2 30-10-2018 78
4: GH2 31-10-2018  0
5: GH3 30-10-2018  0
6: GH3 31-10-2018  0
7: GH4 30-10-2018  0
8: GH4 31-10-2018  0

I have code that does this with data.table:

setDT(weather.forecast.data)
weather.forecast.data[, hour:= hour(dateTime)]
weather.forecast.data[, day_night:=c("night", "day")[(6 <= hour & hour < 19) + 1L]]
weather.forecast.data[, date_id := paste(date, id, sep = "__")]
weather.forecast.data[, all_is_na := all(is.na(radiation)), .(date_id, day_night)]
weather.forecast.data[!(date_id %in% unique(weather.forecast.data[(all_is_na == TRUE) & (day_night == 'day'), date_id]))]

I need code that uses dplyr, and I tried the following. It removes more rows than required:

library(dplyr)
weather.forecast.data <- weather.forecast.data %>%
    mutate(hour = as.integer(format(dateTime, '%H'))) %>%
    mutate(day_night = ifelse(hour < 6, 'night', ifelse(hour < 19, 'day', 'night'))) %>%
    group_by(date, day_night, id) %>%
    filter((!all(is.na(radiation))) & (day_night == 'day')) %>%
    select (-c(hour, day_night)) %>%
    as.data.frame

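Note: the output should be the data with the rows where id = 'GH2' and date = '30-10-2018' removed.

【Comments】:

    Tags: r dplyr data.table


    【Solution 1】:

    I believe you are overcomplicating this a bit. The following code does what you describe in the question.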

    library(lubridate)  # hour()
    library(dplyr)
    
    weather.forecast.data %>%
      # Label each row as day (06:00 to 18:59) or night
      mutate(hour = hour(dateTime),
             day_night = c("night", "day")[(6 <= hour & hour < 19) + 1L]) %>%
      group_by(date, id) %>%
      # keep is TRUE only if the (date, id) group has no NA among its daytime rows
      mutate(keep = all(!(is.na(radiation) & day_night == "day"))) %>%
      ungroup() %>%
      filter(keep) %>%
      select(-hour, -day_night, -keep) %>%
      as.data.frame() -> df1
    

    Check that it removes the expected 144 rows.

    nrow(weather.forecast.data) - nrow(df1)
    #[1] 144
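
The pipeline above drops a (date, id) group as soon as any daytime value is NA. That matches the question's data, where the whole daytime window is blanked at once. If the requirement is read strictly (drop a group only when every daytime value is missing), a variant along the following lines may be closer. This is only a sketch: the name `wfd`, the explicit UTC time zone, and the single stray NA added to GH3 are assumptions made for illustration and are not part of the original post.

```r
library(dplyr)

set.seed(4192)

# Rebuild data shaped like the question's: 4 ids x 288 ten-minute steps (two days).
dates <- seq(as.POSIXct("2018-10-30 00:00:00", tz = "UTC"),
             by = "10 min", length.out = 288)
wfd <- do.call(rbind, lapply(c("GH1", "GH2", "GH3", "GH4"), function(i) {
  data.frame(dateTime = dates, id = i)
}))
wfd$date <- format(wfd$dateTime, "%d-%m-%Y", tz = "UTC")
wfd$radiation <- round(runif(nrow(wfd)), 2)

hour_of_day <- as.integer(format(wfd$dateTime, "%H", tz = "UTC"))

# GH2 on 30-10-2018: the whole daytime window (06:00 to 18:59) is missing.
wfd$radiation[wfd$id == "GH2" & wfd$date == "30-10-2018" &
                hour_of_day >= 6 & hour_of_day < 19] <- NA

# Assumption for illustration: one isolated daytime NA in GH3.
# Under the strict reading this group must survive.
wfd$radiation[wfd$id == "GH3" & hour_of_day == 12][1] <- NA

strict <- wfd %>%
  mutate(hr = as.integer(format(dateTime, "%H", tz = "UTC")),
         is_day = hr >= 6 & hr < 19) %>%
  group_by(date, id) %>%
  # One logical per group: drop the group only if ALL daytime values are NA.
  filter(!all(is.na(radiation[is_day]))) %>%
  ungroup() %>%
  select(-hr, -is_day) %>%
  as.data.frame()

nrow(wfd) - nrow(strict)  # rows removed
```

Because `filter()` receives a single TRUE or FALSE per group, every row of a group is kept or dropped together, so night-time rows of the affected id and date go as well.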
    

    Data.

    I am reposting the data-generation code, simplified in two places and with a call to set.seed

    set.seed(4192)
    
    start_date <- "2018-10-30 00:00:00"
    start_date <- as.POSIXct(start_date, origin="1970-01-01")
    dates <- start_date + minutes(0:287 * 10)
    dates <- as.POSIXct(dates, origin="1970-01-01")
    date_val <- format(dates, '%d-%m-%Y')
    
    weather.forecast.data <- data.frame(dateTime = dates, date = date_val)
    weather.forecast.data <- rbind(weather.forecast.data, weather.forecast.data, weather.forecast.data, weather.forecast.data)
    weather.forecast.data$id <- c(rep('GH1', 288), rep('GH2', 288), rep('GH3', 288), rep('GH4', 288))
    weather.forecast.data$radiation <- round(runif(nrow(weather.forecast.data)), 2)
    
    weather.forecast.data$hour <- hour(weather.forecast.data$dateTime)
    weather.forecast.data$day_night <- ifelse(weather.forecast.data$hour < 6, 'night', ifelse(weather.forecast.data$hour < 19, 'day', 'night'))
    
    # # GH2: Total Morning missing # #
    weather.forecast.data$radiation[(weather.forecast.data$id == 'GH2') & (weather.forecast.data$date == '30-10-2018') & (weather.forecast.data$day_night == 'day')] = NA
    weather.forecast.data$hour <- NULL
    weather.forecast.data$day_night <- NULL
    

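    【Comments】:

    • Your code does not delete any records. Ideally it should delete the records with id = GH2 and date = 30-10-2018, because radiation is missing for the entire morning.
    • @KartheekPalepu Sorry, you are right. I had c("night", "day") reversed. The error is corrected.
    • The code still does not delete any records. Also, I am concerned that you group by date and id and check whether the whole group is NA. I only need the daytime part to be NA (in case this helps you update your code). I need to delete all rows with id = GH2 and date = 30-10-2018, because the entire morning's radiation data is missing for that group.
    • @KartheekPalepu I believe this time you are wrong; my code drops 78 records. If you assign the pipe to df1 and compare the dim of weather.forecast.data and df1, the difference equals sum(weather.forecast.data$id == 'GH2' & weather.forecast.data$date == '30-10-2018' & is.na(weather.forecast.data$radiation))
    • Sorry, it is deleting 78 records. It needs to eliminate all 144 records for that id and date. (There are 144 records per id per date.)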
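
The 78 versus 144 discrepancy discussed in these comments looks like the difference between a row-level condition and a group-level one. The sketch below illustrates that on hypothetical toy data; the data frame `toy` and its values are invented for illustration only.

```r
library(dplyr)

# Hypothetical toy data: two ids, each with two daytime rows and one night-time row.
toy <- data.frame(
  id        = rep(c("A", "B"), each = 3),
  day_night = rep(c("day", "day", "night"), times = 2),
  radiation = c(0.4, 0.7, 0.1,   # A: nothing missing
                NA,  NA,  0.3)   # B: the whole daytime is missing
)

# Row-level condition: only the NA daytime rows themselves are removed.
row_level <- toy %>%
  filter(!(is.na(radiation) & day_night == "day"))

# Group-level condition: one flag per id, so the whole group is removed.
group_level <- toy %>%
  group_by(id) %>%
  filter(!all(is.na(radiation[day_night == "day"]))) %>%
  ungroup()

nrow(row_level)    # 4: B keeps its night-time row
nrow(group_level)  # 3: every row of B is gone
```

On the question's data the row-level form would remove only the 78 missing daytime records, while the group-level form removes all 144 records for that id and date.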
    【Solution 2】:

    You are filtering for rows that have only "day" in the day_night column. If I understand correctly, you need the following:

        library(dplyr)
        weather.forecast.data <- weather.forecast.data %>%
          mutate(hour = as.integer(format(dateTime, '%H'))) %>%
          mutate(day_night = ifelse(hour < 6, 'night', ifelse(hour < 19, 'day', 
                                                             'night'))) %>%
          group_by(date, day_night, id) %>%
          filter((!(all(is.na(radiation))) & (day_night == 'day'))) %>%
          select (-c(hour, day_night)) %>%
          as.data.frame
    

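    This removes all IDs that have only NAs during the day.

    【Comments】:

    • The code you provided removes all night-time data for all IDs. Ideally it should only remove the data points with id = GH2 and date = 30-10-2018.
    • You are right, as Rui Barradas shows in his answer.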
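
As the comment above notes, combining the NA test with `day_night == 'day'` in a single `filter()` also discards every night-time row. One possible way around this, mirroring the question's data.table approach (find the offending keys, then exclude them), is to compute the bad (id, date) pairs first and remove them with `anti_join()`. This is a sketch on hypothetical toy data; `toy`, `bad_pairs`, `kept`, and `sol2_like` are names introduced here for illustration.

```r
library(dplyr)

# Hypothetical toy data: two ids, two dates, one day row and one night row each.
toy <- data.frame(
  id        = rep(c("GH1", "GH2"), each = 4),
  date      = rep(c("30-10-2018", "30-10-2018", "31-10-2018", "31-10-2018"), times = 2),
  day_night = rep(c("day", "night"), times = 4),
  radiation = c(0.2, 0.5, 0.9, 0.1,   # GH1: complete
                NA,  0.6, 0.3, 0.8)   # GH2: daytime missing on 30-10-2018
)

# Filter shaped like the one in Solution 2: night-time rows never pass the test.
sol2_like <- toy %>%
  group_by(date, day_night, id) %>%
  filter(!all(is.na(radiation)) & day_night == "day") %>%
  ungroup()

# Step 1: (id, date) pairs whose daytime values are all NA.
bad_pairs <- toy %>%
  filter(day_night == "day") %>%
  group_by(id, date) %>%
  summarise(all_na = all(is.na(radiation)), .groups = "drop") %>%
  filter(all_na) %>%
  select(id, date)

# Step 2: drop every row (day and night) belonging to those pairs.
kept <- anti_join(toy, bad_pairs, by = c("id", "date"))

nrow(sol2_like)  # 3: only daytime rows survive
nrow(kept)       # 6: only GH2 on 30-10-2018 is removed
```

Keeping the "which pairs are bad" step separate makes it easy to inspect `bad_pairs` before anything is deleted. The `.groups` argument assumes dplyr 1.0.0 or later.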