【问题标题】:Multi-steps forecasting with dplyr and do使用 dplyr 进行多步预测并执行
【发布时间】:2015-09-17 06:09:21
【问题描述】:

dplyr 中的 do-function 可以让您快速轻松地制作许多很酷的模型,但我正在努力使用这些模型来进行良好的滚动预测。

# Data illustration

require(dplyr)
require(forecast)

df <- data.frame(
  Date = seq.POSIXt(from = as.POSIXct("2015-01-01 00:00:00"), 
                    to = as.POSIXct("2015-06-30 00:00:00"), by = "hour"))

  df <- df %>% mutate(Hour = as.numeric(format(Date, "%H")) + 1, 
                      Wind = runif(4320, min = 1, max = 5000), 
                      Temp = runif(4320, min = - 20, max = 25), 
                      Price = runif(4320, min = -15, max = 45)
                      )

我的因子变量是Hour,我的外生变量是Windtemp,我要预测的是Price。所以,基本上,我有 24 个模型可以用来进行滚动预测。

现在,我的数据框包含 180 天。我想回到 100 天,做一个 1 天的滚动预测,然后能够将其与实际的 Price 进行比较。

这种蛮力操作看起来像这样:

# First I fit the data frame to be exactly the right length
# 100 days to start with (2015-03-21 or so), then 99, then 98.., etc. 
n <- 100 * 24

# Make the price <- NA so I can replace it with a forecast
df$Price[(nrow(df) - n): (nrow(df) - n + 24)] <- NA

# Now I make df just 81 days long, the estimation period + the first forecast
df <- df[1 : (nrow(df) - n + 24), ]

# The actual do & fit, later termed fx(df)

result <- df %>% group_by(Hour) %>% do ({
  historical <- .[!is.na(.$Price), ]
  forecasted <- .[is.na(.$Price), c("Date", "Hour", "Wind", "Temp")]
  fit <- Arima(historical$Price, xreg = historical[, 3:4], order = c(1, 1, 0))
  data.frame(forecasted[], 
             Price = forecast.Arima(fit, xreg = forecasted[3:4])$mean )
})

result

现在我将 n 更改为 99 * 24。但是将它放在循环或应用中会很棒,但我根本不知道该怎么做,还要保存每个新的预测。

我试过这样的循环,但还没有运气:

# 100 days ago, forecast that day, then the next, etc.
for (n in 1:100) { 
  nx <- n * 24 * 80         # Because I want to start after 80 days
  df[nx:(nx + 23), 5] <- NA # Set prices to NA so I can forecast them
  fx(df) # do the function
  df.results[n] <- # Write the results into a vector / data frame to save them
    # and now rinse and repeat for n + 1
  }

broom-like 解决方案的真正精彩奖励积分 :)

【问题讨论】:

    标签: r dplyr apply forecasting


    【解决方案1】:

    首先我会注意到您的 for 循环中存在错误。你可能指的是(n+80)*24,而不是n*24*80。如果您还想包含第 81 天的预测,循环中的计数器也应该从 0 到 99 而不是 1 到 100。

    我将尝试在下面为您的问题提供一个优雅的解决方案。首先,我们以与您在帖子中所做的完全相同的方式定义我们的测试数据框:

    set.seed(2)
    df <- data.frame(
    Date = seq.POSIXt(from = as.POSIXct("2015-01-01 00:00:00"), 
                        to = as.POSIXct("2015-06-30 00:00:00"), by = "hour"))
    df <- df %>% mutate(Hour = as.numeric(format(Date, "%H")) + 1, 
                        Wind = runif(4320, min = 1, max = 5000), 
                        Temp = runif(4320, min = - 20, max = 25), 
                        Price = runif(4320, min = -15, max = 45)
    )
    

    接下来,我们定义一个函数来执行特定日期的预测。输入参数是正在考虑的数据框和训练集中应该包含的最少训练天数(在本例中 = 80)。 minTrainingDays+offSet+1 代表我们预测的实际日期。请注意,我们从 0 开始计算偏移量。

    forecastOneDay <- function(theData,minTrainingDays,offset)
    {
      nrTrainingRows <- (minTrainingDays+offset)*24
    
      theForecast <- theData %>% 
        filter(min_rank(Date) <= nrTrainingRows+24) %>% # Drop future data that we don't need
        group_by(Hour) %>%
        do ({
          trainingData <- head(.,-1) # For each group, drop the last entry from the dataframe
          forecastData <- tail(.,1) %>% select(Date,Hour,Wind,Temp) # For each group, predict the last entry
          fit <- Arima(trainingData$Price, xreg=trainingData[,3:4], order=c(1,1,0))
          data.frame(forecastData, realPrice = tail(.,1)$Price, predictedPrice = forecast.Arima(fit,xreg=forecastData[3:4])$mean)
        })
    }
    

    我们想要预测第 81-180 天。换句话说,我们的训练集中至少需要 80 天,并且想要计算偏移量0:99 的函数结果。这可以通过一个简单的lapply 调用来完成。最后,我们将所有结果合并到一个数据框中:

    # Perform one day forecasts for days 81-180
    resultList <- lapply(0:99, function(x) forecastOneDay(df,80,x))
    # Merge all the results
    mergedForecasts <- do.call("rbind",resultList)
    

    编辑 在查看了您的帖子和同时发布的另一个答案后,我注意到我的答案存在两个潜在问题。首先,您需要一个包含 80 天训练数据的滚动窗口。但是,在我之前的代码中,所有可用的训练数据都用于拟合模型,而不是仅返回 80 天。其次,代码对 DST 更改不健壮。

    这两个问题在下面的代码中得到修复。现在函数的输入也更加直观:训练天数和实际预测天数可以作为输入参数。请注意,POSIXlt 数据格式在对日期执行操作时可以正确处理 DST、闰年等内容。因为您的数据框中的日期是 POSIXct 类型,我们需要来回进行一次小的类型转换才能正确处理。

    下面的新代码:

    forecastOneDay <- function(theData,nrTrainingDays,predictDay) # predictDay should be greater than nrTrainingDays
    {
      initialDate <- as.POSIXlt(theData$Date[1]); # First day (midnight hour)
      startDate <- initialDate # Beginning of training interval
      endDate <- initialDate # End of test interval
    
      startDate$mday <- initialDate$mday + (predictDay-nrTrainingDays-1) # Go back 80 days from predictday
      endDate$mday <- startDate$mday + (nrTrainingDays+1) # +1 to include prediction day
    
      theForecast <- theData %>% 
        filter(Date >= as.POSIXct(startDate),Date < as.POSIXct(endDate)) %>% 
        group_by(Hour) %>%
        do ({
          trainingData <- head(.,-1) # For each group, drop the last entry from the dataframe
          forecastData <- tail(.,1) %>% select(Date,Hour,Wind,Temp) # For each group, predict the last entry
          fit <- Arima(trainingData$Price, xreg=trainingData[,3:4], order=c(1,1,0))
          data.frame(forecastData, realPrice = tail(.,1)$Price, predictedPrice = forecast.Arima(fit,xreg=forecastData[3:4])$mean)
        })
    }
    
    # Perform one day forecasts for days 81-180
    resultList <- lapply(81:180, function(x) forecastOneDay(df,80,x))
    # Merge all the results
    mergedForecasts <- do.call("rbind",resultList)
    

    结果如下所示:

    > head(mergedForecasts)
    Source: local data frame [6 x 6]
    Groups: Hour
    
                     Date Hour     Wind      Temp  realPrice predictedPrice
    1 2015-03-22 00:00:00    1 1691.589 -8.722152 -11.207139       5.918541
    2 2015-03-22 01:00:00    2 1790.928 18.098358   3.902686      37.885532
    3 2015-03-22 02:00:00    3 1457.195 10.166422  22.193270      34.984164
    4 2015-03-22 03:00:00    4 1414.502  4.993783   6.370435      12.037642
    5 2015-03-22 04:00:00    5 3020.755  9.540715  25.440357      -1.030102
    6 2015-03-22 05:00:00    6 4102.651  2.446729  33.528199      39.607848
    > tail(mergedForecasts)
    Source: local data frame [6 x 6]
    Groups: Hour
    
                     Date Hour      Wind       Temp  realPrice predictedPrice
    1 2015-06-29 18:00:00   19 1521.9609 13.6414797  12.884175     -6.7789109
    2 2015-06-29 19:00:00   20  555.1534  3.4758159  37.958768     -5.1193514
    3 2015-06-29 20:00:00   21 4337.6605  4.7242352  -9.244882     33.6817379
    4 2015-06-29 21:00:00   22 3140.1531  0.8127839  15.825230     -0.4625457
    5 2015-06-29 22:00:00   23 1389.0330 20.4667234 -14.802268     15.6755880
    6 2015-06-29 23:00:00   24  763.0704  9.1646139  23.407525      3.8214642
    

    【讨论】:

    • 很好的答案,正是我想要的!
    【解决方案2】:

    可以使用 dplyr 创建一个“滚动”的 data.frame,如下所示

    library(dplyr)
    library(lubridate)
    
    WINDOW_SIZE_DAYS <- 80
    
    df2 <- df %>%
      mutate(Day = yday(Date)) %>%
      replicate( n = WINDOW_SIZE_DAYS, simplify = FALSE ) %>% 
      bind_rows %>%
      group_by(Date) %>%
      mutate(Replica_Num = 1:n() ) %>%
      mutate(Day_Group_id = Day + Replica_Num - 1 ) %>%
      ungroup() %>%
      group_by(Day_Group_id) %>%
      filter( n() >= 24*WINDOW_SIZE_DAYS - 1 ) %>%
      select( -Replica_Num ) %>%
      arrange(Date) %>%
      ungroup()
    

    基本上,此代码会根据需要复制观察结果,并为每个 80 天的块分配相应的 Day_Group_id。 这样做是允许使用group_by(Day_Group_id) 以便在每个 80 天的块上单独运行模型。

    随后,可以根据需要使用它。例如,只需从上面复制/粘贴 arima 代码,如下所示:

    df3 <- df2 %>%
      group_by(Day_Group_id, Hour) %>%
      arrange(Date) %>%
      do ({
        trainingData <- head(.,-1) # For each group, drop the last entry from the dataframe
        forecastData <- tail(.,1) %>% select(Date,Hour,Wind,Temp) # For each group, predict the last entry
        fit <- Arima(trainingData$Price, xreg=trainingData[,3:4], order=c(1,1,0))
        data.frame(forecastData, realPrice = tail(.,1)$Price, predictedPrice = forecast.Arima(fit,xreg=forecastData[3:4])$mean)
      })
    

    请注意:

    这里使用filter(n() &gt;= 24*WINDOW_SIZE_DAYS - 1) 代替filter(n() == 24*WINDOW_SIZE_DAYS) 以选择完整的80 天窗口。这是由于 2015-03-08 的夏令时调整所致。数据集中不存在小时 2015-03-08 02:00:00,因为它从 2015-03-08 01:00:00 直接跳转到 2015-03-08 03:00:00

    【讨论】:

      猜你喜欢
      • 2013-01-23
      • 2021-12-15
      • 2021-02-01
      • 2017-09-14
      • 1970-01-01
      • 2022-11-22
      • 2023-03-03
      • 2021-03-18
      • 1970-01-01
      相关资源
      最近更新 更多