【Question Title】:Fill in values based on previous day in R
【Posted】:2021-12-13 19:24:10
【Question】:

I have a dataset that looks like this:

Date,Open,High,Low,Close,Adjusted_close,Volume
2020-10-28,1384,1384,1384,1384,1384,0
2020-10-29,1297,1297,1297,1297,1297,0
2020-10-30,1283,1283,1283,1283,1283,0
2020-11-02,1284,1284,1284,1284,1284,0
2020-11-03,1263,1263,1263,1263,1263,0
2020-11-04,1224,1224,1224,1224,1224,0
2020-11-05,1194,1194,1194,1194,1194,0
2020-11-06,1196,1196,1196,1196,1196,0
2020-11-09,1207,1207,1207,1207,1207,0
2020-11-10,1200,1200,1200,1200,1200,0

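I would like to add rows for 10-31 and 11-01 that carry the values of the previous trading day (10-30). How can this be done easily in R? I feel like library(tidyr) would fit the picture perfectly?

The expected output is: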

Date,Open,High,Low,Close,Adjusted_close,Volume
2020-10-28,1384,1384,1384,1384,1384,0
2020-10-29,1297,1297,1297,1297,1297,0
2020-10-30,1283,1283,1283,1283,1283,0
2020-10-31,1283,1283,1283,1283,1283,0
2020-11-01,1283,1283,1283,1283,1283,0
2020-11-02,1284,1284,1284,1284,1284,0
2020-11-03,1263,1263,1263,1263,1263,0
2020-11-04,1224,1224,1224,1224,1224,0
2020-11-05,1194,1194,1194,1194,1194,0
2020-11-06,1196,1196,1196,1196,1196,0
2020-11-07,1196,1196,1196,1196,1196,0
2020-11-08,1196,1196,1196,1196,1196,0
2020-11-09,1207,1207,1207,1207,1207,0
2020-11-10,1200,1200,1200,1200,1200,0

The requested dput output:

structure(list(Date = c("2020-10-28", "2020-10-29", "2020-10-30", 
"2020-11-02", "2020-11-03", "2020-11-04", "2020-11-05", "2020-11-06", 
"2020-11-09", "2020-11-10"), Open = c(1384L, 1297L, 1283L, 1284L, 
1263L, 1224L, 1194L, 1196L, 1207L, 1200L), High = c(1384L, 1297L, 
1283L, 1284L, 1263L, 1224L, 1194L, 1196L, 1207L, 1200L), Low = c(1384L, 
1297L, 1283L, 1284L, 1263L, 1224L, 1194L, 1196L, 1207L, 1200L
), Close = c(1384L, 1297L, 1283L, 1284L, 1263L, 1224L, 1194L, 
1196L, 1207L, 1200L), Adjusted_close = c(1384L, 1297L, 1283L, 
1284L, 1263L, 1224L, 1194L, 1196L, 1207L, 1200L), Volume = c(0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)), row.names = c(NA, 10L), class = "data.frame")

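【Question Comments】:

  • Please provide a reproducible example of your dataset by copying the output of dput(head(my_dataset, 10)). Please also provide an example of the desired output.
  • Each weekend is missing 2 days, not counting weekdays on which there was no trading. I filled in Saturday and Sunday with values equal to Friday's.
  • Got it! Sorry, I initially misread your dput() sample input as an illustration of the output, so my comment was mistaken and I have deleted it to avoid confusion. Anyway, my solution is below!
  • Did my solution work for you?

Tags: r dataframe dataset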


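【Solution 1】:

1) Use read.zoo to convert to a zoo class series z (this also converts Date to Date class), then merge it with a zero-width zoo object that has every date from the start to the end of z. Use na.locf to fill in the missing values, and finally use fortify.zoo to convert back to a data frame. If a zoo object is acceptable as the result, omit the fortify.zoo part.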

library(zoo)

z <- read.zoo(dat)
out1 <- merge(z, zoo(, seq(start(z), end(z), "day"))) |> 
  na.locf() |>
  fortify.zoo(name = "Date")

# check - target is defined in Note at the end
identical(out1, transform(target, Date = as.Date(Date)))
## [1] TRUE
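
To make the individual steps of 1) easier to follow, here is a minimal sketch on a three-row toy series (the names z_small, all_days, with_gaps and filled are made up for illustration and do not appear in the answer). It shows that merging with a zero-width zoo object only adds NA rows for the missing dates, and that na.locf then carries the last observation forward.

```r
library(zoo)

# Toy series: three trading days with a weekend gap before the last one.
z_small <- zoo(c(1297L, 1283L, 1284L),
               as.Date(c("2020-10-29", "2020-10-30", "2020-11-02")))

# Zero-width zoo object: an index of every calendar day, no data.
all_days <- zoo(, seq(start(z_small), end(z_small), by = "day"))

# Merging adds the missing dates (10-31 and 11-01) as NA entries.
with_gaps <- merge(z_small, all_days)

# na.locf ("last observation carried forward") fills each NA
# with the most recent non-missing value.
filled <- na.locf(with_gaps)
```

Printing with_gaps and filled side by side should make the effect of each step visible.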

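2) In this alternative we use the following pipeline. Rather than using merge.zoo as above, we convert to ts class and back in order to expand the dates.

  1. Convert dat to zoo class, which also converts the index to Date class.
  2. Then convert that to ts class. Because ts only supports regularly spaced series, the conversion fills the values corresponding to the missing dates with NA.
  3. na.locf then fills in those NAs.
  4. Use fortify.zoo to convert it back to a data frame.
  5. Because ts class does not support a Date index, the Date column is just numbers at this point, so convert it back to Date class.

library(zoo)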

out2 <- dat |> 
  read.zoo() |>
  as.ts() |>
  na.locf() |>
  fortify.zoo(name = "Date") |>
  transform(Date = as.Date(Date))

# check - target is defined in Note at the end    
identical(out2, transform(target, Date = as.Date(Date)))
## [1] TRUE
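
Steps 2 and 5 of the pipeline in 2) can be checked on a small example. The following is a sketch under the assumption of a three-row toy series (z_small, ts_small, day_numbers and dates_back are illustrative names, not part of the answer). It shows that as.ts inserts NA for the skipped days and that the resulting time index holds plain day numbers rather than Date values.

```r
library(zoo)

# Toy series: three trading days with a weekend gap before the last one.
z_small <- zoo(c(1297L, 1283L, 1284L),
               as.Date(c("2020-10-29", "2020-10-30", "2020-11-02")))

# ts only supports regularly spaced series, so the conversion
# regularizes the index and puts NA at the two missing days.
ts_small <- as.ts(z_small)

# The ts time index is numeric (days since 1970-01-01), not Date.
day_numbers <- as.numeric(time(ts_small))

# Converting the numbers back recovers the calendar dates.
dates_back <- as.Date(day_numbers, origin = "1970-01-01")
```

The origin is passed explicitly here so that the conversion does not depend on the R version or on which as.Date method for numbers is in effect.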

Note

The input dat and the output target are assumed to be, in reproducible form:

Lines <- "Date,Open,High,Low,Close,Adjusted_close,Volume
2020-10-28,1384,1384,1384,1384,1384,0
2020-10-29,1297,1297,1297,1297,1297,0
2020-10-30,1283,1283,1283,1283,1283,0
2020-11-02,1284,1284,1284,1284,1284,0
2020-11-03,1263,1263,1263,1263,1263,0
2020-11-04,1224,1224,1224,1224,1224,0
2020-11-05,1194,1194,1194,1194,1194,0
2020-11-06,1196,1196,1196,1196,1196,0
2020-11-09,1207,1207,1207,1207,1207,0
2020-11-10,1200,1200,1200,1200,1200,0"
dat <- read.csv(text = Lines, strip.white = TRUE)

Lines2 <- "Date,Open,High,Low,Close,Adjusted_close,Volume
2020-10-28,1384,1384,1384,1384,1384,0
2020-10-29,1297,1297,1297,1297,1297,0
2020-10-30,1283,1283,1283,1283,1283,0
2020-10-31,1283,1283,1283,1283,1283,0
2020-11-01,1283,1283,1283,1283,1283,0
2020-11-02,1284,1284,1284,1284,1284,0
2020-11-03,1263,1263,1263,1263,1263,0
2020-11-04,1224,1224,1224,1224,1224,0
2020-11-05,1194,1194,1194,1194,1194,0
2020-11-06,1196,1196,1196,1196,1196,0
2020-11-07,1196,1196,1196,1196,1196,0
2020-11-08,1196,1196,1196,1196,1196,0
2020-11-09,1207,1207,1207,1207,1207,0
2020-11-10,1200,1200,1200,1200,1200,0"
target <- read.csv(text = Lines2, strip.white = TRUE)

【Discussion】:

    【Solution 2】:

    Solution

    Here is a solution within the tidyverse that leverages the tidyr::fill() function to fill in values from the preceding rows:

    library(tidyverse)
    
    
    # ...
    # Code to generate 'my_data'.
    # ...
    
    
    my_data %>%
      # Ensure 'Date' column is proper datatype.
      mutate(Date = as.Date(Date)) %>%
      # Link to full range of dates, with blank rows for missing dates.
      right_join(
        # A temporary dataset with the full range of 'Date's.
        tibble(Date = seq(from = min(.$Date), to = max(.$Date), by = "days")),
        by = "Date"
      ) %>%
      # Sort for filling: earlier above later.
      arrange(Date) %>%
      # Fill blank rows with values above.
      fill(everything(), .direction = "down")
    

    Result

    Given my_data like the data.frame reproduced here:

    my_data <- structure(
      list(
        Date = c(
          "2020-10-28", "2020-10-29", "2020-10-30", "2020-11-02", "2020-11-03",
          "2020-11-04", "2020-11-05", "2020-11-06", "2020-11-09", "2020-11-10"
        ),
        Open = c(
          1384L, 1297L, 1283L, 1284L, 1263L, 1224L, 1194L, 1196L, 1207L, 1200L
        ),
        High = c(
          1384L, 1297L, 1283L, 1284L, 1263L, 1224L, 1194L, 1196L, 1207L, 1200L
        ),
        Low = c(
          1384L, 1297L, 1283L, 1284L, 1263L, 1224L, 1194L, 1196L, 1207L, 1200L
        ),
        Close = c(
          1384L, 1297L, 1283L, 1284L, 1263L, 1224L, 1194L, 1196L, 1207L, 1200L
        ),
        Adjusted_close = c(
          1384L, 1297L, 1283L, 1284L, 1263L, 1224L, 1194L, 1196L, 1207L, 1200L
        ),
        Volume = c(
          0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L
        )
      ),
      row.names = c(NA, 10L),
      class = "data.frame"
    )
    

    this solution should produce a data.frame like this:

             Date Open High  Low Close Adjusted_close Volume
    1  2020-10-28 1384 1384 1384  1384           1384      0
    2  2020-10-29 1297 1297 1297  1297           1297      0
    3  2020-10-30 1283 1283 1283  1283           1283      0
    4  2020-10-31 1283 1283 1283  1283           1283      0
    5  2020-11-01 1283 1283 1283  1283           1283      0
    6  2020-11-02 1284 1284 1284  1284           1284      0
    7  2020-11-03 1263 1263 1263  1263           1263      0
    8  2020-11-04 1224 1224 1224  1224           1224      0
    9  2020-11-05 1194 1194 1194  1194           1194      0
    10 2020-11-06 1196 1196 1196  1196           1196      0
    11 2020-11-07 1196 1196 1196  1196           1196      0
    12 2020-11-08 1196 1196 1196  1196           1196      0
    13 2020-11-09 1207 1207 1207  1207           1207      0
    14 2020-11-10 1200 1200 1200  1200           1200      0
    

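    【Discussion】:

      【Solution 3】:

      First, Date must be converted to Date format:

      # dplyr provides %>%, full_join() and arrange(); tidyr provides fill().
      library(dplyr)
      library(tidyr)

      df$Date <- as.Date(df$Date)
      
      df %>% 
        full_join(data.frame(Date = seq(min(df$Date), max(df$Date), by = "days")), by = "Date") %>% 
        arrange(Date) %>% 
        fill(everything())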
      

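      We then join with a data frame that contains only the complete sequence of dates spanned by the data, sort the result, and use the fill function to fill in the new rows.

      【Discussion】: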
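
The join against a full date sequence can also be written with tidyr::complete(), which the answers above do not use. This is offered only as a possible variant, sketched on a toy data frame (df_small and filled_small are illustrative names). complete() adds a row for each date in the supplied sequence that is missing from the data, leaving NA in the other columns, and fill() then copies the previous row's values down.

```r
library(dplyr)
library(tidyr)

# Toy data: three trading days with a weekend gap before the last one.
df_small <- data.frame(
  Date  = as.Date(c("2020-10-29", "2020-10-30", "2020-11-02")),
  Close = c(1297L, 1283L, 1284L)
)

filled_small <- df_small %>%
  # Add rows for every calendar day between the first and last date.
  complete(Date = seq(min(Date), max(Date), by = "day")) %>%
  # Make sure earlier dates sit above later ones before filling.
  arrange(Date) %>%
  # Carry the previous row's values down into the new NA rows.
  fill(everything(), .direction = "down")
```

Whether this reads better than an explicit full_join is a matter of taste; both approaches rely on the same fill() step.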
