【Question title】: Improve speed of drawdown.duration implementation
【Posted】: 2018-11-15 02:35:32
【Question】:

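I have working code that computes a running drawdown.duration, where drawdown.duration is defined as the number of months between the current month and the previous peak. However, I implemented it as a for loop, and it runs slowly.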

Is there a more efficient/faster way to implement this in R?

The code takes a data.frame called returnsWithValues (specifically a tibble, since I have been using dplyr).

> structure(list(date = structure(c(789, 820, 850, 881, 911, 942
), class = "Date"), value = c(0.94031052, 0.930751624153046, 
0.926756311376762, 0.874209664097166, 0.843026010916249, 2.1), 
    peak = c(1, 1, 1, 1, 1, 2.1), drawdown = c(-0.05968948, -0.0692483758469535, 
    -0.0732436886232377, -0.125790335902834, -0.156973989083751, 
    0)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-6L))
# A tibble: 6 x 4
  date       value  peak drawdown
  <date>     <dbl> <dbl>    <dbl>
1 1972-02-29 0.940   1    -0.0597
2 1972-03-31 0.931   1    -0.0692
3 1972-04-30 0.927   1    -0.0732
4 1972-05-31 0.874   1    -0.126 
5 1972-06-30 0.843   1    -0.157 
6 1972-07-31 2.1     2.1   0   

I have implemented drawdown.duration using a for loop:

returnsWithValues <- returnsWithValues %>% mutate(drawdown.duration = NA)

    # add drawdown.duration col
    for (row in 1:nrow(returnsWithValues)) {
        if(returnsWithValues[row,"value"] == returnsWithValues[row,"peak"]) {
            returnsWithValues[row,"drawdown.duration"] = 0
        } else {
            if(row == 1){
                returnsWithValues[row,"drawdown.duration"] = 1
            } else {
                returnsWithValues[row,"drawdown.duration"] = returnsWithValues[row - 1,"drawdown.duration"] + 1
            }
        }
    }

The correct answer is as follows:

> returnsWithValues
# A tibble: 6 x 5
  date       value  peak drawdown drawdown.duration
  <date>     <dbl> <dbl>    <dbl>             <dbl>
1 1972-02-29 0.940   1    -0.0597                 1
2 1972-03-31 0.931   1    -0.0692                 2
3 1972-04-30 0.927   1    -0.0732                 3
4 1972-05-31 0.874   1    -0.126                  4
5 1972-06-30 0.843   1    -0.157                  5
6 1972-07-31 2.1     2.1   0                      0

【Discussion】:

    Tags: r performance dplyr


    【Solution 1】:

    As requested, I will drop the for loop and use indexing instead.

    indices <- function(returnsWithValues){
        # Logical vector: TRUE where value equals peak, FALSE otherwise
        indices_logical <- returnsWithValues[["value"]] == returnsWithValues[["peak"]]
        indices_to_zero <- which(indices_logical)    # rows sitting at a peak
        indices_drawdown <- which(!indices_logical)  # rows in drawdown
        returnsWithValues[indices_to_zero, "drawdown.duration"] <- 0
        # As far as I understand, this is what you compute. Note that the count
        # runs 1, 2, 3, ... over all drawdown rows and does not restart after a peak.
        returnsWithValues[indices_drawdown, "drawdown.duration"] <- seq_along(indices_drawdown)
        returnsWithValues
    }
    

    Here is your for loop wrapped in a function.

    for_loop<-function(returnsWithValues){
        # add drawdown.duration col
        for (row in 1:nrow(returnsWithValues)) {
            if(returnsWithValues[row,"value"] == returnsWithValues[row,"peak"]) {
                returnsWithValues[row,"drawdown.duration"] = 0
            } else {
                if(row == 1){
                    returnsWithValues[row,"drawdown.duration"] = 1
                } else {
                    returnsWithValues[row,"drawdown.duration"] = returnsWithValues[row - 1,"drawdown.duration"] + 1
                }
            }
        }
        returnsWithValues
    }
    

    Here is the benchmark against your for loop.

    microbenchmark::microbenchmark(
          "for loop" = flp<-for_loop(returnsWithValues),
          indices = ind<-indices(returnsWithValues),
          times = 10
    )
    
    Unit: microseconds
            expr      min       lq     mean    median       uq      max neval
        for loop 8671.228 8699.555 8857.198 8826.8185 8967.631 9196.708    10
         indices   92.781   99.349  106.328  102.8385  115.360  122.749    10
    all.equal(ind,flp)
    [1] TRUE
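
    The two results agree on the sample data, which contains a single drawdown run. For data with several peaks, the counter would have to restart after each one. A minimal base R sketch of that idea, still index-based, is shown below. The function name `drawdown_duration` is hypothetical, and the sketch assumes `value` and `peak` are numeric vectors of equal length with no missing values.

```r
# Hypothetical helper (not from the original answer): rows since the last peak.
drawdown_duration <- function(value, peak) {
  idx <- seq_along(value)
  at_peak <- value == peak
  # Index of the most recent peak row at or before each row (0 if none yet)
  last_peak <- cummax(ifelse(at_peak, idx, 0L))
  # Distance to that peak row; rows before the first peak count up from 1
  idx - last_peak
}

# Sample data from the question
value <- c(0.94031052, 0.930751624153046, 0.926756311376762,
           0.874209664097166, 0.843026010916249, 2.1)
peak  <- c(1, 1, 1, 1, 1, 2.1)
drawdown_duration(value, peak)
```

    The result can be assigned to a column directly, for example `returnsWithValues$drawdown.duration <- drawdown_duration(returnsWithValues$value, returnsWithValues$peak)`.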
    


      【Solution 2】:

      I think this will do it, as long as each peak value is unique and is not repeated later in another group:

      returnsWithValues %>%
          group_by(peak) %>%
          mutate(drawdown.duration = cumsum(value != peak))
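
      As a quick check, the snippet above can be run on the sample data from the question. This is only a sketch: the tibble is rebuilt by hand from the question's `dput()` output (the `drawdown` column is omitted), and `result` is a name chosen here for illustration.

```r
library(dplyr)

# Sample data copied from the question's dput() output
returnsWithValues <- tibble(
  date  = as.Date(c(789, 820, 850, 881, 911, 942), origin = "1970-01-01"),
  value = c(0.94031052, 0.930751624153046, 0.926756311376762,
            0.874209664097166, 0.843026010916249, 2.1),
  peak  = c(1, 1, 1, 1, 1, 2.1)
)

result <- returnsWithValues %>%
  group_by(peak) %>%
  # Within each peak level, count the rows seen so far that are below the peak
  mutate(drawdown.duration = cumsum(value != peak)) %>%
  ungroup()

result$drawdown.duration
```

      The final `ungroup()` is optional; it just returns a plain tibble so that later operations are not applied per group.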
      

      If you do have repeated peak values, you may need a way to group within runs of consecutive peak values, e.g.

      returnsWithValues %>%
          # Start counting the number of groups at 1, and every time
          #   peak changes compared to the previous row, add 1
          mutate(peak_group = cumsum(c(1, peak[-1] != head(peak, -1)))) %>%
          group_by(peak_group) %>%
          mutate(drawdown.duration = cumsum(value != peak))
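
      One case that neither grouping restarts on is a value climbing back to exactly the same peak level: `peak` does not change between rows, so the count keeps running, whereas the loop in the question resets to 0 whenever `value == peak`. A sketch of a variant that follows the loop, assuming that reset is what is wanted, starts a new group at every peak row. The data below is made up for illustration, and `toy`, `run_id`, `by_peak` and `by_run` are hypothetical names.

```r
library(dplyr)

# Made-up data: the value returns to the same peak level in row 3
toy <- tibble(
  value = c(1, 0.9, 1, 0.95, 0.9),
  peak  = c(1, 1,   1, 1,    1)
)

# Grouping on peak alone: the count does not restart at row 3
by_peak <- toy %>%
  group_by(peak) %>%
  mutate(drawdown.duration = cumsum(value != peak)) %>%
  ungroup()

# Grouping on runs that begin at each peak row: the count restarts at row 3
by_run <- toy %>%
  mutate(run_id = cumsum(value == peak)) %>%
  group_by(run_id) %>%
  mutate(drawdown.duration = cumsum(value != peak)) %>%
  ungroup()

by_peak$drawdown.duration
by_run$drawdown.duration
```

      Rows before the first peak all get `run_id` 0, so they count up from 1 just as in the original loop.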
      
