【Question title】: Optimizing/alternative ddply, transform, and na.omit
【Posted】: 2017-01-26 04:12:34
【Question description】:

I have the following situation:

library(TTR)
library(scales)
library(dplyr)
library(tidyr)

#prepare data
df = data.frame(X=seq.int(100000), high = runif(100000, 1, 100), low = runif(100000, 1, 100), close = runif(100000, 1, 100))

#some calculation
df$cci14 = rescale(CCI(df[,c('high','low','close')], n=14, maType=SMA), to=c(0,100), from=c(-100,100))

#filtering
df$select = df$cci14 >=100 | lag(df$cci14)>=100 | lead(df$cci14)>=100 | df$cci14 <=0 | lag(df$cci14)<=0 | lead(df$cci14)<=0


ff = df %>% filter(select) %>% group_by(group1 = cumsum(c(1, diff(X) != 1))) %>% dplyr::mutate(len = NA) %>% dplyr::mutate(Y = seq(n())) %>% spread(Y, cci14) %>% ungroup()

#sync column values high,low,close
ff = (ff %>% group_by(group1) %>% mutate(X=first(X)) %>% mutate(high=max(high))  %>% mutate(low=min(low))   %>% mutate(close=last(close))  )

library(plyr) # have to detach afterward, without this, ddply runs with unexpected result

# this part is very slow, any alternative?
# (each %>% must end a line; a line that starts with %>% is a syntax error in R)
ff %>% group_by(group1) %>%
     ddply(.(group1), transform, `1`=na.omit(`1`)[1]) %>%
     ddply(.(group1), transform, X2=na.omit(X2)[1]) %>%
     ddply(.(group1), transform, X3=na.omit(X3)[1]) %>%
     ddply(.(group1), transform, X4=na.omit(X4)[1]) %>%
     ddply(.(group1), transform, X5=na.omit(X5)[1]) %>%
     ddply(.(group1), transform, X6=na.omit(X6)[1]) %>%
     ddply(.(group1), transform, X7=na.omit(X7)[1]) %>%
     ddply(.(group1), transform, X8=na.omit(X8)[1]) %>%
     ddply(.(group1), transform, X9=na.omit(X9)[1]) %>%
     ddply(.(group1), transform, X10=na.omit(X10)[1]) %>%
     ddply(.(group1), transform, X11=na.omit(X11)[1]) %>%
     ddply(.(group1), transform, X12=na.omit(X12)[1]) %>%
     ddply(.(group1), transform, X13=na.omit(X13)[1]) %>%
     ddply(.(group1), transform, X14=na.omit(X14)[1]) %>%
     ddply(.(group1), transform, X15=na.omit(X15)[1]) %>%
     ddply(.(group1), transform, X16=na.omit(X16)[1])
...
and more columns, depending on the data frame.

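The last part, the ddply chain, runs very slowly, especially when many columns have been generated.

Question: are there other options or suggestions for optimizing it? And how can it be applied to all of the trailing columns?

【Comments】:

    Tags: r optimization transform plyr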


    【Solution 1】:

    Just found this, though it uses library(data.table):

    setDT(ff)[, lapply(.SD, na.omit) , by = group1]
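The ddply chain in the question keeps every row and fills each column with its first non-missing value within the group. A minimal data.table sketch of that same behavior follows. The toy table and the helper name `first_non_na` are made up for illustration and are not part of the original post. It assumes every column other than `group1` should be filled.

```r
library(data.table)

# Toy stand-in for `ff` (hypothetical data, not from the question)
toy <- data.table(
  group1 = c(1, 1, 2, 2),
  X1 = c(10, NA, NA, 30),
  X2 = c(NA, 20, 40, NA)
)

# Hypothetical helper: first non-NA value of a vector (NA if all are missing)
first_non_na <- function(x) x[!is.na(x)][1]

# Every column except the grouping column
cols <- setdiff(names(toy), "group1")

# Overwrite each column by reference with its first non-NA value per group;
# the number of rows is unchanged, as with ddply(..., transform, ...)
toy[, (cols) := lapply(.SD, first_non_na), by = group1, .SDcols = cols]
```

Because `:=` updates the table by reference in a single grouped pass, all columns are handled at once instead of one ddply call per column.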
    

    【Comments】:

      【Solution 2】:

      Another option is dplyr:

      library(dplyr)
      ff %>%
          group_by(group1) %>%
          mutate_each(funs(na.omit))
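Note that `mutate_each()` and `funs()` are deprecated in current dplyr releases. A rough equivalent using `across()` (available from dplyr 1.0.0) might look like the sketch below. The toy tibble is hypothetical, and the lambda takes the first non-NA value explicitly to mirror the `na.omit(x)[1]` used in the question.

```r
library(dplyr)

# Toy stand-in for `ff` (hypothetical data, not from the question)
toy <- tibble(
  group1 = c(1, 1, 2, 2),
  X1 = c(10, NA, NA, 30),
  X2 = c(NA, 20, 40, NA)
)

res <- toy %>%
  group_by(group1) %>%
  # across(everything()) skips the grouping column inside a grouped mutate;
  # each remaining column is replaced by its first non-NA value per group
  mutate(across(everything(), ~ .x[!is.na(.x)][1])) %>%
  ungroup()
```

The row count stays the same, with each group's value repeated across its rows, which is what the repeated `ddply(..., transform, ...)` calls produced.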
      

      【Comments】:
