【Question title】: Applying a function to consecutive subvectors of unequal and variable size
【Posted】: 2015-01-16 14:55:15
【Question description】:

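This question is best explained through an example, and it differs slightly from the one asked here: Applying function to consecutive subvectors of equal size

Suppose I have price data for some companies, "MMM" and "ABT", for example (the dates of the prices are stored in the row names of this data frame):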

> a
            MMM  ABT
1991-01-02 11.01 2.58
1991-01-03 10.83 2.48
1991-01-04 10.80 2.43
1991-01-07 10.67 2.39
1991-01-08 10.39 2.42
1991-01-09 10.18 2.42
1991-01-10 10.33 2.43
1991-01-11 10.59 2.44
1991-01-14 10.60 2.38
1991-01-15 10.54 2.39

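First, I want to split the dates in this data frame into consecutive intervals of equal length j (counted in rows). Suppose j = 2. These are the intervals we would look at: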

interval1 is from 1991-01-02 to 1991-01-03
interval2 is from 1991-01-04 to 1991-01-07
interval3 is from 1991-01-08 to 1991-01-09
interval4 is from 1991-01-10 to 1991-01-11
interval5 is from 1991-01-14 to 1991-01-15

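I want to include the last date if it is not already among the interval ends, which is why I use unique() below. So, given an interval length j, we can build the intervals like this (there is probably a better way to generate them):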

beg <- rownames(a)[seq(1,nrow(a),2)]
# case for j = 2: 
# [1] "1991-01-02" "1991-01-04" "1991-01-08" "1991-01-10" "1991-01-14"

end <- rownames(a)[seq(1,nrow(a),2)+1]
end <- unique(c(end[!is.na(end)],rownames(a)[nrow(a)]))
# case for j = 2: 
# [1] "1991-01-03" "1991-01-07" "1991-01-09" "1991-01-11" "1991-01-15"
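
As a minimal sketch of one alternative way to generate the same intervals (not part of the original post), each row can be given an interval id by integer division, and the first and last date of each id then serve as the interval bounds. The names `dates` and `interval_id` are introduced here for illustration only, and the sketch assumes the row names are already sorted by date.

```r
# Row names of `a` from the question (dates stored as "YYYY-MM-DD" strings).
dates <- c("1991-01-02", "1991-01-03", "1991-01-04", "1991-01-07", "1991-01-08",
           "1991-01-09", "1991-01-10", "1991-01-11", "1991-01-14", "1991-01-15")
j <- 2  # interval length in rows

# Rows 1..j get id 1, rows (j+1)..2j get id 2, and so on.
interval_id <- (seq_along(dates) - 1) %/% j + 1

# First and last date of every interval; a shorter final interval
# (when the row count is not a multiple of j) needs no special handling.
beg <- dates[!duplicated(interval_id)]
end <- dates[!duplicated(interval_id, fromLast = TRUE)]
```

With j = 2 this yields the same `beg` and `end` vectors as the code above, without the extra `unique()` step.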

From here, I have another data frame (b) that contains data like this:

> b
           portfolio_return
1991-01-09      0.010524144
1991-01-10     -0.010706638
1991-01-11     -0.015665796
1991-01-14     -0.015151515
1991-01-15      0.055000000
1991-01-16     -0.052173913
1991-01-21     -0.010204082

What I want to do is find the average within each interval. For example:

interval1_values = NA  # no returns fall in this interval
interval2_values = NA  # no returns fall in this interval
interval3_values = c(0.010524144)
interval4_values = c(-0.010706638,-0.015665796)
interval5_values = c(-0.015151515, 0.055000000)

#From this we can then easily calculate the average over each interval.

average1 = mean(interval1_values)
average2 = mean(interval2_values)
#etc...
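
The averaging step can be sketched in one call once every return has an interval number. This is an illustrative sketch rather than the asker's code: the vectors `returns` and `interval` are typed in by hand from the example above (only the five returns that fall inside an interval), and the factor is given all five levels so that empty intervals show up as NA.

```r
# Returns from `b` that fall inside one of the five intervals (j = 2 case).
returns  <- c(0.010524144, -0.010706638, -0.015665796, -0.015151515, 0.055000000)
# Interval number of each return, taken from the example listing above.
interval <- c(3, 4, 4, 5, 5)

# tapply() applies mean() per interval; listing all levels keeps
# intervals without any return in the result, as NA.
averages <- tapply(returns, factor(interval, levels = 1:5), mean)
averages
```

Intervals 1 and 2 come out as NA, and intervals 3 to 5 hold the means of one, two, and two returns respectively.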

My current solution looks like this:

averages_interval <- function(a,b,j){
  # replace 2 with j
  beg <- rownames(a)[seq(1,nrow(a),j)]

  # replace 2 with j
  # replace 1 with j-1
  end <- rownames(a)[seq(1,nrow(a),j)+j-1]
  end <- unique(c(end[!is.na(end)],rownames(a)[nrow(a)]))

  c <- rownames(b)

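  tmp <- c()
  j <- 1
  # these loops match our c-vector values in their proper interval
  # for j = 2 case, it places c[1] in interval3, c[2] in interval4, and so on...
  for(i in 1:length(c)){

    found <- NA  # stays NA when c[i] lies in no interval
    while(j <= length(end)){

      if(c[i]>=beg[j] && c[i]<=end[j]){
        found <- j
      }
      j <- j+1
    }
    # one entry per row of b, so that data.frame(b, group=tmp) lines up
    tmp <- c(tmp,found)
    # resume the next search at the last matched interval (or at 1 if none yet)
    matched <- tmp[!is.na(tmp)]
    j <- if(length(matched) > 0) matched[length(matched)] else 1
  }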

  df <- data.frame(b,group=tmp)
  df <- df[complete.cases(df),]
  #row_names <- rownames(df)
  # variable needed to store dates if needed later on since we use data.table
  df <- data.table(df)
  averages <- df[,list(mean=mean(portfolio_return)),by=group][[2]]


  return(averages)

}

###### for j = 2
       group        mean
1:     2  0.01052414
2:     3  0.01318622
3:     4  0.01992424

Is there a more efficient way to solve this?

Thanks very much.

【Question discussion】:

    Tags: r average mean


    【Solution 1】:

    Below is a solution that uses data.table:

    # reading in your data
    x <- read.table(text='MMM  ABT
    1991-01-02 11.01 2.58
    1991-01-03 10.83 2.48
    1991-01-04 10.80 2.43
    1991-01-07 10.67 2.39
    1991-01-08 10.39 2.42
    1991-01-09 10.18 2.42
    1991-01-10 10.33 2.43
    1991-01-11 10.59 2.44
    1991-01-14 10.60 2.38
    1991-01-15 10.54 2.39', header=TRUE, row.names=1)
    #
    y <- read.table(text='portfolio_return
    1991-01-09      0.010524144
    1991-01-10     -0.010706638
    1991-01-11     -0.015665796
    1991-01-14     -0.015151515
    1991-01-15      0.055000000
    1991-01-16     -0.052173913
    1991-01-21     -0.010204082', header=TRUE, row.names=1)
    # load required packages
    require(data.table)
    require(zoo)
    # setting to data.table
    setDT(x, keep.rownames=TRUE)
    setDT(y, keep.rownames=TRUE)
    # defining the intervals 
    # DOUBLE CHECK THIS; I DON'T UNDERSTAND HOW YOU DEFINE THESE
    x[, interval := c(1, rep(1:nrow(x), each=2))[1:nrow(x)]] 
    # merge data
    res <- merge(x, y, by='rn', all = TRUE)
    # setting the date as key
    res[, rn := as.Date(rn)]
    setkey(res, 'rn')
    # perhaps carry forward last observation?
    # THIS MAY NOT BE WHAT YOU WANT... FEEL FREE TO CHANGE
    res[, interval := na.locf(interval)]
    # calculate means, start and end of interval
    res[, list(start = min(rn), 
               end = max(rn),
               mean_return = mean(portfolio_return)), by=interval]
    
    ##    interval      start        end  mean_return
    ## 1:        1 1991-01-02 1991-01-04           NA
    ## 2:        2 1991-01-07 1991-01-08           NA
    ## 3:        3 1991-01-09 1991-01-10 -0.000091247
    ## 4:        4 1991-01-11 1991-01-14 -0.015408656
    ## 5:        5 1991-01-15 1991-01-21 -0.002459332
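
The interval column in this answer does not group the rows the way the question does, which is why the start and end dates in the output differ from interval1 to interval5 in the question. The short base R sketch below compares the two indexings for the ten rows of `x`. The variable names are made up for illustration, and the integer-division form is only one possible way to reproduce the question's pairs.

```r
n <- 10  # number of rows in x

# Indexing used in the answer above: the first interval gets three rows.
answer_ids <- c(1, rep(1:n, each = 2))[1:n]

# Indexing that follows the question (consecutive blocks of j = 2 rows).
question_ids <- (seq_len(n) - 1) %/% 2 + 1

rbind(answer_ids, question_ids)
```

If the question's grouping is what is wanted, the second expression could presumably replace the right-hand side of the `interval :=` assignment, with `2` swapped for the desired interval length.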
    

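    【Discussion】:

    • Thanks! What I am really interested in is the final vector of mean_returns. Is there a way to get it without using data.frames and without storing all the other data?
    • Also, setDT(x, keep.rownames=TRUE) seems to throw an error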
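
One way to get only the vector of means while working on plain vectors is sketched below in base R. It is not the answerer's method: `interval_means` is a hypothetical helper written for this illustration, it uses the interval definition from the question (blocks of j consecutive rows), and it assumes the dates of `a` are sorted in ascending order, which `findInterval()` requires.

```r
# Hypothetical helper (not from the original posts): one mean per interval.
interval_means <- function(a_dates, b_dates, returns, j) {
  a_dates <- as.Date(a_dates)
  b_dates <- as.Date(b_dates)

  # Interval bounds: blocks of j consecutive rows of `a`.
  id  <- (seq_along(a_dates) - 1) %/% j + 1
  beg <- a_dates[!duplicated(id)]
  end <- a_dates[!duplicated(id, fromLast = TRUE)]

  # Index of the last interval start <= each date of `b` (0 if there is none).
  grp <- findInterval(as.numeric(b_dates), as.numeric(beg))
  # Dates before the first start or after their interval's end belong nowhere.
  outside <- grp == 0 | b_dates > end[pmax(grp, 1)]
  grp[outside] <- NA

  # Plain numeric vector; intervals without any return give NA.
  as.vector(tapply(returns, factor(grp, levels = seq_along(beg)), mean))
}

a_dates <- c("1991-01-02", "1991-01-03", "1991-01-04", "1991-01-07", "1991-01-08",
             "1991-01-09", "1991-01-10", "1991-01-11", "1991-01-14", "1991-01-15")
b_dates <- c("1991-01-09", "1991-01-10", "1991-01-11", "1991-01-14",
             "1991-01-15", "1991-01-16", "1991-01-21")
returns <- c(0.010524144, -0.010706638, -0.015665796, -0.015151515,
             0.055000000, -0.052173913, -0.010204082)

mean_returns <- interval_means(a_dates, b_dates, returns, j = 2)
mean_returns[!is.na(mean_returns)]  # drop the empty intervals if they are not needed
```

Because `findInterval()` does a binary search per date instead of scanning every interval, this also avoids the nested loops from the question. The returns dated 1991-01-16 and 1991-01-21 lie after the last interval and are ignored.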