【Question title】: R, Create data.frame conditional on colnames and row entries of existing df
【Posted】: 2013-08-10 02:48:57
【Question】:

This is a follow-up to this question.

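I am creating a data.frame based on the column names and specific row entries of an existing data.frame. Below is how I solved it with a for loop (thanks to @Roland for the suggestion; my real data violates a requirement of @eddi's answer), but it has now been running on the actual data set (200 x 500,000+, rows x cols) for more than two hours...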

(The data.frames generated below closely resemble the actual data.)

set.seed(1)
a <- data.frame(year=c(1986:1990),
                events=round(runif(5,0,5),digits=2))
b <- data.frame(year=c(rep(1986:1990,each=2,length.out=40),1986:1990), 
                region=c(rep(c("x","y"),10),rep(c("y","z"),10),rep("y",5)),
                state=c(rep(c("NY","PA","NC","FL"),each=10),rep("AL",5)),
                events=round(runif(45,0,5),digits=2))
d <- matrix(rbinom(200,1,0.5),10,20, dimnames=list(c(1:10), rep(1986:1990,each=4)))
e <- data.frame(id=sprintf("%02d",1:10), as.data.frame(d), 
                region=c("x","y","x","z","z","y","y","z","y","y"), 
                state=c("PA","AL","NY","NC","NC","NC","FL","FL","AL","AL"))


 for (i in seq_len(nrow(d))) {
   for (j in seq_len(ncol(d))) {
     d[i,j] <- ifelse(d[i,j]==0,
                      a$events[a$year==colnames(d)[j]],
                      b$events[b$year==colnames(d)[j] &
                               b$state==e$state[i] &
                               b$region==e$region[i]])
   }
 }

Is there a better/faster way to do this?

【Comments】:

    Tags: r for-loop dataframe apply


    【Solution 1】:

    A simpler approach (I think; it involves no melting, dcasting, or merging) is the following:

    First, your a and b arrays should be indexed by year (for a) and by year/state/region (for b):

    at = a$events; names(at) = a$year
    
    bt = tapply(b$events,list(b$year,b$state,b$region),function(x) min(x))
    # note: I used min(x) in tapply just to be on the safe side, so that the function always returns a scalar
    
    # we now create the result of the more complex case (lookup in b)
    ids = cbind(colnames(d)[col(d)],
                as.character(e$state[row(d)]),
                as.character(e$region[row(d)])
               )
    vals=bt[ids]; dim(vals)=dim(d)
    # and compute your desired result with the ifelse
    result = ifelse(d==0,at[colnames(d)[col(d)]],vals)
    # and that's it!
    

    This should be faster (it avoids the nested loops), but I have not profiled it. Let us know how it performs on your full data.
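
The indexing approach above can be checked against the nested loop on a tiny example. The sketch below is only an illustration: the toy values in `a`, `b`, `d`, and `e` are made up (they are not the question's data), and every year/state/region combination is assumed to exist in `b`, which may not hold for real data.

```r
# Toy data (hypothetical values, much smaller than the question's data)
a <- data.frame(year = 1986:1987, events = c(1.5, 2.5))
b <- data.frame(year   = c(1986, 1987, 1986, 1987),
                state  = c("NY", "NY", "PA", "PA"),
                region = c("x", "x", "y", "y"),
                events = c(10, 20, 30, 40))
e <- data.frame(state = c("NY", "PA", "NY"), region = c("x", "y", "x"))
d <- matrix(c(0, 1, 1,
              1, 0, 1),
            nrow = 3, dimnames = list(1:3, c("1986", "1987")))

# Lookup structures: a named vector by year, a 3-d array by year/state/region
at <- setNames(a$events, a$year)
bt <- tapply(b$events, list(b$year, b$state, b$region), min)

# One row of `ids` per cell of d (column-major order); a character matrix
# with one column per dimension indexes the array by its dimnames
ids <- cbind(colnames(d)[col(d)],
             as.character(e$state[row(d)]),
             as.character(e$region[row(d)]))
vals <- bt[ids]
dim(vals) <- dim(d)
vectorised <- ifelse(d == 0, at[colnames(d)[col(d)]], vals)

# Reference result from an explicit nested loop
looped <- d
for (i in seq_len(nrow(d))) {
  for (j in seq_len(ncol(d))) {
    yr <- colnames(d)[j]
    looped[i, j] <- if (d[i, j] == 0) {
      a$events[a$year == yr]
    } else {
      b$events[b$year == yr & b$state == e$state[i] & b$region == e$region[i]]
    }
  }
}

print(all.equal(vectorised, looped))
```

Combinations missing from `b` show up as `NA` in `bt`, so the vectorised version yields `NA` in those cells rather than failing.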

    【Comments】:

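    • Thanks @amit, it works great and is much faster than the nested loop (tried on 1,000 rows of data). elapsed: nstloop 18.39 index 0.06
    • Wow! It ran on the whole data set in 3.64 seconds! My for loop has been running since yesterday, 24+ hours! Thanks a lot, everyone, for your help.
    【Solution 2】: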
    # This will require a couple of merges,
    # but first let's convert the data to long form and extract year as integer
    # I convert result to data.table, since that's easier and faster to deal with
    # Note: it *is* possible to do the melt/dcast entirely in data.table framework,
    # but it's a hassle right now - there is a FR iirc about that
    library(reshape2)
    library(data.table)
    
    dt = data.table(melt(e))[, year := as.integer(sub('X([0-9]*).*','\\1',variable))]
    
    # set key for merging and merge with b and a
    setkey(dt, year, region, state)
    dt.result = data.table(a, key = 'year')[
                   data.table(b, key = c('year', 'region', 'state'))[dt]]
    
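    # now we can compute the value we want
    # (events.1 is b's events column as named by the join in the data.table
    # version used here; newer versions may name the clashing column i.events)
    dt.result[, final.value := value * events.1 + (!value) * events]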
    
    # dcast back
    e.result = dcast(dt.result, id + region + state ~ variable,
                     value.var = 'final.value')
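
One thing worth checking before relying on the arithmetic form `value * x + (!value) * y`: it only matches an `ifelse` selection when neither operand is `NA`, because `0 * NA` is `NA` in R. The sketch below uses made-up vectors (`events_a`, `events_b` are hypothetical names, not columns from the answer) and assumes a merge could leave `NA` where `b` has no matching row.

```r
value    <- c(0, 1, 0, 1)          # 0/1 indicator, as in d
events_a <- c(1.5, 1.5, 2.5, 2.5)  # year-level lookup, assumed always present
events_b <- c(NA, 10, NA, 40)      # state/region lookup; NA where nothing matched

# Arithmetic selection: the NA in events_b propagates even when value is 0
arithmetic <- value * events_b + (!value) * events_a

# Element-wise selection: the unused branch's NA is ignored
selected <- ifelse(value == 1, events_b, events_a)

print(arithmetic)
print(selected)
```

If the joined column can contain `NA`, `ifelse` (or `data.table::fifelse`) is the safer way to pick between the two columns.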
    

    【Comments】:
