Multiple loops with data.table in R: answers

【Question title】: Multiple loops with data.table in R
【Posted】: 2017-05-03 17:38:18
【Question description】:

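I have been trying to handle multiple loops with data.table for a while now, and it has been frustrating. In SQL it is very intuitive, but in R I keep running into problems.

For example, I want to read a txt file (I have hundreds of them, each about 1 GB), do a calculation (sum price and quantity where time > my.time and for some selected isin, grouped by my.time, isin and price), write the result to a csv file, and remove the raw txt data from R's memory; then repeat these calculations for all txt files one by one, appending to the output csv file.

Let's start with sample data (very small, just two identical files for illustration):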

    library(data.table)

    # one timestamp per second over a full day, as "HH:MM:SS" strings
    time  <- format(seq.POSIXt(as.POSIXct(Sys.Date()), as.POSIXct(Sys.Date() + 1), by = "1 sec"), "%H:%M:%S")
    n     <- length(time)
    isin  <- paste("US", 1:n, sep = "")
    price <- rnorm(n, 101, 1)
    quant <- rnorm(n, 5, 1)
    dt    <- data.table(time, isin, price, quant)
    write.table(dt, "raw.txt",  append = FALSE, sep = ",", col.names = TRUE, row.names = FALSE)
    write.table(dt, "raw2.txt", append = FALSE, sep = ",", col.names = TRUE, row.names = FALSE)

    # list.files() takes a regular expression, not a shell glob
    my.files <- list.files(pattern = "^raw.*\\.txt$")
    my.time  <- format(seq.POSIXt(as.POSIXct(Sys.Date()), as.POSIXct(Sys.Date() + 1), by = "5 min"), "%H:%M:%S")
    my.isin  <- c("US100", "US150", "US225", "US250", "US1050")

Then I tried these two simple loops:

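    for (i in my.files) {
      dt <- fread(i)                      # read each file once, not once per j
      for (j in my.time) {
        res <- dt[isin %in% my.isin & time > j,
                  .(sprice = sum(price), squant = sum(quant), time.my = j),
                  by = .(isin, price)]
        # write the header only when the output file does not exist yet
        write.table(res, "output.csv", append = TRUE, sep = ",",
                    col.names = !file.exists("output.csv"), row.names = FALSE)
      }
      rm(dt)                              # free memory before the next file
    }

Second edit: the loop over j finally started working for me (thanks to the time.my = j part, which was shown in bold in the original post). Maybe it could also work without a for loop and give the same result?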

Many thanks for your help!
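The closing question, whether the inner for loop can be dropped, could be approached roughly as follows. This is only a sketch under assumptions: the toy table stands in for one raw file, its values are made up, and the output path is a hypothetical temporary file. It runs one aggregation per threshold with lapply(), stacks the pieces with rbindlist(), and writes once per file with fwrite().

```r
library(data.table)

# Toy stand-in for one raw file (values are made up for illustration).
dt <- data.table(
  time  = c("00:01:00", "00:04:00", "00:06:00", "00:11:00"),
  isin  = c("US100", "US100", "US150", "US100"),
  price = c(101, 101, 100, 102),
  quant = c(5, 4, 6, 3)
)
my.isin <- c("US100", "US150")
my.time <- c("00:00:00", "00:05:00", "00:10:00")

# One aggregation per threshold, collected in a list and stacked once.
res <- rbindlist(lapply(my.time, function(th) {
  dt[isin %in% my.isin & time > th,
     .(sprice = sum(price), squant = sum(quant), time.my = th),
     by = .(isin, price)]
}))

# A single write per input file. With append = TRUE, fwrite() omits the
# header when the file already exists, so repeated calls stay consistent.
out <- tempfile(fileext = ".csv")   # hypothetical output path
fwrite(res, out, append = TRUE)
```

The work per threshold is the same as in the loop, but the file is read once and written once, which matters more than the loop itself when each input is around 1 GB.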

【Question discussion】:

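In what way did it not work for you? Did you read the warnings and error messages your code produced?
Yes, at first I got this message: Error in [.data.table(dt, which(isin %in% my.isin & time > my.time), : The items in the 'by' or 'keyby' list are length (86401,1,86401). Each must be same length as rows in x or number of rows returned by i (0).
OK, so it is complaining that you used j in by. Maybe you need to step back and think about what you were trying to do with it there..? (Most of your code looks very different from what I usually see, so I don't quite follow it.)
I want to build cumulative trade statistics every 5 minutes: for some selected ISINs I am trying to find the sums of all prices and quantities where the actual trade time is later than my defined time points (every 5 minutes). The result should be grouped by ISIN, unique price (since the same price can occur more than once), and the defined time (every 5 minutes).
@Linas Instead of the extra loop (every 5 minutes), you could try a non-equi join on the time column. Also, don't group by a double (floating-point) column such as price; format it to a defined precision when grouping.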
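The non-equi join suggested in the last comment might look something like the sketch below. It is an illustration under assumptions, not the commenter's code: the toy values, the helper column price2 and the table name thresholds are all made up here. Each threshold is matched to every trade with a later time, and the matched rows are then grouped by threshold, isin and a price rounded to two decimals.

```r
library(data.table)

# Toy trades (made-up values); times are stored as ITime so they compare numerically.
dt <- data.table(
  time  = as.ITime(c("00:01:00", "00:04:00", "00:06:00", "00:11:00")),
  isin  = c("US100", "US100", "US150", "US100"),
  price = c(101.004, 101.001, 100, 102),
  quant = c(5, 4, 6, 3)
)
# Group on a price rounded to a fixed precision rather than on the raw double.
dt[, price2 := round(price, 2)]

# The "external" 5-minute thresholds, held in their own table (hypothetical name).
thresholds <- data.table(my.time = as.ITime(c("00:00:00", "00:05:00", "00:10:00")))

# Non-equi join: for every threshold keep all trades with time > my.time.
# i.my.time is the threshold value coming from the `thresholds` table.
joined <- dt[thresholds, on = .(time > my.time),
             nomatch = 0L, allow.cartesian = TRUE,
             .(my.time = i.my.time, isin, price2, price, quant)]

# Ordinary grouped aggregation on the joined rows.
res <- joined[, .(sprice = sum(price), squant = sum(quant)),
              by = .(my.time, isin, price2)]
```

The joined table has one row per (threshold, matching trade) pair, so for a full day of one-second data and 289 thresholds it can become large; filtering on isin before the join keeps it manageable.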

Tags: r loops for-loop data.table dt

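【Solution 1】:

The problem you are running into is that the which statement returns zero rows. First, I would convert your time to a time type. Then I create a 5-minute grouping variable.

This will aggregate your table first.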

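# convert the character times to data.table's ITime (integer seconds since midnight)
dt[, time := as.ITime(strptime(time, format = "%H:%M:%S"))]
# label each row with the nearest 5-minute mark (300 s), formatted as "HH:MM"
dt[, time5 := format(strptime("1970-01-01", "%Y-%m-%d", tz = "UTC") +
                       round(as.numeric(time) / 300) * 300, "%H:%M")]

# aggregate per 5-minute label, price and isin, then keep the selected isin
dt[, list(sprice = sum(price), squant = sum(quant)), by = c("time5", "price", "isin")][isin %in% my.isin]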


#    time5     price    isin    sprice   squant
# 1: 00:00 102.46668     US1 102.46668 3.002960
# 2: 00:00  99.02186     US2  99.02186 5.253252
# 3: 00:00 100.23665     US3 100.23665 6.153950
# 4: 00:00 102.21466     US4 102.21466 3.461051
# 5: 00:00 100.97890     US5 100.97890 5.893336

You could then filter it by your my.isin, or by time5 greater than a custom time?

【Discussion】:

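Thanks theArun! But what does t1 mean in your example?
Ah, sorry. That was time
Thanks again. But I need to compare time against an "external" vector that is not in the data table (because I need to process my raw data this way), i.e. without the second step (creating time5). Something like this (for simplicity we can drop isin): my.time<-format(seq.POSIXt(as.POSIXct(Sys.Date()), as.POSIXct(Sys.Date()+1), by = "5 min"),"%H:%M:%S") and then dt[, list(sprice = sum(price),squant= sum(quant)),by =c("my.time","price")][time>my.time] But I get this error again: The items in the 'by' or 'keyby' list are length (289,86401). Each must be same length...
That may be the cause: my.time is a list of variables of length 28986401. That is a lot of variables to group by. You could create my.time inside the data.table.
@Arun. my.time actually has length 289 only (there is a comma in 289,86401). I also tried creating another data table from my specified times, my.time.dt<-data.table(my.time), but that didn't help when I ran dt[, list(sprice = sum(price),squant= sum(quant)),by = c("my.time.dt","price")][time>my.time.dt]. Error: by=c(...), key(...) or names(...) must evaluate to 'character'. Maybe apply could be used for this loop?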
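To make the length error in this thread concrete, here is a small sketch on made-up data. It assumes the reading that by needs one value per row of the table, which is why a 3-element (or 289-element) external vector is rejected. It then follows the suggestion to create the interval label inside the data.table, using base R's findInterval(). The column name slot and the helper secs() are hypothetical. Note that this produces per-interval sums, not the cumulative "time > threshold" sums described in the question.

```r
library(data.table)

dt <- data.table(
  time  = c("00:01:00", "00:04:00", "00:06:00", "00:11:00"),
  price = c(101, 101, 100, 102),
  quant = c(5, 4, 6, 3)
)
my.time <- c("00:00:00", "00:05:00", "00:10:00")   # external vector, length 3

# Grouping by the external vector fails: it has 3 elements, the table has 4 rows.
err <- tryCatch(
  dt[, .(sprice = sum(price)), by = .(my.time, price)],
  error = function(e) conditionMessage(e)
)

# Instead, give every row its own interval label, then group by that column.
# Assumes my.time is sorted and starts at or before the earliest trade time.
secs <- function(x) as.integer(as.ITime(x))        # seconds since midnight
dt[, slot := my.time[findInterval(secs(time), secs(my.time))]]
res <- dt[, .(sprice = sum(price), squant = sum(quant)), by = .(slot, price)]
```

Here err holds the message of the by-length error, while res has one row per (slot, price) pair.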