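【Posted】:2012-09-10 20:22:02
【Question】:
I have several large data frames (1 million+ rows x 6-10 columns) that I need to subset repeatedly. The subsetting step is the slowest part of my code, and I'm curious whether there is a faster way to do it.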
load(url("https://dl.dropbox.com/u/4131944/Temp/DF_IOSTAT_ALL.rda"))
start_in <- strptime("2012-08-20 13:00", "%Y-%m-%d %H:%M")
end_in <- strptime("2012-08-20 17:00", "%Y-%m-%d %H:%M")
system.time(DF_IOSTAT_INT <- DF_IOSTAT_ALL[DF_IOSTAT_ALL$date_stamp >= start_in & DF_IOSTAT_ALL$date_stamp <= end_in,])
> system.time(DF_IOSTAT_INT <- DF_IOSTAT_ALL[DF_IOSTAT_ALL$date_stamp >= start_in & DF_IOSTAT_ALL$date_stamp <= end_in,])
user system elapsed
16.59 0.00 16.60
dput(head(DF_IOSTAT_ALL))
structure(list(date_stamp = structure(list(sec = c(14, 24, 34,
44, 54, 4), min = c(0L, 0L, 0L, 0L, 0L, 1L), hour = c(0L, 0L,
0L, 0L, 0L, 0L), mday = c(20L, 20L, 20L, 20L, 20L, 20L), mon = c(7L,
7L, 7L, 7L, 7L, 7L), year = c(112L, 112L, 112L, 112L, 112L, 112L
), wday = c(1L, 1L, 1L, 1L, 1L, 1L), yday = c(232L, 232L, 232L,
232L, 232L, 232L), isdst = c(1L, 1L, 1L, 1L, 1L, 1L)), .Names = c("sec",
"min", "hour", "mday", "mon", "year", "wday", "yday", "isdst"
), class = c("POSIXlt", "POSIXt")), cpu = c(0.9, 0.2, 0.2, 0.1,
0.2, 0.1), rsec_s = c(0, 0, 0, 0, 0, 0), wsec_s = c(0, 3.8, 0,
0.4, 0.2, 0.2), util_pct = c(0, 0.1, 0, 0, 0, 0), node = c("bda101",
"bda101", "bda101", "bda101", "bda101", "bda101")), .Names = c("date_stamp",
"cpu", "rsec_s", "wsec_s", "util_pct", "node"), row.names = c(NA,
6L), class = "data.frame")
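The dput output above shows that date_stamp is stored as POSIXlt, which is a list of nine components per timestamp rather than a single number. One thing that may be worth trying is converting that column to POSIXct once and comparing against POSIXct bounds. The following is only a minimal sketch on synthetic data: the data frame DF, the column name date_ct, the UTC time zone, and the row count are all assumptions for illustration, and it has not been timed against the real DF_IOSTAT_ALL.

```r
# Synthetic stand-in for DF_IOSTAT_ALL (assumption: the real data is not used here).
# One sample every 10 seconds, with date_stamp stored as POSIXlt as in the dput output.
n <- 100000
base <- as.POSIXct("2012-08-20 00:00:14", tz = "UTC")
DF <- data.frame(cpu = runif(n), node = "bda101", stringsAsFactors = FALSE)
DF$date_stamp <- as.POSIXlt(base + 10 * (seq_len(n) - 1), tz = "UTC")

start_in <- strptime("2012-08-20 13:00", "%Y-%m-%d %H:%M", tz = "UTC")
end_in   <- strptime("2012-08-20 17:00", "%Y-%m-%d %H:%M", tz = "UTC")

# Baseline: compare the POSIXlt column directly, as in the question.
slow <- DF[DF$date_stamp >= start_in & DF$date_stamp <= end_in, ]

# Sketch: convert once to POSIXct (date_ct is a hypothetical column name),
# then compare plain numeric timestamps.
DF$date_ct <- as.POSIXct(DF$date_stamp)
start_ct <- as.POSIXct(start_in)
end_ct   <- as.POSIXct(end_in)
fast <- DF[DF$date_ct >= start_ct & DF$date_ct <= end_ct, ]
```

Wrapping each of the two subsetting lines in system.time() would show whether the conversion makes a difference on a given machine. Both approaches should select the same rows, so the change affects only speed, not the result.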
【Comments】:
-
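I'm sure you can make it faster, but the best approach will depend on the structure of DF_IOSTAT_ALL. Can you provide a small sample of that object? E.g. the output of dput(head(DF_IOSTAT_ALL)).
-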
@JoshuaUlrich I've added the requested output. Sorry for not including it the first time.
-
What kind of subsetting are you doing?
-
Out of interest, how slow is this?
-
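@BlueMagister I'm subsetting it into time slices. It's performance data from iostat on a cluster of machines. I have the start and end times of some performance tests, so I want to subset the data to a test's time range and then plot it. Hope that's what you were asking.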
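The workflow described in this comment (several tests, each with a start and end time, each needing its own slice) could be sketched roughly as follows. Everything here is hypothetical: the tests table, the test names run_a and run_b, their times, and the synthetic iostat data frame with a POSIXct column date_ct are made up for illustration and are not taken from the original data.

```r
# Hypothetical table of test windows; names and times are invented for illustration.
tests <- data.frame(
  test  = c("run_a", "run_b"),
  start = as.POSIXct(c("2012-08-20 13:00", "2012-08-20 18:00"), tz = "UTC"),
  end   = as.POSIXct(c("2012-08-20 17:00", "2012-08-20 19:30"), tz = "UTC"),
  stringsAsFactors = FALSE
)

# Synthetic iostat-like samples, one every 10 seconds, with a POSIXct timestamp.
n <- 20000
iostat <- data.frame(
  date_ct = as.POSIXct("2012-08-20 00:00:14", tz = "UTC") + 10 * (seq_len(n) - 1),
  cpu     = runif(n)
)

# Build one slice per test window (bounds inclusive), named after the test.
slices <- lapply(seq_len(nrow(tests)), function(i) {
  keep <- iostat$date_ct >= tests$start[i] & iostat$date_ct <= tests$end[i]
  iostat[keep, ]
})
names(slices) <- tests$test

# Each slice can then be plotted on its own, for example:
# plot(slices$run_a$date_ct, slices$run_a$cpu, type = "l")
```

Keeping the slices in a named list means each test's data can be passed to the same plotting code without repeating the subsetting expression.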
Tags: performance r dataframe subset