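[Posted]: 2020-03-12 16:50:30
[Question]:
Electric meter reads usually do not start and end on the first and last day of a month; instead they overlap the calendar unevenly. I am trying to use day-weighted-average logic to line up these read dates and compute a value for each individual month. I have attached sample code that builds a dataset similar to the one I am working with. Each row is a separate energy meter. Each group of 3 columns holds a start date, an end date, and the energy value used during that period.
I am working with hundreds of thousands of rows, and this process takes more than 20 minutes. I would love to be able to use data.table, but I am too new to it, and given the column structure of the data I am not sure how to get seq.Date to work.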
# Making the Fake Dataset
set.seed(123)
fake_rows = 10
{
  testdata <- replicate(fake_rows, {
    start_it <- as.Date('2019/01/01') + sample(-20:20, 1, T)
    track <- start <- end <- value <- c()
    for(i in 1:12){
      a <- seq.Date(start_it, length.out = sample(28:34,1), by="day")
      start[i] <- a[1]
      end[i] <- start_it <- a[length(a)]
      value[i] <- sample(1:200,1)
      track <- c(track, start[i], end[i], value[i])
    }
    return(track)
  })
  testdata <- as.data.frame(t(testdata))
  month_labels <- c(paste0("0",1:9), "10","11","12")
  start_dates <- sapply(month_labels, function(x) paste0("Start_Date_",x))
  end_dates <- sapply(month_labels, function(x) paste0("End_Date_",x))
  values <- sapply(month_labels, function(x) paste0("Value_",x))
  colnames(testdata) <- c(rbind(start_dates,end_dates,values))
  # replace columns with the dates
  for(i in c(start_dates, end_dates)){
    testdata[,i] <- as.Date(testdata[,i], origin = "1970-01-01")
  }
  testdata[2, 7:36] <- NA # some are missing dates and values
}
testdata
# Start_Date_01 End_Date_01 Value_01 Start_Date_02 End_Date_02 Value_02
#1 2019-01-11 2019-02-13 179 2019-02-13 2019-03-17 195
#2 2018-12-20 2019-01-21 164 2019-01-21 2019-02-22 81
#3 2019-01-05 2019-02-02 69 2019-02-02 2019-03-04 63
#4 2018-12-28 2019-01-29 50 2019-01-29 2019-02-25 34
#5 2019-01-15 2019-02-16 199 2019-02-16 2019-03-17 151
#6 2019-01-15 2019-02-16 94 2019-02-16 2019-03-21 24
#7 2019-01-05 2019-02-07 54 2019-02-07 2019-03-07 137
#8 2019-01-16 2019-02-15 108 2019-02-15 2019-03-19 177
#9 2018-12-25 2019-01-25 16 2019-01-25 2019-02-27 125
#10 2019-01-09 2019-02-07 10 2019-02-07 2019-03-10 54
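To make the day-weighting concrete, here is a minimal base R sketch for a single read, using the first period of row 1 printed above (2019-01-11 to 2019-02-13, value 179). The variable names are illustrative assumptions and are not part of the real code.

```r
# One read: first period of row 1 in the sample data
read_start <- as.Date("2019-01-11")
read_end   <- as.Date("2019-02-13")
read_value <- 179

# Every calendar day covered by the read, both endpoints included
days <- seq.Date(read_start, read_end, by = "day")

# Number of covered days that fall in each calendar month (1 = January, ...)
days_per_month <- table(as.integer(format(days, "%m")))
days_per_month

# Numerator contribution of this read to each month's day-weighted average
read_value * days_per_month
```

The read spans 34 days, so it contributes to both January and February in proportion to the days it covers in each.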
I took the data.frame approach below:
library(data.table)
# for each row, determine what monthly values would be
output <- matrix(NA, nrow = nrow(testdata), ncol = 12)
month_cols <- as.character(1:12)
for(i in 1:nrow(testdata)){
  x <- y <- vector("list", 12)
  for(j in 1:12){
    if(!is.na(testdata[i, start_dates[j]])){
      # get the counts of days in each month within the meter read period
      x[[j]] <- table(month(seq.Date(testdata[i, start_dates[j]], testdata[i, end_dates[j]], "day")))
      # multiply the meter read value by days in each month (the numerator of a day wtd avg)
      y[[j]] <- testdata[i, values[j]] * x[[j]]
    }
    months <- names(unlist(y))
    # day weighted average = Σ(value x Days) / Σ(Days)
    final <- tapply(unlist(y), months, sum) / tapply(unlist(x), months, sum)
    output[i,] <- final[match(month_cols, names(final))] # ordered in the case of missing months
  }
}
output
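Here the rows are the rows of the original dataset and the 12 columns are the estimated values for January through December. No specific year is attached, because I day-weight all values that fall in a given calendar month regardless of year.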
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
# [1,] 140.77778 187.82759 127.03125 46.16129 28.50000 81.25806 125.8750 91.00000 91.516129 120.1250 108.80645 32.87500
# [2,] 135.46875 81.00000 NA NA NA NA NA NA NA NA NA 164.00000
# [3,] 80.61290 63.41379 92.75000 91.77419 39.96970 45.74194 87.6875 20.87500 100.838710 196.4375 86.00000 154.43750
# [4,] 48.50000 31.10345 30.81250 130.35484 128.43750 48.70968 117.8125 27.81250 55.322581 137.0312 123.38710 145.65714
# [5,] 142.03571 177.48276 137.40625 106.48387 102.53125 116.00000 86.0000 102.25000 112.032258 153.4375 183.29032 96.50000
# [6,] 88.34286 62.62069 52.53125 126.87097 132.62500 128.19355 157.9688 103.43750 9.612903 30.6250 93.67742 131.09375
# [7,] 62.91429 116.96552 67.46875 72.83871 102.25000 171.32258 178.5000 112.50000 38.645161 131.0000 127.22581 96.43750
# [8,] 86.08696 141.31034 129.06250 35.77419 97.00000 122.93548 146.3125 128.18750 151.161290 199.1250 172.90323 74.75000
# [9,] 39.84375 119.13793 70.00000 180.64516 85.12500 49.64516 116.5000 92.28125 117.225806 46.1250 27.35484 29.16129
#[10,] 37.77143 43.37931 90.43750 51.45161 25.71875 120.22581 111.6562 126.81250 123.193548 46.0625 84.74194 97.53125
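As a sanity check on the formula, the January value for row 2 can be reproduced by hand from its two non-missing reads (2018-12-20 to 2019-01-21 with value 164, and 2019-01-21 to 2019-02-22 with value 81). This is a minimal sketch; the names reads, jan_days and jan_avg are illustrative assumptions.

```r
# The two non-missing reads of row 2 in the sample data
reads <- data.frame(
  start = as.Date(c("2018-12-20", "2019-01-21")),
  end   = as.Date(c("2019-01-21", "2019-02-22")),
  value = c(164, 81)
)

# Days of each read that fall in January (both endpoints of a read are counted)
jan_days <- sapply(seq_len(nrow(reads)), function(k) {
  d <- seq.Date(reads$start[k], reads$end[k], by = "day")
  sum(format(d, "%m") == "01")
})

# Day-weighted average = sum(value * days) / sum(days)
jan_avg <- sum(reads$value * jan_days) / sum(jan_days)
jan_avg  # 135.46875, matching output[2, 1]
```

Note that 2019-01-21 is counted in both reads, because each end date is also the next start date.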
How can I improve the performance?
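One possible direction, offered as a sketch under stated assumptions rather than a benchmarked answer: reshape the wide columns into one row per meter and read, expand each read into daily rows, and let data.table aggregate by meter and calendar month. The stand-in table wide copies only the first two read periods of rows 1 and 2 from the sample data, and the names wide, periods, long, daily, monthly and result are hypothetical.

```r
library(data.table)

# Stand-in for `testdata`: the first two read periods of rows 1 and 2 above
wide <- data.table(
  Start_Date_01 = as.Date(c("2019-01-11", "2018-12-20")),
  End_Date_01   = as.Date(c("2019-02-13", "2019-01-21")),
  Value_01      = c(179, 164),
  Start_Date_02 = as.Date(c("2019-02-13", "2019-01-21")),
  End_Date_02   = as.Date(c("2019-03-17", "2019-02-22")),
  Value_02      = c(195, 81)
)
periods <- c("01", "02")  # would be the 12 month labels for the full data

# Wide -> long: one row per meter and read period
long <- rbindlist(lapply(periods, function(p) {
  data.table(
    meter = seq_len(nrow(wide)),
    read  = p,
    start = wide[[paste0("Start_Date_", p)]],
    end   = wide[[paste0("End_Date_", p)]],
    value = wide[[paste0("Value_", p)]]
  )
}))
long <- long[!is.na(start) & !is.na(end)]  # drop missing reads

# One row per covered day, carrying the read value along
daily <- long[, .(day = seq(start, end, by = "day"), value = value),
              by = .(meter, read)]

# The mean over daily rows equals sum(value * days) / sum(days)
monthly <- daily[, .(wtd_avg = mean(value)), by = .(meter, mon = month(day))]

# Back to one row per meter, one column per calendar month present in the data
result <- dcast(monthly, meter ~ mon, value.var = "wtd_avg")
result
```

On this stand-in, meter 2 gets 135.46875 for January, the same as output[2, 1] above. The per-read seq() call is still a grouped operation, so whether this is fast enough on hundreds of thousands of rows would need to be measured.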
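[Comments]:
-
Just to make sure I understand: if the date range is 2019-01-11 to 2019-02-13, you want to count it as 21 days in January and 13 days in February. So the value of 179 is split so that 21/34 * 179 is allocated to January and 13/34 * 179 to February?
-
@eipi10 Yes, exactly. The way I wrote it, those February days are then weighted together with the February days from the other sets of columns, including Februaries from different years. But the math is as you say.
-
The example solution does not work for me; what package does the month function come from, lubridate? Where is month_cols defined? Can you share what output looks like?
-
Sorry, fixed now. month turns out to be a data.table function that I am using.
-
My code may also be incorrectly double-counting days... sum(testdata[1,values]) and sum(output[1,]) should be almost the same, assuming the 12 reads cover no more than 365 days of data...
Tags: r data.table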