Reconstructing a data.frame with gaps filled

【Question title】: Reconstructing a data.frame with gaps filled
【Posted】: 2014-11-22 01:49:23
【Question】:

I am very new to R and am working on a project that I need help with.

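I have a CSV file containing one year of data. However, there are gaps in the time series, and I need it to be evenly spaced at half-hour intervals (48 rows per day times 365 days gives 17520 rows for the full year). The gaps range from a single half hour to several days, and no rows exist for the missing timestamps. So, with the help of some other forum posts, I put together a script that imports the CSV into R, creates rows so that the timestamp column has the correct length, and then matches the data to the new timestamp column.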
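
For reference, the target grid can be built directly with seq(). This is only a sketch: the year 2013 and the UTC time zone are assumptions, since the post does not state either, and full_year is a hypothetical name.

```r
# Hypothetical year and time zone; substitute the ones your data uses.
full_year <- seq(from = as.POSIXct("2013-01-01 00:00:00", tz = "UTC"),
                 by = "30 min",
                 length.out = 48 * 365)

length(full_year)   # number of half-hourly stamps: 48 * 365
range(full_year)    # first and last timestamp of the grid
```

Using a fixed-offset zone such as UTC keeps daylight-saving shifts from complicating how local timestamps line up with the grid.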

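However, I have about three dozen data columns to match to the new timestamps, and the way I am doing it now is very inefficient. At the moment there is a data.frame (newdata4) with the correct timestamps. I then add a new column to that frame containing the original data from the missing4 data.frame: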

newdata4 <- as.data.frame(timestamp_corr)
newdata4$PAR_in_Avg <- missing4$PAR_in_Avg[pmatch(newdata4$timestamp_corr, missing4$timestamp)] # add data where there was an original timestamp
newdata4$PAR_in_Avg[is.na(newdata4$PAR_in_Avg)] <- -9999 # replace NAs with -9999

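In this example, PAR_in_Avg is one of the columns in the original CSV file. This works fine. But to get all of the columns into newdata4, I repeat these lines over and over: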

newdata4$PAR_in_Avg <- missing4$PAR_in_Avg[pmatch(newdata4$timestamp_corr, missing4$timestamp)] # add data where there was an original timestamp
newdata4$PAR_in_Avg[is.na(newdata4$PAR_in_Avg)] <- -9999 # replace NAs with -9999
newdata4$PAR_out_Avg <- missing4$PAR_out_Avg[pmatch(newdata4$timestamp_corr, missing4$timestamp)] # add data where there was an original timestamp
newdata4$PAR_out_Avg[is.na(newdata4$PAR_out_Avg)] <- -9999 # replace NAs with -9999
newdata4$Rn_meas_Avg <- missing4$Rn_meas_Avg[pmatch(newdata4$timestamp_corr, missing4$timestamp)] # add data where there was an original timestamp
newdata4$Rn_meas_Avg[is.na(newdata4$Rn_meas_Avg)] <- -9999 # replace NAs with -9999
newdata4$PYRA_CMP3_Avg <- missing4$PYRA_CMP3_Avg[pmatch(newdata4$timestamp_corr, missing4$timestamp)] # add data where there was an original timestamp
newdata4$PYRA_CMP3_Avg[is.na(newdata4$PYRA_CMP3_Avg)] <- -9999 # replace NAs with -9999
newdata4$G_1_Avg <- missing4$G_1_Avg[pmatch(newdata4$timestamp_corr, missing4$timestamp)] # add data where there was an original timestamp
newdata4$G_1_Avg[is.na(newdata4$G_1_Avg)] <- -9999 # replace NAs with -9999
newdata4$G_2_Avg <- missing4$G_2_Avg[pmatch(newdata4$timestamp_corr, missing4$timestamp)] # add data where there was an original timestamp
newdata4$G_2_Avg[is.na(newdata4$G_2_Avg)] <- -9999 # replace NAs with -9999
newdata4$G_3_Avg <- missing4$G_3_Avg[pmatch(newdata4$timestamp_corr, missing4$timestamp)] # add data where there was an original timestamp
newdata4$G_3_Avg[is.na(newdata4$G_3_Avg)] <- -9999 # replace NAs with -9999
newdata4$G_4_Avg <- missing4$G_4_Avg[pmatch(newdata4$timestamp_corr, missing4$timestamp)] # add data where there was an original timestamp
newdata4$G_4_Avg[is.na(newdata4$G_4_Avg)] <- -9999 # replace NAs with -9999

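This is not sustainable, because I have to do the same for other sites and other years, each with different column headers. Ideally, I would like R to read the first line of the CSV file to determine how many columns there are, and then use pmatch to add each column back after the new time series has been built.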
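
A minimal sketch of that idea, under some assumptions: the objects below are toy stand-ins with made-up values, and match() is swapped in for pmatch(), because pmatch() does partial string matching while match() looks for exact matches. The column names are taken from the data itself, so nothing is hard-coded per site or year.

```r
# Toy stand-ins for the real objects (values are made up for illustration).
timestamp_corr <- seq(as.POSIXct("2013-01-01 00:00", tz = "UTC"),
                      by = "30 min", length.out = 6)
missing4 <- data.frame(
  timestamp   = timestamp_corr[c(1, 2, 5)],   # three rows present, three are gaps
  PAR_in_Avg  = c(10.5, 11.0, 12.5),
  PAR_out_Avg = c(1.5, 1.0, 2.5)
)

newdata4 <- data.frame(timestamp_corr = timestamp_corr)

# Row of missing4 that belongs to each regular timestamp (NA where there is a gap).
idx <- match(newdata4$timestamp_corr, missing4$timestamp)

# Every column except the timestamp, whatever the file happens to call them.
data_cols <- setdiff(names(missing4), "timestamp")

# Copy all data columns in one step, then flag the gaps with -9999.
newdata4[data_cols] <- missing4[idx, data_cols, drop = FALSE]
newdata4[data_cols][is.na(newdata4[data_cols])] <- -9999

newdata4
```

The index vector is computed once and reused for every column, which replaces the pair of lines per column in the code above.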

I was able to merge the newdata4 data.frame with the original missing4 data.frame, but doing so removes all of the rows that were just created for the gaps.
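
One likely cause, assuming merge() was called with its defaults: merge() keeps only rows whose key appears in both frames unless all, all.x or all.y is set. The sketch below uses made-up toy data to show the difference.

```r
# Toy stand-ins again (made-up values).
timestamp_corr <- seq(as.POSIXct("2013-01-01 00:00", tz = "UTC"),
                      by = "30 min", length.out = 6)
newdata4 <- data.frame(timestamp_corr = timestamp_corr)
missing4 <- data.frame(timestamp  = timestamp_corr[c(1, 2, 5)],
                       PAR_in_Avg = c(10.5, 11.0, 12.5))

# Default (all = FALSE): only timestamps present in both frames survive.
inner <- merge(newdata4, missing4,
               by.x = "timestamp_corr", by.y = "timestamp")

# all.x = TRUE: every row of newdata4 is kept; gap rows get NA.
left <- merge(newdata4, missing4,
              by.x = "timestamp_corr", by.y = "timestamp", all.x = TRUE)

nrow(inner)  # rows for the matched timestamps only
nrow(left)   # one row per regular timestamp
```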

Is there a simple way to put the data back together that does not require all this repetition?

【Comments】:

It would help to show a few rows of the dataset as well.

Tags: r

【Solution 1】:

Try

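# Build a complete half-hourly sequence spanning the observed time range
newdat <- data.frame(timestamp=with(dat, seq(min(timestamp),
                     max(timestamp), by='30 min')))

# Full outer join: rows for the missing timestamps are kept, with NA in the data columns
dat1 <- merge(dat, newdat, by='timestamp', all=TRUE)
# Replace NA with -9999 in every column except the timestamp
indx <- setdiff(colnames(dat1), 'timestamp')
dat1[indx][is.na(dat1[indx])] <- -9999
head(dat1)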

Data

set.seed(42)
dat <- data.frame(timestamp= sort(sample(seq(as.POSIXct('1996-01-01'),
    length.out=50, by='30 min'),30, replace=FALSE)), value1=rnorm(30),
    value2=runif(30))
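
Since the question mentions other sites and years with different column headers, the same steps could be wrapped in a small function. This is only a sketch: fill_gaps() and its arguments are hypothetical names, not part of any package, and it assumes the timestamp column is already POSIXct.

```r
# fill_gaps() is a hypothetical helper name, not a function from any package.
fill_gaps <- function(dat, time_col = "timestamp", by = "30 min", fill = -9999) {
  # Regular grid from the first to the last observed timestamp.
  grid <- data.frame(seq(min(dat[[time_col]]), max(dat[[time_col]]), by = by))
  names(grid) <- time_col

  # Full outer join keeps every grid row; unmatched rows get NA in the data columns.
  out <- merge(dat, grid, by = time_col, all = TRUE)

  # Replace NA in every non-timestamp column, whatever those columns are called.
  data_cols <- setdiff(names(out), time_col)
  out[data_cols][is.na(out[data_cols])] <- fill
  out
}

# Sample data as defined in the answer.
set.seed(42)
dat <- data.frame(timestamp = sort(sample(seq(as.POSIXct("1996-01-01"),
    length.out = 50, by = "30 min"), 30, replace = FALSE)),
    value1 = rnorm(30), value2 = runif(30))

filled <- fill_gaps(dat)
nrow(filled)                                      # grid rows between first and last timestamp
all(diff(as.numeric(filled$timestamp)) == 1800)   # TRUE when rows are 30 minutes apart
```

Note that the grid only spans the observed range. To force a full calendar year, the seq() call would need fixed start and end times instead of min() and max().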
