【发布时间】:2015-12-23 22:01:19
【问题描述】:
继我昨天提出的问题here 之后,我正在尝试设计一个循环,该循环将根据第二个数据集df2 中匹配日期、时间和ID 的唯一组合,对数据df1 中的事件进行子集化.每次迭代的输出将是多行,并且每次迭代将具有不同的行数,或者可能为空。最后,我需要将所有迭代输出合并到 1 个数据框中,显示每个日期的每个事件的日期、时间和 ID 号。分配一个空矩阵并运行一个常规的 FOR 循环或嵌套循环并没有让我到任何地方。我不知道我是否需要从不同类型的结构开始,或者我的尺寸是否错误。也许有一个更简单的方法。
这是数据结构的示例(尽管原始数据要长得多)。
dput(df1)
structure(list(Date = c("12-31-2008", "12-31-2008", "12-31-2008",
"12-31-2008", "12-31-2008", "12-31-2008", "01-01-2009", "01-01-2009",
"01-01-2009", "01-01-2009", "01-10-2009", "01-10-2009", "01-10-2009",
"01-10-2009", "01-10-2009", "01-10-2009", "01-10-2009", "01-10-2009",
"01-10-2009", "01-10-2009", "01-10-2009", "01-10-2009", "01-10-2009",
"01-10-2009", "01-10-2009", "01-10-2009", "01-10-2009", "01-10-2009",
"01-10-2009", "01-11-2009", "01-11-2009", "01-17-2009", "01-17-2009",
"01-18-2009", "01-18-2009", "01-18-2009", "01-18-2009", "01-18-2009",
"01-18-2009", "01-18-2009", "01-18-2009", "01-18-2009", "01-18-2009",
"01-18-2009", "01-18-2009", "01-19-2009", "01-19-2009", "01-19-2009",
"01-19-2009", "01-19-2009"), IDNum = c("534198", "534198", "534198",
"534198", "534198", "534198", "534198", "534198", "534198", "534198",
"534198", "534198", "534198", "534198", "534198", "534198", "534198",
"534198", "534198", "534198", "534198", "534198", "534198", "534198",
"534198", "534198", "534198", "534198", "534198", "534198", "534198",
"534198", "534198", "534198", "534198", "534198", "534198", "534198",
"534198", "534198", "534198", "534198", "534198", "534198", "534198",
"534198", "534198", "534198", "534198", "534198"), Time = c("19:01",
"19:53", "20:55", "22:03", "23:04", "23:55", "00:45", "01:48",
"02:50", "03:50", "02:35", "03:42", "04:49", "05:53", "06:55",
"07:55", "08:43", "10:23", "10:31", "11:41", "15:27", "16:33",
"17:41", "18:46", "19:46", "20:48", "21:48", "22:48", "23:48",
"01:49", "02:49", "21:49", "22:49", "12:04", "13:04", "15:05",
"16:05", "17:05", "18:07", "18:49", "19:49", "20:49", "21:49",
"22:50", "23:50", "00:50", "01:50", "03:02", "04:22", "05:25"
)), .Names = c("Date", "IDNum", "Time"), row.names = 8643:8692, class = "data.frame")
dput(df2)
structure(list(Date = c("01-04-2009", "01-05-2009", "01-05-2009",
"01-06-2009", "01-06-2009", "01-07-2009", "01-07-2009", "01-08-2009",
"01-08-2009", "01-09-2009", "01-09-2009", "01-10-2009", "01-11-2009",
"01-12-2009", "01-12-2009", "01-13-2009", "01-14-2009", "01-14-2009",
"01-21-2009", "01-21-2009", "01-22-2009", "01-22-2009", "01-23-2009",
"01-23-2009", "01-24-2009", "01-24-2009", "01-25-2009", "01-25-2009",
"01-26-2009", "01-26-2009", "01-27-2009", "01-28-2009", "01-28-2009",
"01-28-2009", "01-28-2009", "01-29-2009", "01-29-2009", "01-29-2009",
"01-29-2009", "02-05-2009", "02-05-2009", "02-05-2009", "02-06-2009",
"02-06-2009", "02-06-2009", "02-07-2009", "02-07-2009", "02-07-2009",
"02-08-2009", "02-08-2009"), IDNum = c("599091", "599091", "599091",
"599091", "599091", "599091", "599091", "599091", "599091", "599091",
"599091", "599091", "599091", "599091", "599091", "599091", "599091",
"599091", "534198", "534198", "534198", "534198", "534198", "534198",
"534198", "534198", "534198", "534198", "534198", "534198", "534198",
"697345", "697345", "534198", "534198", "697345", "697345", "697345",
"534198", "697345", "697345", "697345", "697345", "697345", "697345",
"697345", "697345", "697345", "697345", "697345"), Trip = c("GL0229",
"GL0229", "GL0229", "GL0229", "GL0229", "GL0229", "GL0229", "GL0229",
"GL0229", "GL0229", "GL0229", "GL0229", "GL0229", "GL0229", "GL0229",
"GL0229", "GL0229", "GL0229", "GL0230", "GL0230", "GL0230", "GL0230",
"GL0230", "GL0230", "GL0230", "GL0230", "GL0230", "GL0230", "GL0230",
"GL0230", "GL0230", "GL0233", "GL0233", "GL0230", "GL0230", "GL0233",
"GL0233", "GL0233", "GL0230", "GL0234", "GL0234", "GL0234", "GL0234",
"GL0234", "GL0234", "GL0234", "GL0234", "GL0234", "GL0234", "GL0234"
), Replicate = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L,
12L, 13L, 14L, 15L, 16L, 17L, 18L, 1L, 2L, 3L, 4L, 5L, 6L, 7L,
8L, 9L, 10L, 11L, 12L, 13L, 1L, 2L, 14L, 15L, 3L, 4L, 5L, 16L,
1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L), Start = c("12:00",
"08:35", "15:33", "08:30", "15:51", "10:02", "23:04", "11:17",
"21:31", "11:16", "20:07", "11:28", "07:37", "08:40", "16:32",
"09:14", "08:04", "15:15", "07:16", "16:17", "07:10", "16:40",
"07:00", "16:25", "07:17", "16:50", "07:20", "16:18", "07:20",
"15:40", "07:10", "09:34", "11:07", "07:55", "16:38", "07:01",
"08:26", "14:47", "07:18", "07:47", "09:17", "14:58", "07:48",
"08:59", "14:53", "07:30", "09:12", "13:47", "08:56", "09:53"
), End = c("17:21", "15:08", "22:44", "15:12", "09:06", "19:16",
"10:28", "20:12", "10:14", "18:48", "10:53", "20:23", "14:07",
"15:02", "22:27", "18:03", "15:07", "21:19", "16:04", "22:04",
"16:31", "23:01", "16:15", "22:07", "16:33", "22:37", "16:05",
"22:17", "15:22", "22:31", "16:05", "16:41", "19:01", "16:20",
"21:56", "14:31", "19:46", "00:30", "15:10", "14:21", "19:27",
"23:45", "14:31", "19:20", "23:05", "14:51", "20:15", "00:17",
"14:31", "18:07")), .Names = c("Date", "IDNum", "Trip", "Replicate",
"Start", "End"), row.names = 506:555, class = "data.frame")
首先,我找到了两个数据集之间匹配的日期,并创建了一个新变量records 以根据匹配日期显示来自df2 的信息。在这个例子中,我只是使用第二个匹配日期:
match_dates <- as.character(intersect(df1$Date, df2$Date))
records <- df2[which(df2$Date == match_dates[2]),]
print(records)
Date IDNum Trip Replicate Start End
518 01-11-2009 599091 GL0229 13 07:37 14:07
在更大的原始数据集中,records 最终会更像这样:
records <- df2[which(df2$Date == match_dates[25]),]
print(records)
# Date IDNum Trip Replicate Start End
# 659 04-02-2009 507646 GL0247 10 09:43 05:19
# 660 04-02-2009 680845 GL0249 4 05:37 11:29
# 661 04-02-2009 680845 GL0249 5 11:59 16:47
records 的每次迭代感兴趣的事件随后被定义为 df1 在 Start 和 End 之间的时间,就像这样(我这样做是为了保留日期时间的唯一组合-ID-复制):
event1 <- subset(df1, Date==records[1,"Date"] & IDNum==records[1,"IDNum"] & Time >= records[1,"Start"] & Time <= records[1,"End"])
event2 <- subset(df1, Date==records[2,"Date"] & IDNum==records[2,"IDNum"] & Time >= records[2,"Start"] & Time <= records[2,"End"])
event3 <- subset(df1, Date==records[3,"Date"] & IDNum==records[3,"IDNum"] & Time >= records[3,"Start"] & Time <= records[3,"End"])
每个事件的结果如下所示:
print(event1) #This result is empty
[1] NewRecNum Date IDNum Time Speed
<0 rows> (or 0-length row.names)
print(event2)
Date IDNum Time
80620 04-02-2009 680845 06:35
80621 04-02-2009 680845 07:35
80622 04-02-2009 680845 08:35
80623 04-02-2009 680845 09:35
80624 04-02-2009 680845 10:35
print(event3)
Date IDNum Time
80626 04-02-2009 680845 12:35
80627 04-02-2009 680845 13:35
80628 04-02-2009 680845 14:35
80629 04-02-2009 680845 15:35
80630 04-02-2009 680845 16:35
我的目标是创建一个循环,该循环将从match_dates(在本例中为147)获取匹配日期的每个实例,从df2创建147个对应的records,然后使用日期、IDNum、开始,和每个records 中的结束时间到子集df1 并输出df1 事件。到目前为止我所拥有的(那是行不通的):
records <- matrix(ncol=6, nrow=nrow(df1)) # Create an empty matrix to start
event=NULL
for (i in 1:length(match_dates))
{ records[i] <- df2[which(df2$Date == match_dates[i]), ]
for (j in 1:nrow(records[i]))
{ event[j] <- subset(df1, Date==records[i,"Date"] & IDNum==records[i,"IDNum"] & Time >= records[i,"Start"] & Time <= records[i,"End"])
}
}
print(event)
Error in 1:nrow(records[i]) : argument of length 0
In addition: Warning message:
In records[i] <- df2[which(df2$Date == match_dates[i]), ] :
number of items to replace is not a multiple of replacement length
> print(event)
NULL
提前感谢您的帮助!我正在为此撞墙。
编辑/更新:
我把records改成了
records <- subset(df2, Date %in% df1$Date)
然后编写一个函数,将df1中的匹配行子集化为
event_func <- function(df,records,i){
event_int <- subset(df, Date==records[i,"Date"] & IDNum==records[i,"IDNum"] & Time >= records[i,"Start"] & Time <= records[i,"End"])
return(event_int)
}
此功能有效,并输出我需要的内容。但是我仍然无法处理一个循环,该循环将获取records 的 686 行,将它们与df1 匹配,并输出所有匹配的df1 行的最终数据帧。我也尝试使用lapply 这是我所拥有的(两者都不起作用):
# First option using a loop
final <- data.frame()
event_int <- data.frame()
for (i in 1:nrow(records)) {
event_int[i] <- event_func(df1, records,i)
final <- rbind(event_int, event_int[i])
}
# Second option using lapply
lapply(records, event_func(df1,records,1:nrow(records)))
再次感谢您的帮助!
【问题讨论】: