R loop to subset a large data frame and give multiple-row output

【Question title】: R Loop to subset large data frame and give multiple row output
【Posted】: 2015-12-23 22:01:19
【Question】:

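Following up on the question I asked yesterday here, I am trying to design a loop that subsets the events in df1 according to the unique combinations of date, time, and ID matched in a second data set, df2. The output of each iteration will span multiple rows, with a different number of rows each time, and it may be empty. In the end I need to combine the output of all iterations into one data frame showing the date, time, and ID number of every event on every date. Allocating an empty matrix and running a regular for loop or a nested loop has not gotten me anywhere. I don't know whether I need to start from a different type of structure or whether my dimensions are wrong. Perhaps there is a simpler way.

Here is an example of the data structure (the original data are much longer).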

dput(df1)
structure(list(Date = c("12-31-2008", "12-31-2008", "12-31-2008", 
"12-31-2008", "12-31-2008", "12-31-2008", "01-01-2009", "01-01-2009", 
"01-01-2009", "01-01-2009", "01-10-2009", "01-10-2009", "01-10-2009", 
"01-10-2009", "01-10-2009", "01-10-2009", "01-10-2009", "01-10-2009", 
"01-10-2009", "01-10-2009", "01-10-2009", "01-10-2009", "01-10-2009", 
"01-10-2009", "01-10-2009", "01-10-2009", "01-10-2009", "01-10-2009", 
"01-10-2009", "01-11-2009", "01-11-2009", "01-17-2009", "01-17-2009", 
"01-18-2009", "01-18-2009", "01-18-2009", "01-18-2009", "01-18-2009", 
"01-18-2009", "01-18-2009", "01-18-2009", "01-18-2009", "01-18-2009", 
"01-18-2009", "01-18-2009", "01-19-2009", "01-19-2009", "01-19-2009", 
"01-19-2009", "01-19-2009"), IDNum = c("534198", "534198", "534198", 
"534198", "534198", "534198", "534198", "534198", "534198", "534198", 
"534198", "534198", "534198", "534198", "534198", "534198", "534198", 
"534198", "534198", "534198", "534198", "534198", "534198", "534198", 
"534198", "534198", "534198", "534198", "534198", "534198", "534198", 
"534198", "534198", "534198", "534198", "534198", "534198", "534198", 
"534198", "534198", "534198", "534198", "534198", "534198", "534198", 
"534198", "534198", "534198", "534198", "534198"), Time = c("19:01", 
"19:53", "20:55", "22:03", "23:04", "23:55", "00:45", "01:48", 
"02:50", "03:50", "02:35", "03:42", "04:49", "05:53", "06:55", 
"07:55", "08:43", "10:23", "10:31", "11:41", "15:27", "16:33", 
"17:41", "18:46", "19:46", "20:48", "21:48", "22:48", "23:48", 
"01:49", "02:49", "21:49", "22:49", "12:04", "13:04", "15:05", 
"16:05", "17:05", "18:07", "18:49", "19:49", "20:49", "21:49", 
"22:50", "23:50", "00:50", "01:50", "03:02", "04:22", "05:25"
)), .Names = c("Date", "IDNum", "Time"), row.names = 8643:8692, class = "data.frame")

dput(df2)
structure(list(Date = c("01-04-2009", "01-05-2009", "01-05-2009", 
"01-06-2009", "01-06-2009", "01-07-2009", "01-07-2009", "01-08-2009", 
"01-08-2009", "01-09-2009", "01-09-2009", "01-10-2009", "01-11-2009", 
"01-12-2009", "01-12-2009", "01-13-2009", "01-14-2009", "01-14-2009", 
"01-21-2009", "01-21-2009", "01-22-2009", "01-22-2009", "01-23-2009", 
"01-23-2009", "01-24-2009", "01-24-2009", "01-25-2009", "01-25-2009", 
"01-26-2009", "01-26-2009", "01-27-2009", "01-28-2009", "01-28-2009", 
"01-28-2009", "01-28-2009", "01-29-2009", "01-29-2009", "01-29-2009", 
"01-29-2009", "02-05-2009", "02-05-2009", "02-05-2009", "02-06-2009", 
"02-06-2009", "02-06-2009", "02-07-2009", "02-07-2009", "02-07-2009", 
"02-08-2009", "02-08-2009"), IDNum = c("599091", "599091", "599091", 
"599091", "599091", "599091", "599091", "599091", "599091", "599091", 
"599091", "599091", "599091", "599091", "599091", "599091", "599091", 
"599091", "534198", "534198", "534198", "534198", "534198", "534198", 
"534198", "534198", "534198", "534198", "534198", "534198", "534198", 
"697345", "697345", "534198", "534198", "697345", "697345", "697345", 
"534198", "697345", "697345", "697345", "697345", "697345", "697345", 
"697345", "697345", "697345", "697345", "697345"), Trip = c("GL0229", 
"GL0229", "GL0229", "GL0229", "GL0229", "GL0229", "GL0229", "GL0229", 
"GL0229", "GL0229", "GL0229", "GL0229", "GL0229", "GL0229", "GL0229", 
"GL0229", "GL0229", "GL0229", "GL0230", "GL0230", "GL0230", "GL0230", 
"GL0230", "GL0230", "GL0230", "GL0230", "GL0230", "GL0230", "GL0230", 
"GL0230", "GL0230", "GL0233", "GL0233", "GL0230", "GL0230", "GL0233", 
"GL0233", "GL0233", "GL0230", "GL0234", "GL0234", "GL0234", "GL0234", 
"GL0234", "GL0234", "GL0234", "GL0234", "GL0234", "GL0234", "GL0234"
), Replicate = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 
12L, 13L, 14L, 15L, 16L, 17L, 18L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 
8L, 9L, 10L, 11L, 12L, 13L, 1L, 2L, 14L, 15L, 3L, 4L, 5L, 16L, 
1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L), Start = c("12:00", 
"08:35", "15:33", "08:30", "15:51", "10:02", "23:04", "11:17", 
"21:31", "11:16", "20:07", "11:28", "07:37", "08:40", "16:32", 
"09:14", "08:04", "15:15", "07:16", "16:17", "07:10", "16:40", 
"07:00", "16:25", "07:17", "16:50", "07:20", "16:18", "07:20", 
"15:40", "07:10", "09:34", "11:07", "07:55", "16:38", "07:01", 
"08:26", "14:47", "07:18", "07:47", "09:17", "14:58", "07:48", 
"08:59", "14:53", "07:30", "09:12", "13:47", "08:56", "09:53"
), End = c("17:21", "15:08", "22:44", "15:12", "09:06", "19:16", 
"10:28", "20:12", "10:14", "18:48", "10:53", "20:23", "14:07", 
"15:02", "22:27", "18:03", "15:07", "21:19", "16:04", "22:04", 
"16:31", "23:01", "16:15", "22:07", "16:33", "22:37", "16:05", 
"22:17", "15:22", "22:31", "16:05", "16:41", "19:01", "16:20", 
"21:56", "14:31", "19:46", "00:30", "15:10", "14:21", "19:27", 
"23:45", "14:31", "19:20", "23:05", "14:51", "20:15", "00:17", 
"14:31", "18:07")), .Names = c("Date", "IDNum", "Trip", "Replicate", 
"Start", "End"), row.names = 506:555, class = "data.frame")

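First, I found the dates that match between the two data sets and created a new variable, records, that shows the information from df2 for a matching date. In this example I am just using the second matching date: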

match_dates <- as.character(intersect(df1$Date, df2$Date))
records <- df2[which(df2$Date == match_dates[2]),]
print(records)

          Date  IDNum   Trip Replicate Start   End
518 01-11-2009 599091 GL0229        13 07:37 14:07

In the larger original data set, records ends up looking more like this:

records <- df2[which(df2$Date == match_dates[25]),]
print(records)
#           Date  IDNum   Trip Replicate Start   End
# 659 04-02-2009 507646 GL0247        10 09:43 05:19
# 660 04-02-2009 680845 GL0249         4 05:37 11:29
# 661 04-02-2009 680845 GL0249         5 11:59 16:47

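For each iteration of records, the events of interest are then defined as the rows of df1 whose Time falls between Start and End, like so (I do it this way to preserve the unique Date-Time-ID-Replicate combinations):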

event1 <- subset(df1, Date==records[1,"Date"] & IDNum==records[1,"IDNum"] & Time >= records[1,"Start"] & Time <= records[1,"End"])
event2 <- subset(df1, Date==records[2,"Date"] & IDNum==records[2,"IDNum"] & Time >= records[2,"Start"] & Time <= records[2,"End"])
event3 <- subset(df1, Date==records[3,"Date"] & IDNum==records[3,"IDNum"] & Time >= records[3,"Start"] & Time <= records[3,"End"])

The result for each event looks like this:

print(event1) #This result is empty
    [1] NewRecNum Date      IDNum     Time      Speed    
    <0 rows> (or 0-length row.names)

print(event2)
            Date  IDNum  Time
80620 04-02-2009 680845 06:35
80621 04-02-2009 680845 07:35
80622 04-02-2009 680845 08:35
80623 04-02-2009 680845 09:35
80624 04-02-2009 680845 10:35

print(event3)
                    Date  IDNum  Time
        80626 04-02-2009 680845 12:35
        80627 04-02-2009 680845 13:35
        80628 04-02-2009 680845 14:35
        80629 04-02-2009 680845 15:35
        80630 04-02-2009 680845 16:35
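
A note on the comparisons above: Time, Start, and End are character strings, and the subsets work because zero-padded "HH:MM" strings sort in clock order. One caveat that shows up in df2 is that some rows have End earlier than Start (for example 15:51 to 09:06). If those windows cross midnight, which is only an assumption here, a plain `>=`/`<=` test returns nothing for them. A minimal sketch with a hypothetical helper, in_window (not part of the original post):

```r
# Zero-padded "HH:MM" strings compare in clock order,
# so character comparison is enough within a single day.
times <- c("06:35", "10:35", "12:35", "23:50")

# Hypothetical helper: TRUE where a time lies inside [start, end].
# A window with end < start is treated as wrapping past midnight (assumption).
in_window <- function(time, start, end) {
  if (start <= end) {
    time >= start & time <= end
  } else {
    time >= start | time <= end
  }
}

in_window(times, "05:37", "11:29")   # ordinary window
in_window(times, "15:51", "09:06")   # wrapping window
```

If the windows really do wrap, the part after midnight would also fall on the next calendar date, so the Date equality test would need the same kind of adjustment.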

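My goal is to create a loop that takes each matching date from match_dates (147 in this case), builds the 147 corresponding records from df2, and then uses the Date, IDNum, Start, and End values in each records to subset df1 and output the df1 events. What I have so far (which does not work):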

records <- matrix(ncol=6, nrow=nrow(df1)) # Create an empty matrix to start
event=NULL
for (i in 1:length(match_dates)) 
    { records[i] <- df2[which(df2$Date == match_dates[i]), ]

    for (j in 1:nrow(records[i]))
    { event[j] <- subset(df1, Date==records[i,"Date"] & IDNum==records[i,"IDNum"] & Time >= records[i,"Start"] & Time <= records[i,"End"])
      }
}
print(event)

Error in 1:nrow(records[i]) : argument of length 0
In addition: Warning message:
In records[i] <- df2[which(df2$Date == match_dates[i]), ] :
  number of items to replace is not a multiple of replacement length
> print(event)
NULL

Thanks in advance for your help! I am banging my head against the wall on this one.

Edit/Update:

I changed records to

records <- subset(df2, Date %in% df1$Date)

and then wrote a function that subsets the matching rows of df1:

event_func <- function(df,records,i){
  event_int <- subset(df, Date==records[i,"Date"] & IDNum==records[i,"IDNum"] & Time >= records[i,"Start"] & Time <= records[i,"End"])
  return(event_int)
}

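This function works and outputs what I need. However, I still cannot get a loop to work that takes the 686 rows of records, matches them against df1, and outputs a final data frame of all the matching df1 rows. I also tried lapply. This is what I have (neither works):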

# First option using a loop
final <- data.frame()
event_int <- data.frame()

for (i in 1:nrow(records)) {
  event_int[i] <- event_func(df1, records,i)
  final <- rbind(event_int, event_int[i])
}

# Second option using lapply
lapply(records, event_func(df1,records,1:nrow(records)))

Thanks again for your help!

【Comments on the question】:

Tags: r loops subset

【Solution 1】:

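There are a few problems here.

records[i] is not right; to assign to a row you need records[i,].
df2[which(df2$Date == match_dates[i]),] is not guaranteed to have any particular size, and by assigning it to records[i,] inside the loop you are making an assumption about its size. You could assign it to an intermediate value and use another loop to put it into records or, better, use the rbind function on each iteration of the loop, which removes the need to preallocate records.
Trying to assign a data.frame (df2) to a matrix (records) without any conversion is asking for trouble. records should be a data.frame here.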
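
To make the rbind suggestion concrete, here is a minimal sketch. The data are made-up stand-ins that reuse the column names from the question, and starting from the empty data frame df2[0, ] is just one way to begin the accumulation:

```r
# Toy stand-ins for df1 and df2 (made-up values, same column names).
df1 <- data.frame(Date  = c("01-10-2009", "01-10-2009", "01-11-2009"),
                  IDNum = c("534198", "534198", "534198"),
                  Time  = c("02:35", "12:00", "01:49"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(Date  = c("01-10-2009", "01-11-2009", "01-12-2009"),
                  IDNum = c("534198", "534198", "599091"),
                  Start = c("11:28", "01:00", "08:40"),
                  End   = c("20:23", "02:00", "15:02"),
                  stringsAsFactors = FALSE)

match_dates <- intersect(df1$Date, df2$Date)

# Start from a zero-row data frame with df2's columns and grow it with rbind,
# so no size has to be guessed in advance.
records <- df2[0, ]
for (i in seq_along(match_dates)) {
  piece   <- df2[df2$Date == match_dates[i], ]   # may have 0, 1, or many rows
  records <- rbind(records, piece)
}
records
```

Growing a data frame with rbind inside a loop copies it on every pass, which is fine for a few hundred iterations but can get slow on much larger inputs.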

A simpler way is to use the match() function through its %in% interface:

records <- subset(df2,Date %in% df1$Date)
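
For anyone unsure how %in% relates to match(), a small illustration with made-up date vectors (x and lookup are names chosen just for this example):

```r
x      <- c("01-10-2009", "01-11-2009", "01-17-2009")
lookup <- c("01-11-2009", "01-10-2009", "02-05-2009")

match(x, lookup)    # position of each x in lookup, NA where absent
x %in% lookup       # logical membership test

# %in% gives the same answer as asking whether match() found a position.
identical(x %in% lookup, match(x, lookup, nomatch = 0L) > 0L)
```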

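【Comments】:

Thanks @NGaffney, that is a much better way to build records and it gives me all the values I need. But could you give an example of using a loop to assign an intermediate value and put it into records?
I updated the code in my original question above, but I am still having trouble with the loop and with writing to the final data frame.

【Solution 2】:

Finally got it working! I ended up changing some of my original code and found a very helpful answer on looping in another post here.

1) I first defined records by matching IDs and dates between df1 and df2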

records <- subset(df1, IDNum %in% df2$IDNum)
records <- subset(records, Date %in% df2$Date)

# Records looks like:
head(records,5)
               Date  IDNum  Time    Speed
    8653 01-10-2009 534198 02:35 4.001809
    8654 01-10-2009 534198 03:42 4.117383
    8655 01-10-2009 534198 04:49 4.263277
    8656 01-10-2009 534198 05:53 4.310865
    8657 01-10-2009 534198 06:55 4.353049

# df2 looks like:
head(df2)
          Date  IDNum   Trip Replicate Start   End
506 01-04-2009 599091 GL0229         1 12:00 17:21
507 01-05-2009 599091 GL0229         2 08:35 15:08
508 01-05-2009 599091 GL0229         3 15:33 22:44
509 01-06-2009 599091 GL0229         4 08:30 15:12
510 01-06-2009 599091 GL0229         5 15:51 09:06
511 01-07-2009 599091 GL0229         6 10:02 19:16
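
One thing worth knowing about step 1: the two %in% filters are independent, so they keep any row whose ID appears somewhere in df2 and whose date appears somewhere in df2, not only rows whose ID and date appear together. That is harmless when a later step checks Date and IDNum together again, but if the pairing matters at this stage, a combined key is one option. A sketch on made-up data (loose, strict, key1, and key2 are names invented for the example):

```r
# Toy data (made-up values): two IDs, two dates.
df1 <- data.frame(Date  = c("01-10-2009", "01-11-2009", "01-10-2009"),
                  IDNum = c("534198", "534198", "599091"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(Date  = c("01-10-2009", "01-11-2009"),
                  IDNum = c("599091", "534198"),
                  stringsAsFactors = FALSE)

# Independent filters: ID and date only need to occur somewhere in df2.
loose <- subset(df1, IDNum %in% df2$IDNum & Date %in% df2$Date)

# Pair-wise filter: the (ID, date) combination itself must occur in df2.
key1   <- paste(df1$IDNum, df1$Date)
key2   <- paste(df2$IDNum, df2$Date)
strict <- df1[key1 %in% key2, ]

nrow(loose)    # 3
nrow(strict)   # 2
```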

2) My function for subsetting records, based on the ID, date, and time matched against df2:

event_func <- function(i,...) {
  event_int <- subset(records, Date==df2[i,"Date"] & IDNum==df2[i,"IDNum"] & Time >= df2[i,"Start"] & Time <= df2[i,"End"])
  output <- event_int
  return(output)
}

# For example, subsetting records based on the first row of df2
event_func(1)
            Date  IDNum  Time    Speed
38613 01-04-2009 599091 12:24 1.611527
38614 01-04-2009 599091 15:58 1.545299
38615 01-04-2009 599091 17:02 1.527205

3) I repeated event_func over all 686 rows of df2 and used the foreach package to put the results into a single data frame.

library(foreach)
final.match <- foreach(i = 1:nrow(df2), .combine=rbind) %do% {
  event_func(i)}

The output, final.match, is a single data frame with 4 columns and 1634 rows, which is exactly what I wanted!
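
For readers without the foreach package: since %do% runs the iterations sequentially, the same result can probably be had in base R with lapply plus do.call(rbind, ...), or without any loop by merging on Date and IDNum and filtering on the time window. The sketch below uses made-up stand-ins for records and df2, and event_rows is a hypothetical rewrite of event_func that takes its data as arguments:

```r
# Toy stand-ins for records and df2 (made-up values, same column names).
records <- data.frame(Date  = c("01-04-2009", "01-04-2009", "01-05-2009"),
                      IDNum = c("599091", "599091", "599091"),
                      Time  = c("12:24", "18:00", "09:00"),
                      stringsAsFactors = FALSE)
df2 <- data.frame(Date  = c("01-04-2009", "01-05-2009", "01-05-2009"),
                  IDNum = c("599091", "599091", "599091"),
                  Start = c("12:00", "08:35", "15:33"),
                  End   = c("17:21", "15:08", "22:44"),
                  stringsAsFactors = FALSE)

# Same row-wise subset as event_func, with explicit arguments instead of globals.
event_rows <- function(i, records, df2) {
  records[records$Date == df2$Date[i] & records$IDNum == df2$IDNum[i] &
            records$Time >= df2$Start[i] & records$Time <= df2$End[i], ]
}

# Base R counterpart of foreach(..., .combine = rbind) %do% {...}.
by_loop <- do.call(rbind, lapply(seq_len(nrow(df2)), event_rows,
                                 records = records, df2 = df2))

# Loop-free variant: join on Date and IDNum, then keep times inside the window.
joined   <- merge(records, df2, by = c("Date", "IDNum"))
by_merge <- joined[joined$Time >= joined$Start & joined$Time <= joined$End,
                   c("Date", "IDNum", "Time")]
```

Both versions keep the same rows on this toy input; the merge version may order them differently and can use a lot of memory when many windows share a Date and IDNum.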

【Comments】: