An efficient way to find the row number of a data frame, unequal condition (answers)

[Question title]: An efficient way to find the row number of a data frame, unequal condition
[Posted]: 2016-01-08 02:49:27
[Question]:

We are studying the delay of a server that can only serve one customer at a time. Suppose we have two data frames: agg_data and ind_data.

> agg_data
  minute service_minute
1      0              1
2     60              3
3    120              2
4    180              3
5    240              2
6    300              4

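agg_data gives, for each hour, the service time between two consecutive customers. For example, between minute 60 and minute 120 (the second hour from the start), we can serve a new customer every 3 minutes, so we can serve 20 customers in total during that hour.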
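As a quick check of that arithmetic, here is a minimal sketch that rebuilds agg_data from the table above and derives the hourly capacity. The column name capacity_per_hour is hypothetical and used only for illustration.

```r
# agg_data as printed in the question
agg_data <- data.frame(
  minute         = c(0, 60, 120, 180, 240, 300),
  service_minute = c(1, 3, 2, 3, 2, 4)
)

# Customers that can be served in each 60-minute block
# (capacity_per_hour is an illustrative name, not part of the original data)
agg_data$capacity_per_hour <- 60 / agg_data$service_minute

# Capacity for the hour starting at minute 60
agg_data$capacity_per_hour[agg_data$minute == 60]
# [1] 20
```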

ind_data gives the arrival minute of each customer:

         Arrival
1             51
2             63
3            120
4            121
5            125
6            129

I need to generate the departure time of each customer, which depends on service_minute in agg_data.

The output should look like this:

         Arrival              Dep
1             51               52
2             63               66
3            120              122
4            121              124
5            125              127
6            129              131

Here is my current code, which is correct but very inefficient:

ind_data$Dep = rep(0, nrow(ind_data))
# After the service time, the first customer can leave the system with no delay
# Service time is taken as that of the hour when the customer arrives
ind_data$Dep[1] = ind_data$Arrival[1] + agg_data[max(which(agg_data$minute<=ind_data$Arrival[1])),'service_minute']

# For customers after the first one, 
# if they arrive when there is no delay (arrival time > departure time of the previous customer), 
# then the service time is that of the hour when they arrive and 
# departure time is arrival time + service time; 
# if they arrive when there is delay (arrival time < departure time of the previous customer), 
# then the service time is that of the hour when the previous customer leaves the system and 
# the departure time is the departure time of the previous customer + service time.

for (i in 2:nrow(ind_data)) {
  ind_data$Dep[i] = max(
    ind_data$Dep[i-1] + agg_data[max(which(agg_data$minute <= ind_data$Dep[i-1])), 'service_minute'],
    ind_data$Arrival[i] + agg_data[max(which(agg_data$minute <= ind_data$Arrival[i])), 'service_minute']
  )
}

I think the step that takes a long time is looking up the appropriate service time in agg_data. Is there a more efficient algorithm?

Thank you.

[Comments]:

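What happens if 20 or more arrivals occur between 60 and 120?
It is the same situation as for the 4th and 5th customers. There will be a queue (delay). Customers are served in order of arrival. Service starts at the later of the customer's arrival time and the previous customer's departure time. The customer's service time is the one in effect when service starts.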
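The rule described in this comment can be sketched as follows, assuming the sample data from the question. service_at() is a hypothetical helper introduced here for illustration. The sketch follows the rule as worded in the comment (start time first, then service time); the loop in the question instead takes the maximum of two candidate departure times. On the sample data both give the same result.

```r
agg_data <- data.frame(minute = c(0, 60, 120, 180, 240, 300),
                       service_minute = c(1, 3, 2, 3, 2, 4))
arrival <- c(51, 63, 120, 121, 125, 129)

# Hypothetical helper: service time in effect at minute t
service_at <- function(t) {
  agg_data$service_minute[max(which(agg_data$minute <= t))]
}

dep <- numeric(length(arrival))
prev_dep <- -Inf                         # nobody is ahead of the first customer
for (i in seq_along(arrival)) {
  start    <- max(arrival[i], prev_dep)  # wait for the previous customer if needed
  dep[i]   <- start + service_at(start)  # service time taken at the start of service
  prev_dep <- dep[i]
}
dep
# [1]  52  66 122 124 127 131
```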

Tags: r search indexing

[Solution 1]:

This should be fairly efficient. It is a very simple lookup problem with an obvious vectorized solution:

out <- data.frame(Arrival = ind_data$Arrival,
         Dep = ind_data$Arrival + agg_data$service_minute[ # need an index to choose min
                              findInterval(ind_data$Arrival, agg_data$minute)] 
 )

> out
  Arrival Dep
1      51  52
2      63  66
3     120 122
4     121 123
5     125 127
6     129 131
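To make the lookup step explicit, here is a small sketch of what findInterval (base R) returns for the sample data. It assumes agg_data$minute is sorted in increasing order, which findInterval requires; the names idx and no_queue_dep are illustrative. The result is the departure time each customer would have if nobody ever had to wait.

```r
agg_data <- data.frame(minute = c(0, 60, 120, 180, 240, 300),
                       service_minute = c(1, 3, 2, 3, 2, 4))
arrival <- c(51, 63, 120, 121, 125, 129)

# Row of agg_data whose interval [minute, next minute) contains each arrival
idx <- findInterval(arrival, agg_data$minute)
idx
# [1] 1 2 3 3 3 3

# Arrival plus the service time of that row: no queueing is taken into account
no_queue_dep <- arrival + agg_data$service_minute[idx]
no_queue_dep
# [1]  52  66 122 123 127 131
```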

I trust my code more than your example. I think there is an obvious error in it.

[Comments]:

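The output is wrong. Customer 4's departure time should be 124, not 123. Customer 3 leaves the system at 122, so service for customer 4 can only start at 122. The service time between 120 and 180 is 2 minutes, so customer 4 leaves at 124. At minute 121 the system cannot serve customer 4 because it is still serving customer 3.
OK, that should be fixable with a call to max (and a for-loop may well be the best strategy), but what about the 5th and 6th customers? And how are you judging efficiency?
The same logic applies to the 5th and 6th customers. The departure times are as shown in the question (I had made a mistake in the departure times of those two customers and have since edited my post. Perhaps that mistake is what confused you in the first place. Sorry.) By efficiency I mean computation time.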
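One possible way to combine the two ideas from this thread is sketched below. This is not code posted by either participant. It does the no-queue lookup once with findInterval, then runs a short loop on plain vectors that applies the same max() as the question's loop. It assumes agg_data$minute is sorted in increasing order, and the names minute, service and queued are illustrative.

```r
agg_data <- data.frame(minute = c(0, 60, 120, 180, 240, 300),
                       service_minute = c(1, 3, 2, 3, 2, 4))
arrival <- c(51, 63, 120, 121, 125, 129)

# Plain vectors avoid repeated data frame indexing inside the loop
minute  <- agg_data$minute
service <- agg_data$service_minute

# Vectorized part: departure time if each customer were served on arrival
dep <- arrival + service[findInterval(arrival, minute)]

# Sequential part: compare with the previous departure plus the service time
# in effect at that departure, and keep the later of the two
for (i in seq_along(dep)[-1]) {
  queued <- dep[i - 1] + service[findInterval(dep[i - 1], minute)]
  if (queued > dep[i]) dep[i] <- queued
}
dep
# [1]  52  66 122 124 127 131
```

Wrapping this and the original loop in system.time() on a larger ind_data would show how the computation times compare; no timings are claimed here.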