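You said the data is in chronological order, so the approach is simply to pull the next record after each of a user's searches. The code below does that: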
library(dplyr)
# assign a proper row index column
desktop$row_index <- seq_len(nrow(desktop))
# keep only the rows whose url is a Google search
data_google <- dplyr::filter(desktop, grepl('\\bgoogle\\.com/search\\b', url, ignore.case = TRUE))
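The rows in data_google correspond to Google search URLs. To get the URL the user visited next (probably one of the results of that Google search), you take from desktop the row with the smallest row_index that comes after that search URL but before the next search URL.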
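# rename so the search columns do not clash with desktop's columns in the merge
# (assumes desktop has exactly the columns url, user_id, row_index, in that order)
names(data_google) <- c("search_url", "user_id", "search_row_index")
# pair every row of a user with every search made by that same user
temp <- merge(desktop, data_google, by = "user_id")
temp <- temp[order(temp$user_id), ]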
# from temp, drop the rows with search_row_index >= row_index, since we only want urls visited AFTER the search
temp <- temp[temp$row_index > temp$search_row_index, ]
# now, for each user and search_row_index, simply take the row with the minimum row_index;
# that is the very next url the user visited after each search
right_after_search_data <- as.data.frame(temp %>%
  group_by(user_id, search_row_index) %>%
  filter(row_index == min(row_index)))
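To check the steps above on something small, here is a minimal sketch that runs the same pipeline on a made-up data frame. The toy URLs, the user ids, and the column order (url, user_id) are assumptions for illustration only, not your real data.

```r
library(dplyr)

# Hypothetical toy data, already in chronological order
desktop <- data.frame(
  url = c("https://www.google.com/search?q=r+merge",
          "https://stackoverflow.com/questions/1",
          "https://www.google.com/search?q=dplyr",
          "https://dplyr.tidyverse.org/",
          "https://www.google.com/search?q=news",
          "https://example.com/news"),
  user_id = c(1, 1, 1, 1, 2, 2),
  stringsAsFactors = FALSE
)
desktop$row_index <- seq_len(nrow(desktop))

# Rows that are Google searches, renamed so they do not clash in the merge
data_google <- dplyr::filter(
  desktop,
  grepl("\\bgoogle\\.com/search\\b", url, ignore.case = TRUE)
)
names(data_google) <- c("search_url", "user_id", "search_row_index")

# Pair each row with every search by the same user, keep only later rows
temp <- merge(desktop, data_google, by = "user_id")
temp <- temp[temp$row_index > temp$search_row_index, ]

# For each search, keep the earliest row that follows it
right_after_search_data <- temp %>%
  group_by(user_id, search_row_index) %>%
  filter(row_index == min(row_index)) %>%
  ungroup() %>%
  arrange(search_row_index) %>%
  as.data.frame()

print(right_after_search_data[, c("user_id", "search_url", "url")])
```

On this toy input each of the three searches is matched with the row directly after it. Note that if a user runs two searches back to back, the row returned for the first search is the second search itself, so you may want to filter search URLs out of the result if you only care about non-search pages.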