Filter rows after pattern matched with grepl in certain column: answers

【Question title】: Filter rows after pattern matched with grepl in certain column
【Posted】: 2021-08-03 18:48:09
【Question description】:

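I have a dataset (named desktop) containing chronologically ordered information from a web tracker. One column holds the URLs visited by different users and another holds the user ID. With the goal of analysing search engine use, I am trying to filter all rows whose URL shows that the user submitted a search query to Google, which I can do with the following line of code: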

data_google <- dplyr::filter(desktop, grepl('\\bgoogle.com/search\\b', desktop$url, ignore.case = T))

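This works fine. However, I am interested not only in the URLs that contain a search query, but also in the web pages users visit after submitting the query. In other words, the links on the Google results page that the user actually clicked.

Is it possible to filter not only the rows whose url matches the pattern, but also the row that comes after each of them?

Any help would be much appreciated, thanks.

【Question discussion】:

Tags: r filter dplyr grepl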

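【Solution 1】:

Take the iris dataset as an example. I will pull out all rows whose species contains "set" and then get the row that follows each of them. This is a very simple example, but it should achieve what you are after in your case.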

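# Row indices where Species contains "set"
vec1 <- which(grepl("set", iris$Species))

# Indices of the rows that follow each match (drop any index past the last row)
vec2 <- vec1 + 1
vec2 <- vec2[vec2 <= nrow(iris)]
vec3 <- sort(unique(c(vec1, vec2)))

iris[vec3, ]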
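The same index arithmetic can be tried on URL data. The following is only a minimal sketch: the toy `desktop` data frame, its URLs, and the names `hits`, `nxt` and `keep` are made up for illustration and are not the asker's real data.

```r
# Hypothetical toy data standing in for the asker's `desktop` data set
desktop <- data.frame(
  user_id = c(1, 1, 1, 2, 2),
  url = c("https://www.google.com/search?q=r+filter",
          "https://stackoverflow.com/questions/1",
          "https://example.com/news",
          "https://www.google.com/search?q=dplyr",
          "https://dplyr.tidyverse.org/"),
  stringsAsFactors = FALSE
)

# Rows whose URL is a Google search
hits <- which(grepl("\\bgoogle\\.com/search\\b", desktop$url, ignore.case = TRUE))

# The row following each hit, dropping any index past the end of the data
nxt <- hits + 1
nxt <- nxt[nxt <= nrow(desktop)]

# Matched rows plus their successors, in the original row order
keep <- sort(unique(c(hits, nxt)))
result <- desktop[keep, ]
```

Note that this version ignores user boundaries: if a search is the last row recorded for one user, the "next" row belongs to a different user.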

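Edit: if you need this within groups, use the solution below. I take the diamonds dataset, sort it to simulate your ordering, group by cut, and flag the rows where color contains "E". You can then use lag on that first flag variable to get the row after each match, and lag respects group_by().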

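library(dplyr)
library(ggplot2)  # provides the diamonds dataset

diamonds2 <- diamonds %>%
  arrange(cut) %>%
  group_by(cut) %>%
  mutate(
    fl  = ifelse(grepl("E", color), 1, 0),  # 1 where color matches the pattern
    fl2 = lag(fl, default = 0)              # 1 where the previous row in the group matched
  ) %>%
  filter(fl == 1 | fl2 == 1)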
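Applied to the question, the grouping variable would presumably be the user ID rather than cut. A minimal sketch under that assumption follows; the toy `desktop` data and the column names `is_search` and `after_search` are hypothetical.

```r
library(dplyr)

# Hypothetical toy data; user 1's last recorded row is a search
desktop <- data.frame(
  user_id = c(1, 1, 2, 2, 2),
  url = c("https://example.com/a",
          "https://www.google.com/search?q=r",
          "https://example.com/b",
          "https://www.google.com/search?q=dplyr",
          "https://dplyr.tidyverse.org/"),
  stringsAsFactors = FALSE
)

result <- desktop %>%
  group_by(user_id) %>%
  mutate(
    # TRUE where the URL is a Google search
    is_search    = grepl("\\bgoogle\\.com/search\\b", url, ignore.case = TRUE),
    # TRUE where the previous row of the same user was a search
    after_search = lag(is_search, default = FALSE)
  ) %>%
  filter(is_search | after_search) %>%
  ungroup()
```

Because lag() is evaluated per group, the first row of user 2 is not treated as following user 1's final search.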

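【Discussion】:

【Solution 2】:

You said the information is in chronological order, so one way to do this is simply to extract the next record after each of a user's searches. The code below does that.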

# Assign a row index column that records the chronological order
desktop$row_index <- seq_len(nrow(desktop))
data_google <- dplyr::filter(desktop, grepl('\\bgoogle\\.com/search\\b', url, ignore.case = TRUE))

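The rows in data_google correspond to Google search URLs. To get the URL the user visited next (probably a result from the Google search), you essentially take from desktop the row with the smallest row_index that comes after that search URL but before the next search URL.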

library(dplyr)

# Rename the columns so they do not clash after the merge
# (assumes desktop has exactly the columns url, user_id, row_index, in that order)
names(data_google) <- c("search_url", "user_id", "search_row_index")
temp <- merge(desktop, data_google, by = "user_id")
temp <- temp[order(temp$user_id), ]
# From temp, keep only rows with row_index > search_row_index, since we are interested in the URL AFTER the search
temp <- temp[temp$row_index > temp$search_row_index, ]
# Now, for each user and search_row_index, take the row with the minimum row_index:
# that is the very next URL visited after each search by that user
right_after_search_data <- as.data.frame(temp %>%
                                         group_by(user_id, search_row_index) %>%
                                         filter(row_index == min(row_index)))
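To see what these steps produce, here is a small self-contained run on made-up data. It is a sketch only: the toy `desktop` rows and the names `searches` and `right_after` are hypothetical, and it assumes the columns are url, user_id and row_index, in that order.

```r
library(dplyr)

# Hypothetical toy data: user 1 searches twice, user 2 once
desktop <- data.frame(
  url = c("https://www.google.com/search?q=r+filter",
          "https://stackoverflow.com/questions/1",
          "https://www.google.com/search?q=grepl",
          "https://example.com/grepl-guide",
          "https://www.google.com/search?q=dplyr",
          "https://dplyr.tidyverse.org/"),
  user_id = c(1, 1, 1, 1, 2, 2),
  stringsAsFactors = FALSE
)
desktop$row_index <- seq_len(nrow(desktop))

# Search rows, renamed so the columns do not clash after the merge
searches <- desktop[grepl("\\bgoogle\\.com/search\\b", desktop$url, ignore.case = TRUE), ]
names(searches) <- c("search_url", "user_id", "search_row_index")

# Pair every row of a user with every search of that user,
# keep rows visited after the search, then take the earliest one per search
right_after <- merge(desktop, searches, by = "user_id") %>%
  filter(row_index > search_row_index) %>%
  group_by(user_id, search_row_index) %>%
  filter(row_index == min(row_index)) %>%
  ungroup()
```

Each search_url in right_after is paired with the url at the next row_index for the same user, giving one row per search that has a following record.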

【Discussion】: