Filter rows based on regex pattern and ID

【Question title】: Filter rows based on regex pattern and ID
【Posted】: 2021-02-18 19:54:51
【Question description】:

I have a df like this:

df <- data.frame(
  id = c("A", "A", "B", NA, "A", "B", "B", "B"),
  speech = c("hi", "how are you [Larry]?", "[uh]", "(0.123)", "I'm fine [you 'n Mary] how's it [goin]?", "[erm]", "(0.4)", "well")
)

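I want to extract (1) the rows where speech consists entirely of a single expression in [...], and (2) the same-id rows that follow a row where [...] makes up the whole speech. I know how to filter the rows that consist of [...] only: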

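library(dplyr)
library(data.table)  # for rleid()
# keeps only the rows whose speech is exactly one [...] expression
df %>%
  group_by(grp = rleid(id)) %>%
  filter(grepl("^\\[.*?\\]$", speech))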

But I don't know how to also keep the same-id rows that come after a [...] row. The desired output is this:

df
  id speech
1  B   [uh]
2  B  [erm]
3  B  (0.4)
4  B   well

【Question discussion】:

'hi' is not in [...], and it does not belong to a same-id run of speech whose first element is [...]

Tags: r filter dplyr

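【Solution 1】:

Create a grouping index with rleid as in the OP's code, then filter to keep only the groups in which the first element of 'speech' starts with [, ungroup, and drop the helper column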

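library(dplyr)
library(data.table)  # for rleid()
library(stringr)     # for str_detect()
df %>% 
    # one group per run of consecutive identical ids
    group_by(grp = rleid(id)) %>% 
    # keep a run only if its first speech starts with "["
    filter(str_detect(first(speech), "^\\[")) %>% 
    ungroup() %>%
    select(-grp)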

-output

# A tibble: 4 x 2
#  id    speech
#  <chr> <chr> 
#1 B     [uh]  
#2 B     [erm] 
#3 B     (0.4) 
#4 B     well
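
For readers who want to see the run-based logic without dplyr or data.table, here is a minimal base R sketch of the same idea. It is not part of the original answer; the names grp, first_speech, keep and result are made up for illustration. One assumption to be aware of: base rle() treats every NA as its own run, which may differ from rleid() when several NA ids are adjacent (that does not happen in this df).

```r
# Same data as in the question
df <- data.frame(
  id = c("A", "A", "B", NA, "A", "B", "B", "B"),
  speech = c("hi", "how are you [Larry]?", "[uh]", "(0.123)",
             "I'm fine [you 'n Mary] how's it [goin]?", "[erm]", "(0.4)", "well"),
  stringsAsFactors = FALSE
)

# Run index: one number per stretch of consecutive identical ids
runs <- rle(df$id)
grp <- rep(seq_along(runs$lengths), runs$lengths)

# First speech of each run, repeated for every row of that run
first_speech <- ave(df$speech, grp, FUN = function(x) rep(x[1], length(x)))

# Keep the runs whose first speech starts with "["
keep <- grepl("^\\[", first_speech)
result <- df[keep, ]
print(result)
```

The row names of result keep the original positions (3, 6, 7, 8); rownames(result) <- NULL resets them if that matters.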

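Edit: based on @ChrisRuehlemann's comments

【Discussion】:

It seems df %>% group_by(grp = rleid(id)) %>% filter(str_detect(first(speech), "^\\[")) already does the job!
@ChrisRuehlemann But that also returns the 'A' rows, which you don't want
No. Note the ^ added to the pattern.
@ChrisRuehlemann Yes, you're right. I was thinking of extracting the ( separately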
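
The point about ^ in the discussion can be checked with a small sketch. The vector first_speech below is hand-built for illustration: it holds the first speech of each consecutive-id run in the question's df. Without the anchor, the run starting with "I'm fine [you 'n Mary] ..." would match, because it contains a [ somewhere; with the anchor it does not.

```r
# First speech of each run of consecutive ids in the question's df
first_speech <- c("hi", "[uh]", "(0.123)",
                  "I'm fine [you 'n Mary] how's it [goin]?", "[erm]")

# "[" anywhere in the string
unanchored <- grepl("\\[", first_speech)

# "[" only at the very start of the string
anchored <- grepl("^\\[", first_speech)

print(data.frame(first_speech, unanchored, anchored))
```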