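Extract info inside all parentheses in R

【Question Title】: Extract info inside all parenthesis in R
【Posted】: 2012-01-26 15:31:57
【Question】:

I have a string and want to extract the information inside multiple parentheses. Currently I can extract the information from the last parenthesis with the code below. How can I extract the contents of all the parentheses and return them as a vector?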

j <- "What kind of cheese isn't your cheese? (wonder) Nacho cheese! (groan) (Laugh)"                                                          
sub("\\).*", "", sub(".*\\(", "", j))

Current output:

[1] "Laugh"

Desired output:

[1] "wonder" "groan"  "Laugh"

【Comments】:

Tags: regex r

【Solution 1】:

Here is an example:

> gsub("[\\(\\)]", "", regmatches(j, gregexpr("\\(.*?\\)", j))[[1]])
[1] "wonder" "groan"  "Laugh"

I thought this should work well:

> regmatches(j, gregexpr("(?=\\().*?(?<=\\))", j, perl=T))[[1]]
[1] "(wonder)" "(groan)"  "(Laugh)"

But the result includes the parentheses... why?

This works:

regmatches(j, gregexpr("(?<=\\().*?(?=\\))", j, perl=TRUE))[[1]]

Thanks to @MartinMorgan for the comment.
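As a possible explanation of the difference between the two patterns above, here is a small side-by-side sketch. The variable names with_parens and without_parens are illustrative only. A lookahead at the start asserts that "(" comes next without consuming it, so the lazy match begins at the "(" itself. A lookbehind at the start instead requires "(" to sit just before the match, which leaves it out.

```r
j <- "What kind of cheese isn't your cheese? (wonder) Nacho cheese! (groan) (Laugh)"

# Lookahead for "(" first: the "(" is asserted but not consumed, so .*? starts
# at the "(" and the match includes it. The trailing lookbehind for ")" can only
# succeed once ")" has been consumed, so the ")" is included as well.
with_parens <- regmatches(j, gregexpr("(?=\\().*?(?<=\\))", j, perl = TRUE))[[1]]

# Lookbehind for "(" first: the match starts just after the "(".
# The trailing lookahead for ")" stops the match just before the ")".
without_parens <- regmatches(j, gregexpr("(?<=\\().*?(?=\\))", j, perl = TRUE))[[1]]

print(with_parens)
print(without_parens)
```

Both calls need perl = TRUE, because R's default regex engine does not support lookaround assertions.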

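【Comments】:

Note: this works on a vector, but not on text in a data frame column.
@kohske & @SamFirke: I'm really curious, wouldn't something like regmatches(j, gregexpr("\\(.*?\\)", text = j)) be good enough? Why did you use lookarounds to solve this? Thanks
@AudileF To make it work on a column in a data frame, use unlist(regmatches(...)) before assigning the result back to the data frame column.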
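A minimal sketch of the data frame case raised in these comments. The data frame df and its columns txt and inside are hypothetical names made up for illustration. One caveat, shown here as an assumption about typical data: unlist() flattens the result into a single vector, so it only lines up with the rows when every row has exactly one match. Collapsing each row's matches into one string is one way to keep a value per row.

```r
# Hypothetical data frame; `df`, `txt`, and `inside` are illustrative names.
df <- data.frame(
  txt = c("first (a) then (b)", "only (c)"),
  stringsAsFactors = FALSE
)

# regmatches() returns a list with one element per row,
# each element holding that row's matches.
m <- regmatches(df$txt, gregexpr("(?<=\\().*?(?=\\))", df$txt, perl = TRUE))

# unlist() flattens everything into one vector; the row boundaries are lost
# when rows have different numbers of matches.
all_matches <- unlist(m)

# To keep exactly one value per row, collapse each row's matches into a string.
df$inside <- vapply(m, paste, character(1), collapse = ", ")

print(all_matches)
print(df)
```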

【Solution 2】:

Using the stringr package we can shorten this a little.

library(stringr)
# Get the parenthesis and what is inside
k <- str_extract_all(j, "\\([^()]+\\)")[[1]]
# Remove parenthesis
k <- substring(k, 2, nchar(k)-1)

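@kohske uses regmatches, but I'm currently on R 2.13, so I don't have access to that function at the moment. This adds a dependency on stringr, but I think it is a little easier to work with and the code is a bit clearer (well... as clear as code using regular expressions can be...)

Edit: we could also try something like this: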

re <- "\\(([^()]+)\\)"
gsub(re, "\\1", str_extract_all(j, re)[[1]])

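This works by defining a marked subexpression (a capture group) inside the regular expression. str_extract_all pulls out everything that matches the whole regex, and then gsub replaces each match with only the part inside the subexpression.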
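For readers without stringr, the same capture-group idea can be sketched in base R. This assumes a version of R that has regmatches, which the author's 2.13 did not. The names whole and inside are illustrative only.

```r
j <- "What kind of cheese isn't your cheese? (wonder) Nacho cheese! (groan) (Laugh)"

# Same pattern as above: the inner (...) is the marked subexpression.
re <- "\\(([^()]+)\\)"

# Step 1: extract every full match, parentheses included.
whole <- regmatches(j, gregexpr(re, j))[[1]]

# Step 2: replace each full match with the contents of the first capture group.
inside <- gsub(re, "\\1", whole)

print(whole)
print(inside)
```

Because this pattern uses no lookarounds, perl = TRUE is not needed here.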

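【Comments】:

【Solution 3】:

I think there are basically three simple ways to extract multiple capture groups in R (without using substitution): str_match_all, str_extract_all, and the regmatches/gregexpr combo.

I like @kohske's regex, which looks behind for an opening parenthesis (?<=\\(), looks ahead for a closing parenthesis (?=\\)), and lazily grabs everything in between with .+?. Put together: (?<=\\().+?(?=\\))

Using the same regex: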

str_match_all returns the answer as a matrix (one matrix per input string, wrapped in a list).

str_match_all(j, "(?<=\\().+?(?=\\))")

[[1]]
     [,1]    
[1,] "wonder"
[2,] "groan" 
[3,] "Laugh" 

# Subset the matrix like this....

str_match_all(j, "(?<=\\().+?(?=\\))")[[1]][,1]
[1] "wonder" "groan"  "Laugh"

str_extract_all returns the answer as a list.

str_extract_all(j,  "(?<=\\().+?(?=\\))")
[[1]]
[1] "wonder" "groan"  "Laugh" 

#Subset the list...
str_extract_all(j,  "(?<=\\().+?(?=\\))")[[1]]
[1] "wonder" "groan"  "Laugh"

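regmatches/gregexpr also returns the answer as a list. Since this is the base R option, some people prefer it. Note perl = TRUE, which the lookaround syntax in this regex requires.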

regmatches(j, gregexpr( "(?<=\\().+?(?=\\))", j, perl = T))
[[1]]
[1] "wonder" "groan"  "Laugh" 

#Subset the list...
regmatches(j, gregexpr( "(?<=\\().+?(?=\\))", j, perl = T))[[1]]
[1] "wonder" "groan"  "Laugh"

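If I have mischaracterized the most popular options, I hope the SO community will correct/edit this answer.

【Comments】:

【Solution 4】:

Using rex may make this type of task a little simpler.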

library(rex)

matches <- re_matches(j,
  rex(
    "(",
    capture(name = "text", except_any_of(")")),
    ")"),
  global = TRUE)

matches[[1]]$text
#>[1] "wonder" "groan"  "Laugh"

【Comments】: