【Question Title】: Regex: How to extract text from last parenthesis
【Posted】: 2017-06-28 01:59:03
【Question】:

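What is the right regex to extract the string "(procedure)" - or, more generally, the text inside the parentheses - from the strings below?

An example input string is

Positron emission tomography using flutemetamol (18F) with computed tomography of brain (procedure)

Another example

Urinary tract infection prophylaxis (procedure)

Possible approaches are:

  • Go to the end of the text, search backwards for the first opening parenthesis, and take the substring from that position to the end of the text

  • From the beginning of the text, identify the last '(' character and take the substring from that position to the end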
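Both approaches amount to locating the last '(' and cutting the string there. A minimal sketch in base R, assuming the label never contains a nested parenthesis; the helper name `last_paren_text` is hypothetical and not part of the original question:

```r
# Hypothetical helper: text inside the last "(...)" group, using plain
# substring operations instead of a capture-group regex.
last_paren_text <- function(x) {
  vapply(x, function(s) {
    # Positions of every literal "(" (fixed = TRUE turns off regex syntax).
    opens <- as.integer(gregexpr("(", s, fixed = TRUE)[[1]])
    if (opens[1] == -1L) return(NA_character_)   # no "(" at all
    start <- opens[length(opens)]                # position of the last "("
    rest  <- substring(s, start + 1L)            # everything after it
    # Cut at the first ")" that follows, if there is one.
    close <- as.integer(regexpr(")", rest, fixed = TRUE))
    if (close == -1L) return(NA_character_)
    substr(rest, 1L, close - 1L)
  }, character(1), USE.NAMES = FALSE)
}

labels <- last_paren_text(c(
  "Urinary tract infection prophylaxis (procedure)",
  "Positron emission tomography using flutemetamol (18F) with computed tomography of brain (procedure)",
  "Abnormal urine odor (finding)"
))
print(labels)
```

Strings without any parenthesis come back as `NA` in this sketch, which is a design choice made here rather than something the question specifies.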

Other strings could be (each with a different "tag" to extract)

[1] "Xanthoma of eyelid (disorder)"                    "Ventricular tachyarrhythmia (disorder)"          
[3] "Abnormal urine odor (finding)"                    "Coloboma of iris (disorder)"                     
[5] "Macroencephaly (disorder)"                        "Right main coronary artery thrombosis (disorder)"

(I am looking for a general regex; a solution in R would be even better.)

【Question Comments】:

    Tags: r regex


    【Solution 1】:

    If it is the last part of the string, then this regex will do it:

    /\(([^()]*)\)$/
    

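    Explanation: look for an opening (, capture everything after it that is neither ( nor ), and then require a closing ) at the very end of the string.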

    https://regex101.com/r/cEsQtf/1
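A minimal sketch of how this pattern might be used from R. The surrounding `/.../` are regex delimiters and are not part of an R pattern, and each backslash has to be doubled inside an R string literal. The sample vector and variable names are assumptions made for illustration:

```r
x <- c("Xanthoma of eyelid (disorder)",
       "Positron emission tomography using flutemetamol (18F) with computed tomography of brain (procedure)",
       "no parentheses here")

# regexec() reports the full match and each capture group;
# after regmatches(), element 1 is the full match and element 2 is group 1.
m <- regmatches(x, regexec("\\(([^()]*)\\)$", x))

# Keep group 1 where the pattern matched, NA otherwise.
solution1_labels <- vapply(
  m,
  function(g) if (length(g) >= 2) g[2] else NA_character_,
  character(1)
)
print(solution1_labels)
```

Because of the `$` anchor, the earlier "(18F)" group is ignored and only a group that closes at the end of the string is returned.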

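    【Comments】:

    • I had been looking for a solution like this. It successfully matches the last set of parentheses even when several others come before it. Elegant solution!
    • This solution worked well for me, but I ran into another case where the last set of parentheses I want to keep sometimes contains another set of parentheses. This works: FELON IN POSSESSION OF AMMUNITION (ACTUAL POSSESSION) (79023) gives 79023. This does not work: FAIL TO DISPLAY REGISTRATION - POSSESSION REQUIRED (320.0605(1)). It should give 320.0605(1). Any idea how to modify this answer to allow nested parentheses?
    • @OscarVanL I explained how to do that in my answer.
    【Solution 2】:

    sub can do this with the right regular expression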

    Text = c("Positron emission tomography using flutemetamol (18F) with computed tomography of brain (procedure)",
        "Urinary tract infection prophylaxis (procedure)", 
        "Xanthoma of eyelid (disorder)",                    
        "Ventricular tachyarrhythmia (disorder)",          
        "Abnormal urine odor (finding)",                    
        "Coloboma of iris (disorder)",                   
        "Macroencephaly (disorder)",                        
        "Right main coronary artery thrombosis (disorder)")
    sub(".*\\((.*)\\).*", "\\1", Text)
    [1] "procedure" "procedure" "disorder"  "disorder"  "finding"   "disorder" 
    [7] "disorder"  "disorder"
    

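    Addendum: detailed explanation of the regular expression
    The question asks for the contents of the final set of parentheses in the string. The expression is a little confusing because it uses parentheses in two different ways: one to represent the literal parentheses in the string being processed, and the other to set up a "capture group", which is how we specify which part the expression should return. The expression is made up of five basic units: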

    1. Initial .*   - matches everything up to the final open parenthesis. 
       Note that this is relying on "greedy matching"
    2. \\(   ...    \\)   - matches the final set of parentheses. 
       Because ( by itself means something else,  we need to "escape" the 
       parentheses by preceding them with \.  That is we want the regular
       expression to say   \(  ...  \).  However, R also uses \ as the escape
       character inside string literals, so if we just typed \( and \), R
       would try to read \( as a string escape (and reject it as an
       unrecognized escape).  So we escape the backslash: R reads the string
       \\(  ...  \\) as \( ... \), which the regex engine takes to mean the
       literal characters ( and ).
    3. ( ... )       Inside the pair in part 2
       This is making use of the special meaning of parentheses.  When we
       enclose an expression in parentheses, whatever value is inside them 
       will be stored in a variable for later use. That variable is called 
       \1,  which is what was used in the substitution pattern. Again, if 
       we just wrote \1, R would interpret it as if we were trying to escape
       the 1. Writing \\1 is interpreted as the character \ followed by 1, 
       i.e. \1.
    4. Central .*    Inside the pair in part 3
       This is what we are looking for,  all characters inside the parentheses.
    5. Final   .*
       This is in the expression to match any characters that may follow the 
       final set of parentheses. 
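The double-backslash point in items 2 and 3 can be checked directly. A small sketch, with variable names chosen for illustration, that prints the pattern as the regex engine receives it:

```r
pattern <- ".*\\((.*)\\).*"

# writeLines() shows the actual characters in the string, so each doubled
# backslash from the source code appears as a single backslash.
writeLines(pattern)

# "\\(" is two characters long: one backslash followed by "(".
escaped_len <- nchar("\\(")
pattern_len <- nchar(pattern)
print(c(escaped_len, pattern_len))
```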
    

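    The sub function uses this to replace the matched pattern (in this case, every character in the string) with the substitution pattern \1, i.e. the contents of the variable holding whatever is in the first (and, in our case, only) capture group: the text inside the final parentheses.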
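To see why the greedy leading .* matters, the following sketch compares it with a lazy .*? on a made-up string. It passes `perl = TRUE` so that the lazy quantifier follows PCRE rules; that flag and the test strings are assumptions of this example, not part of the answer above:

```r
s <- "a (b) c (d)"

# Greedy leading .* runs to the end and backtracks to the LAST "(".
greedy <- sub(".*\\((.*)\\).*", "\\1", s, perl = TRUE)

# Lazy leading .*? stops at the FIRST "(", so the greedy capture group
# then stretches all the way to the last ")".
lazy <- sub(".*?\\((.*)\\).*", "\\1", s, perl = TRUE)

# With no parentheses there is no match, and sub() returns the input unchanged.
unmatched <- sub(".*\\((.*)\\).*", "\\1", "no parentheses", perl = TRUE)

print(c(greedy, lazy, unmatched))
```

The last case is worth keeping in mind: a string without parentheses is returned whole rather than as an empty or missing value.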

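    【Comments】:

    • Could you comment on the solution? I assume \\1 refers to some defined element in the regex. It works, but it would be better to understand how it works.
    • @userJT - added to the answer
    【Solution 3】:

    You can actually use the following to extract the text inside nested parentheses at the end of a string: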

    x <- c("FELON IN POSSESSION OF AMMUNITION (ACTUAL POSSESSION) (79023)",
    "FAIL TO DISPLAY REGISTRATION - POSSESSION REQUIRED (320.0605(1))")
    sub(".*(\\(((?:[^()]++|(?1))*)\\))$", "\\2", x, perl=TRUE)
    

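    See the online R demo and the regex demo.

    Details

    • .* - any zero or more characters other than line break characters, as many as possible
    • (\(((?:[^()]++|(?1))*)\)) - Capturing group 1 (needed so that the recursion has something to refer to):
      • \( - a ( character
      • ((?:[^()]++|(?1))*) - Capturing group 2 (our value): zero or more occurrences of either one or more characters other than ( and ), or the whole Group 1 pattern
      • \) - a ) character
    • $ - end of string.

    So, when there is a match, the whole string is replaced with the value of Group 2. If there is no match, the string is left unchanged.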
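Because unmatched strings pass through sub unchanged, it can be convenient to separate them from real results. A minimal sketch of one way to do that; the wrapper name `last_group` and the choice to return `NA` are assumptions of this example, not part of the answer:

```r
# Hypothetical wrapper around the recursive pattern: returns the contents
# of the final (possibly nested) parenthesised group, or NA when the
# string does not end with such a group.
last_group <- function(x) {
  pattern <- ".*(\\(((?:[^()]++|(?1))*)\\))$"
  hit <- grepl(pattern, x, perl = TRUE)      # which strings match at all
  out <- rep(NA_character_, length(x))
  out[hit] <- sub(pattern, "\\2", x[hit], perl = TRUE)
  out
}

res <- last_group(c(
  "FELON IN POSSESSION OF AMMUNITION (ACTUAL POSSESSION) (79023)",
  "FAIL TO DISPLAY REGISTRATION - POSSESSION REQUIRED (320.0605(1))",
  "NO PARENTHESES"
))
print(res)
```

`perl = TRUE` is required in both calls, since the possessive quantifier `++` and the recursion `(?1)` are PCRE features.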

    【Comments】:
