【Question Title】: Python parser combinator for capturing text inside delimiters
【Posted】: 2020-02-20 01:31:18
【Question】:

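I'm looking at some parser combinator libraries in Python (Parsy, to be precise), and I'm currently facing the following problem, simplified into the minimal working example below: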

text = '''
AAAAAAAAAA AAAAAAAA AAAAAAAAAAAAAA
BBBBBBB START THE TEXT HERE SHOULD
BE CAPTURED STOP CCCCCCCCCC CCCCCC
'''

start, stop = r"STARTS?", r"STOPS?"
s = section(text, start, stop)

print(s)

should output:

 THE TEXT HERE SHOULD 
BE CAPTURED 

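My current solution uses a regex lookahead, and it works fine. However, my original problem involves combining many of these small regexes, which can get messy and become a maintenance burden for others later on.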

from typing import Pattern, TypeVar
import re

# A Generic type declaration.
T = TypeVar("T")

def first(text: str, pattern: str, default: T, flags=0) -> T:
    """
    Given a `text`, a regex `pattern` and a `default` value, return the first match
    in `text`. Otherwise return a `default` value if no match is found.
    """
    match = re.findall(pattern, text, flags=flags)
    return match[0] if len(match) > 0 else default

def section(text: str, begin: str, end: str) -> str:
    """
    Given a `text` and two regexes `begin` and `end`, return the captured group
    found between them. Otherwise, return an empty string if no match is found.
    """
    return first(text, fr"{begin}([\s\S]*?)(?={end})", default="")

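Parser combinators seem like a great fit for this kind of situation, but I couldn't reproduce the behavior of the working solution. Any hints are welcome: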

# A Simpler example with hardcoded stuff
from parsy import regex, seq, string

text = '''
AAAAAAAAAA AAAAAAAA AAAAAAAAAAAAAA
BBBBBBB START THE TEXT HERE SHOULD
BE CAPTURED STOP CCCCCCCCCC CCCCCC
'''

start = regex(r"STARTS?")
middle = regex(r"[\s\S]*").optional()
stop = regex(r"STOPS?")

eol = string("\n")

# Work fine
start.parse("START")
middle.parse("")
stop.parse("STOP")

section = seq(
    start,
    middle,
    stop
)
# Simpler case, breaks
section.parse("START AAA STOP")

gives:

---------------------------------------------------------------------------
ParseError                                Traceback (most recent call last)
<ipython-input-260-fdec112e1648> in <module>
     24 )
     25 # Simpler case, breaks
---> 26 section.parse("START AAA STOP")

~/.venv/lib/python3.8/site-packages/parsy/__init__.py in parse(self, stream)
     88     def parse(self, stream):
     89         """Parse a string or list of tokens and return the result or raise a ParseError."""
---> 90         (result, _) = (self << eof).parse_partial(stream)
     91         return result
     92 

~/.venv/lib/python3.8/site-packages/parsy/__init__.py in parse_partial(self, stream)
    102             return (result.value, stream[result.index:])
    103         else:
--> 104             raise ParseError(result.expected, stream, result.furthest)
    105 
    106     def bind(self, bind_fn):

ParseError: expected 'STOPS?' at 0:14


【Discussion】:

    Tags: python regex python-3.x parsing parser-combinators


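    【Solution 1】:

    The problem is that the middle parser matches text all the way to the end of the input, so there is nothing left for the stop parser to consume: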

    seq(start, middle).parse("START AAA STOP")
    

    prints

    ['START', ' AAA STOP']
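
The same greedy behavior can be reproduced with the standard `re` module alone. This is only a sketch of roughly what happens when the `middle` regex is applied to the input remaining after START; it does not use Parsy, and the variable names are illustrative.

```python
import re

# Stand-in for the `middle` parser: a greedy "match anything" regex.
middle = re.compile(r"[\s\S]*")

rest = " AAA STOP"  # what is left once START has been consumed
consumed = middle.match(rest).group()
leftover = rest[len(consumed):]

print(repr(consumed))  # ' AAA STOP'
print(repr(leftover))  # ''  (nothing remains for a STOP parser)
```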
    

    One way to avoid this behavior is to use a lookahead in the middle regex:

    middle = regex(r"[\s\S]*(?=STOP)").optional()
    

    This ensures that the matched text is followed by the word "STOP".
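
One caveat worth checking: a greedy `[\s\S]*` with a lookahead runs up to the last STOP in the input, whereas the question's regex uses the lazy `[\s\S]*?`, which halts at the first one. The sketch below shows the difference with plain `re`; the sample string is made up for illustration. Since Parsy's `regex` accepts ordinary Python patterns, the lazy form should carry over to the `middle` parser, but that is untested here.

```python
import re

# Made-up input with two START/STOP pairs.
text = "START first STOP filler START second STOP"

# Greedy: backtracks from the end, so it stops before the LAST stop marker.
greedy = re.search(r"STARTS?([\s\S]*)(?=STOPS?)", text).group(1)

# Lazy: expands one character at a time, so it stops before the FIRST one.
lazy = re.search(r"STARTS?([\s\S]*?)(?=STOPS?)", text).group(1)

print(repr(greedy))  # ' first STOP filler START second '
print(repr(lazy))    # ' first '
```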

    Alternatively, you can use Parsy's should_fail method:

    from parsy import any_char  # needed in addition to the imports in the question
    middle = (regex(r"STOPS?").should_fail("not STOP") >> any_char).many().concat()
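
To make the idea behind that combination concrete, here is a plain-Python sketch of the same loop: check that the stop pattern does not match at the current position, consume one character, and repeat. It is not Parsy's implementation, and `take_until` is a hypothetical helper name introduced only for this illustration.

```python
import re

def take_until(stop_pattern: str, text: str, pos: int = 0):
    """Consume characters one at a time until `stop_pattern` matches at the
    current position or the input ends. Returns (captured, new_pos)."""
    stop = re.compile(stop_pattern)
    chars = []
    # "should_fail" step: continue only while the stop pattern does NOT match here.
    while pos < len(text) and stop.match(text, pos) is None:
        chars.append(text[pos])  # "any_char" step: take exactly one character
        pos += 1
    return "".join(chars), pos   # "concat" step: join the characters

source = "START AAA STOP"
captured, pos = take_until(r"STOPS?", source, pos=len("START"))

print(repr(captured))      # ' AAA '
print(repr(source[pos:]))  # 'STOP'  (still available for the stop parser)
```

Because the stop pattern is tested before every character, the capture ends at the first stop marker, which matches the lazy behavior of the original regex solution.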
    

    【Comments】:

    • Thanks for the answer, @dan-oneață! Sorry for the delay in accepting it.
    【Solution 2】:

    Have you tried using split?

    Based on my understanding of your project's requirements, I would do it like this:

    text = '''
    AAAAAAAAAA AAAAAAAA AAAAAAAAAAAAAA
    BBBBBBB START THE TEXT HERE SHOULD
    BE CAPTURED STOP CCCCCCCCCC CCCCCC
    '''
    # split text at START and take the second part of the text
    # Then split the result by STOP and take the first part of the text
    s = text.split('START')[1].split('STOP')[0]
    print (s)
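
Note that `text.split('START')[1]` raises an `IndexError` when START is absent, and `str.split` only accepts literal delimiters. A possible variant, sketched below, uses `re.split` so that the delimiters can be regexes such as `STARTS?`. The helper name `section_by_split` is hypothetical, and the sketch assumes the patterns contain no capturing groups (otherwise `re.split` also returns the group contents).

```python
import re

def section_by_split(text: str, begin: str, end: str) -> str:
    """Return the text between the first `begin` match and the next `end`
    match, or an empty string if either delimiter is missing."""
    after = re.split(begin, text, maxsplit=1)
    if len(after) < 2:   # no start delimiter found
        return ""
    inside = re.split(end, after[1], maxsplit=1)
    if len(inside) < 2:  # no stop delimiter after the start
        return ""
    return inside[0]

text = '''
AAAAAAAAAA AAAAAAAA AAAAAAAAAAAAAA
BBBBBBB START THE TEXT HERE SHOULD
BE CAPTURED STOP CCCCCCCCCC CCCCCC
'''

s = section_by_split(text, r"STARTS?", r"STOPS?")
print(repr(s))  # ' THE TEXT HERE SHOULD\nBE CAPTURED '
```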
    

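    【Comments】:

    • No, split won't handle more complex patterns; "START" and "STOP" are just a simplified example. The question is what the parser combinator would look like.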