Python 中的正则表达式 match.start() 和 match.end() 示例答案

【问题标题】：Regular expressions match.start() and match.end() example in PythonPython 中的正则表达式 match.start() 和 match.end() 示例
【发布时间】：2016-07-31 23:32:28
【问题描述】：

努力掌握正则表达式，尤其是它们的match.start() 和match.end() 方法。

在玩这个代码时（找到here）：

Token = collections.namedtuple('Token', ['typ', 'value', 'line', 'column'])

def tokenize(code):
    keywords = {'IF', 'THEN', 'ENDIF', 'FOR', 'NEXT', 'GOSUB', 'RETURN'}
    token_specification = [
        ('NUMBER',  r'\d+(\.\d*)?'), # Integer or decimal number
        ('ASSIGN',  r':='),          # Assignment operator
        ('END',     r';'),           # Statement terminator
        ('ID',      r'[A-Za-z]+'),   # Identifiers
        ('OP',      r'[+\-*/]'),     # Arithmetic operators
        ('NEWLINE', r'\n'),          # Line endings
        ('SKIP',    r'[ \t]+'),      # Skip over spaces and tabs
        ('MISMATCH',r'.'),           # Any other character
    ]
    tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specification)
    line_num = 1
    line_start = 0
    for mo in re.finditer(tok_regex, code):
        kind = mo.lastgroup
        value = mo.group(kind)
        if kind == 'NEWLINE':
            line_start = mo.end()
            line_num += 1
        elif kind == 'SKIP':
            pass
        elif kind == 'MISMATCH':
            raise RuntimeError('%r unexpected on line %d' % (value, line_num))
        else:
            if kind == 'ID' and value in keywords:
                kind = value
            column = mo.start() - line_start
            yield Token(kind, value, line_num, column)

statements = '''
    IF quantity THEN
        total := total + price * quantity;
        tax := price * 0.05;
    ENDIF;
'''

for token in tokenize(statements):
    print(token)

我无法理解使用mo.end() 和mo.start() 计算行和列时的用途和逻辑。例如，如果我让NEWLINE 和SKIP 也产生Token 输出，那么列索引就会完全混乱。尝试使用 mo.end() 重新计算列索引以适应示例中提到的这种情况，但失败了。任何想法、示例代码和/或解释都会很棒。

【问题讨论】：

如果您在docs 看到一些例子（虽然我必须说我也不太了解）。
谢谢，我见过他们，但并没有变得更明智，以便实施我在描述中提到的示例案例：/
我已经更新了令牌以匹配文档，以便您提供的代码可以正确运行：如果这是您故意遗漏的，请告诉我

标签： python regex python-3.x

【解决方案1】：

这是一个我认为满足您的标准的实现：如果您可以发布您尝试过的内容，也许我们可以更好地理解您遇到的问题。

import collections
import re
Token = collections.namedtuple('Token', ['typ', 'value', 'line', 'column'])

def tokenize(code):
    keywords = {'IF', 'THEN', 'ENDIF', 'FOR', 'NEXT', 'GOSUB', 'RETURN'}
    token_specification = [
        ('NUMBER',  r'\d+(\.\d*)?'), # Integer or decimal number
        ('ASSIGN',  r':='),          # Assignment operator
        ('END',     r';'),           # Statement terminator
        ('ID',      r'[A-Za-z]+'),   # Identifiers
        ('OP',      r'[+\-*/]'),     # Arithmetic operators
        ('NEWLINE', r'\n'),          # Line endings
        ('SKIP',    r'[ \t]+'),      # Skip over spaces and tabs
        ('MISMATCH',r'.'),           # Any other character
    ]
    tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specification)
    line_num = 1
    line_start = 0
    for mo in re.finditer(tok_regex, code):
        kind = mo.lastgroup
        value = mo.group(kind)
        column = (mo.start() - line_start) + 1

        if kind == 'MISMATCH':
            raise RuntimeError('%r unexpected on line %d' % (value, line_num))
        else:
            if kind == 'ID' and value in keywords:
                kind = value
            yield Token(kind, value, line_num, column)
            if kind == 'NEWLINE':
                line_start = mo.end()
                line_num += 1


statements = '''
    IF quantity THEN 
        total := total + price * quantity;
        tax := price * 0.05;
    ENDIF;
'''

for token in tokenize(statements):
    print(token)

输出：

Token(typ='NEWLINE', value='\n', line=1, column=1)
Token(typ='SKIP', value='    ', line=2, column=1)
Token(typ='IF', value='IF', line=2, column=5)
Token(typ='SKIP', value=' ', line=2, column=7)
Token(typ='ID', value='quantity', line=2, column=8)
Token(typ='SKIP', value=' ', line=2, column=16)
Token(typ='THEN', value='THEN', line=2, column=17)
Token(typ='SKIP', value=' ', line=2, column=21)
Token(typ='NEWLINE', value='\n', line=2, column=22)
Token(typ='SKIP', value='        ', line=3, column=1)
Token(typ='ID', value='total', line=3, column=9)
Token(typ='SKIP', value=' ', line=3, column=14)
Token(typ='ASSIGN', value=':=', line=3, column=15)
Token(typ='SKIP', value=' ', line=3, column=17)
Token(typ='ID', value='total', line=3, column=18)
Token(typ='SKIP', value=' ', line=3, column=23)
Token(typ='OP', value='+', line=3, column=24)
Token(typ='SKIP', value=' ', line=3, column=25)
Token(typ='ID', value='price', line=3, column=26)
Token(typ='SKIP', value=' ', line=3, column=31)
Token(typ='OP', value='*', line=3, column=32)
Token(typ='SKIP', value=' ', line=3, column=33)
Token(typ='ID', value='quantity', line=3, column=34)
Token(typ='END', value=';', line=3, column=42)
Token(typ='NEWLINE', value='\n', line=3, column=43)
Token(typ='SKIP', value='        ', line=4, column=1)
Token(typ='ID', value='tax', line=4, column=9)
Token(typ='SKIP', value=' ', line=4, column=12)
Token(typ='ASSIGN', value=':=', line=4, column=13)
Token(typ='SKIP', value=' ', line=4, column=15)
Token(typ='ID', value='price', line=4, column=16)
Token(typ='SKIP', value=' ', line=4, column=21)
Token(typ='OP', value='*', line=4, column=22)
Token(typ='SKIP', value=' ', line=4, column=23)
Token(typ='NUMBER', value='0.05', line=4, column=24)
Token(typ='END', value=';', line=4, column=28)
Token(typ='NEWLINE', value='\n', line=4, column=29)
Token(typ='SKIP', value='    ', line=5, column=1)
Token(typ='ENDIF', value='ENDIF', line=5, column=5)
Token(typ='END', value=';', line=5, column=10)
Token(typ='NEWLINE', value='\n', line=5, column=11)

【讨论】：

为什么column 1 比它应该的少？例如第一个 Token 应该有第 1 列而不是 0。有什么想法可以解决这个问题吗？
@Karim 因为match.start 使用0 作为第一个“列”，而不是1。这在编程中很正常。打印出来的时候加1就行了
@Karim 您会注意到这也适用于原始代码：这并不明显，因为您从未在“0”列看到打印出来的内容。但是 IF 打印在 4 列而不是 5 列例如
所以在每个yield 之前你将column 加一？
是的，只是yield Token(kind, value, line_num, column+1)

【解决方案2】：

mo.start 和 mo.end 将返回匹配的开始和结束索引，以便 string[mo.start():mo.end()] 将返回匹配的字符串。每次您的示例匹配 \n 时，它都会增加跟踪当前行的 line_num 并更新 line_start 以包含当前行中第一个字符的索引。这允许程序稍后在令牌匹配时计算列：column = mo.start() - line_start。

为了说明行和列跟踪行为，我创建了一个简单的示例来查找给定字符串中的所有数字。对于每个数字，它将输出行和起始列：

import re

PATTERN = '(?P<NEWLINE>\n)|(?P<NUMBER>\d+)'
s = '''word he12re 5 there
mo912re
another line 17
'''

line = 1
line_start = 0
for mo in re.finditer(PATTERN, s):
    if mo.lastgroup == 'NEWLINE':
        # Found new line, increase line number and change line_start to
        # contain index of first character on the line
        line += 1
        line_start = mo.end()
    elif mo.lastgroup == 'NUMBER':
        # Column: index of start of the match - index of first char on line
        column = mo.start() - line_start
        print('line {0}: {1} at column {2}'.format(line, mo.group(0), column))

输出：

line 1: 12 at column 7
line 1: 5 at column 12
line 2: 912 at column 2
line 3: 17 at column 13

【讨论】：