Strange behavior regular expressions: answers

【Question title】: Strange behavior regular expressions
【Posted】: 2015-05-25 12:27:42
【Question description】:

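I am writing a program that generates tokens from assembly source code, but I have run into a strange problem.

Sometimes the code works as expected, and sometimes it does not!

Here is the code (the variable names are in Portuguese, but I have added translations):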

import re

def tokenize(code):
    tokens = []

    tokens_re = {
    'comentarios'  : '(//.*)',                         # comments
    'linhas'       : '(\n)',                           # lines
    'instrucoes'   : '(add)',                          # instructions
    'numeros_hex'  : '([-+]?0x[0-9a-fA-F]+)',          # hex numbers
    'numeros_bin'  : '([-+]?0b[0-1]+)',                # binary numbers
    'numeros_dec'  : '([-+]?[0-9]+)'}                  # decimal numbers

    #'reg32'        : 'eax|ebx|ecx|edx|esp|ebp|eip|esi',
    #'reg16'        : 'ax|bx|cx|dx|sp|bp|ip|si',
    #'reg8'         : 'ah|al|bh|bl|ch|cl|dh|dl'}

    pattern = re.compile('|'.join(list(tokens_re.values())))
    scan = pattern.scanner(code)

    while 1:
        m = scan.search()
        if not m:
            break

        tipo = list(tokens_re.keys())[m.lastindex-1]     # type
        valor = repr(m.group(m.lastindex))               # value

        if tipo == 'linhas':
            print('')

        else:
            print(tipo, valor)

    return tokens



code = '''
add eax, 5 //haha
add ebx, -5
add eax, 1234
add ebx, 1234
add ax, 0b101
add bx, -0b101
add al, -0x5
add ah, 0x5
'''

print(tokenize(code))

Here is the expected result:

instrucoes 'add'
numeros_dec '5'
comentarios '//haha'

instrucoes 'add'
numeros_dec '-5'

instrucoes 'add'
numeros_dec '1234'

instrucoes 'add'
numeros_dec '1234'

instrucoes 'add'
numeros_bin '0b101'

instrucoes 'add'
numeros_bin '-0b101'

instrucoes 'add'
numeros_hex '-0x5'

instrucoes 'add'
numeros_hex '0x5'

The problem is that, with no change to the code, it sometimes gives the expected result, but sometimes gives this:

instrucoes 'add'
numeros_dec '5'
comentarios '//haha'

instrucoes 'add'
numeros_dec '-5'

instrucoes 'add'
numeros_dec '1234'

instrucoes 'add'
numeros_dec '1234'

instrucoes 'add'
numeros_dec '0'
numeros_dec '101'

instrucoes 'add'
numeros_dec '-0'
numeros_dec '101'

instrucoes 'add'
numeros_dec '-0'
numeros_dec '5'

instrucoes 'add'
numeros_dec '0'
numeros_dec '5'

Where is the problem?

【Comments】:

Always define your regular expressions as raw strings.
@AvinashRaj Thanks for the tip! But it still does not work correctly.
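
To illustrate the raw-string advice, and why it does not change anything for the patterns in the question, here is a small sketch. The variable names are made up for this example. Only escapes that mean different things to Python and to the regex engine, such as \b, are affected.

```python
import re

# '\n' is a real newline character; r'\n' is a backslash followed by 'n',
# which the regex engine itself interprets as a newline. Both match.
plain_newline = re.search('(\n)', 'a\nb') is not None
raw_newline = re.search(r'(\n)', 'a\nb') is not None

# '\b' is a backspace character in a normal string, but a word boundary
# in a raw-string regex, so the two patterns behave differently.
plain_boundary = re.search('\badd\b', 'add eax') is not None
raw_boundary = re.search(r'\badd\b', 'add eax') is not None

print(plain_newline, raw_newline, plain_boundary, raw_boundary)
```

So raw strings are a good habit, but the '(\n)' pattern in the question already matches a newline either way.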

【Tags】: python regex python-3.x tokenize

【Solution 1】:

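You build your regex from a dictionary. Dictionaries are not ordered (before Python 3.7 the iteration order was not guaranteed, and with string hash randomization it could change from one run to the next), so the alternatives are sometimes joined in a different order, which produces different results. In an alternation the regex engine takes the first alternative that matches at a given position, so whenever numeros_dec ends up before numeros_hex or numeros_bin, the leading 0 of 0x5 or 0b101 is consumed as a decimal number.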
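
As a minimal sketch of that effect, the snippet below joins two of the patterns from the question in both possible orders and runs them over the same line. The names HEX, DEC, hex_first and dec_first are made up for this illustration.

```python
import re

# The same two alternatives, joined in the two possible orders.
HEX = r'([-+]?0x[0-9a-fA-F]+)'
DEC = r'([-+]?[0-9]+)'

hex_first = re.compile(HEX + '|' + DEC)
dec_first = re.compile(DEC + '|' + HEX)

text = 'add al, -0x5'

hex_first_matches = [m.group(0) for m in hex_first.finditer(text)]
dec_first_matches = [m.group(0) for m in dec_first.finditer(text)]

print(hex_first_matches)   # ['-0x5']
print(dec_first_matches)   # ['-0', '5']
```

The second result is the same split seen in the unexpected output of the question.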

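If you want "stable" results, I suggest you either use sorted(tokens_re.values()) or specify the patterns in a list/tuple instead of a dictionary. Whichever you pick, the lookup of the token type by m.lastindex has to go through the names in that same order, otherwise the types will be mislabeled.

For example, you can specify them as a list of pairs, and then use that list to build the pattern as well as the dictionary: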

tokens_re = [
    ('comentarios', '(//.*)'),                         # comments
    ('linhas',      '(\n)'),                           # lines
    ('instrucoes',  '(add)'),                          # instructions
    ('numeros_hex', '([-+]?0x[0-9a-fA-F]+)'),          # hex numbers
    ('numeros_bin', '([-+]?0b[0-1]+)'),                # binary numbers
    ('numeros_dec', '([-+]?[0-9]+)'),                  # decimal numbers
]
pattern = re.compile('|'.join(p for _, p in tokens_re))
tokens_re = dict(tokens_re)
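
Putting this together, a complete tokenizer could look roughly like the sketch below. This is one possible arrangement, not the only one: the names TOKEN_SPECS, PATTERN and NAMES are assumptions introduced here, the patterns are written as raw strings, and the token type is looked up in a list built from the same pairs, so it cannot drift out of step with the pattern.

```python
import re

# Order matters: the more specific number formats come before the decimal one.
TOKEN_SPECS = [
    ('comentarios', r'(//.*)'),                  # comments
    ('linhas',      r'(\n)'),                    # lines
    ('instrucoes',  r'(add)'),                   # instructions
    ('numeros_hex', r'([-+]?0x[0-9a-fA-F]+)'),   # hex numbers
    ('numeros_bin', r'([-+]?0b[01]+)'),          # binary numbers
    ('numeros_dec', r'([-+]?[0-9]+)'),           # decimal numbers
]

PATTERN = re.compile('|'.join(p for _, p in TOKEN_SPECS))
NAMES = [name for name, _ in TOKEN_SPECS]


def tokenize(code):
    """Return a list of (type, value) pairs, skipping line breaks."""
    tokens = []
    for m in PATTERN.finditer(code):
        tipo = NAMES[m.lastindex - 1]   # group i belongs to the i-th spec
        if tipo != 'linhas':
            tokens.append((tipo, m.group(m.lastindex)))
    return tokens


tokens = tokenize('add eax, 5 //haha\nadd bx, -0b101\nadd ah, 0x5\n')
for tipo, valor in tokens:
    print(tipo, repr(valor))
```

Because each alternative contains exactly one capturing group, m.lastindex identifies which entry of the list matched.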

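【Comments】:

I see! So can you tell me what the best way is to do what I want?
Sort the values, or specify them in a list/tuple, or build the whole string directly. Your choice, really, whichever you like best.
Glad to hear it :-) Also, I added an explicit suggestion to the answer that lets you specify the order you want and still get a dictionary out of it.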