在正则表达式中使用 Unicode 范围作为 Lex 的规则

【问题标题】：Using Unicode Range in Regex as a rule to Lex在正则表达式中使用 Unicode 范围作为 Lex 的规则
【发布时间】：2013-11-16 07:53:45
【问题描述】：

import re
import ply.lex as lex

#rest of the code

def t_WORD(t): #WORD is a token defined in the tokens tuple
    r'[\u0C80-\u0CFF]+'
    #rest of the actions

这个 sn-p 提供了一个错误说明非法字符。所有字符都在正则表达式规则中指定的 unicode 范围内。

可能是什么问题？提前致谢。

【问题讨论】：

标签： regex unicode python-3.x lex

【解决方案1】：

词法分析器应该与作为令牌和模式匹配规则给出的 Unicode 一起正常工作。如果您需要为 re.compile() 函数提供可选标志，请使用 lex 的 reflags 选项。

lex.lex(reflags=re.UNICODE)

作为替代方案，请参阅 How to validate kannada words 和 Python Lex-Yacc

【讨论】：