【问题标题】：How can get Python isidentifer() functionality in Python 2.6?如何在 Python 2.6 中获得 Python isidentifer() 功能？
【发布时间】：2011-02-02 10:49:25
【问题描述】：

Python 3 有一个名为 str.isidentifier 的字符串方法

如何在 Python 2.6 中获得类似的功能，而不是重写我自己的正则表达式等？

【问题讨论】：

标签： python python-3.x python-2.6 identifier

【解决方案1】：

re.match(r'[a-z_]\w*$', s, re.I)

应该做得很好。据我所知，没有任何内置方法。

【讨论】：

请注意，这不适用于 unicode 符号，例如'éllo'

【解决方案2】：

在 Python

import re
import keyword

def isidentifier(s):
    if s in keyword.kwlist:
        return False
    return re.match(r'^[a-z_][a-z0-9_]*$', s, re.I) is not None

【讨论】：

很好，但标识符可以以下划线开头。
是的，我在写这篇文章时正在考虑这个问题。

【解决方案3】：

tokenize 模块定义了一个名为 Name 的正则表达式

import re, tokenize, keyword
re.match(tokenize.Name + '$', somestr) and not keyword.iskeyword(somestr)

【讨论】：

为了这些目的，你需要'^'+tokenize.Name+'$'。
添加检查python保留字，我喜欢这个。
@Douglas S.J. De Couto - 您可以使用import keyword、keyword.iskeyword(astring) 来检查字符串是否为关键字，请参阅其文档here。
@JasonR.Coombs，说得好。终于更新了这个答案:) ^ 是通过使用 re.match 而不是 re.search 来处理的
我认为这是不正确的；它也会触发数字。至少在我的 cPython 3.4 实现中，token.Name 是 '\\w+'

【解决方案4】：

到目前为止，答案很好。我会这样写。

import keyword
import re

def isidentifier(candidate):
    "Is the candidate string an identifier in Python 2.x"
    is_not_keyword = candidate not in keyword.kwlist
    pattern = re.compile(r'^[a-z_][a-z0-9_]*$', re.I)
    matches_pattern = bool(pattern.match(candidate))
    return is_not_keyword and matches_pattern

【讨论】：

需要允许大写字母。上面的答案有类似的问题。
@Douglas：这就是re.I 标志的用途。

【解决方案5】：

我正在使用什么：

def is_valid_keyword_arg(k):
    """
    Return True if the string k can be used as the name of a valid
    Python keyword argument, otherwise return False.
    """
    # Don't allow python reserved words as arg names
    if k in keyword.kwlist:
        return False
    return re.match('^' + tokenize.Name + '$', k) is not None

【讨论】：

re.match 从行首匹配。

【解决方案6】：

我决定再试一次，因为有几个很好的建议。我会努力巩固它们。以下可以保存为 Python 模块并直接从命令行运行。如果运行，它会测试功能，因此可以证明是正确的（至少在文档证明该功能的范围内）。

import keyword
import re
import tokenize

def isidentifier(candidate):
    """
    Is the candidate string an identifier in Python 2.x
    Return true if candidate is an identifier.
    Return false if candidate is a string, but not an identifier.
    Raises TypeError when candidate is not a string.

    >>> isidentifier('foo')
    True

    >>> isidentifier('print')
    False

    >>> isidentifier('Print')
    True

    >>> isidentifier(u'Unicode_type_ok')
    True

    # unicode symbols are not allowed, though.
    >>> isidentifier(u'Unicode_content_\u00a9')
    False

    >>> isidentifier('not')
    False

    >>> isidentifier('re')
    True

    >>> isidentifier(object)
    Traceback (most recent call last):
    ...
    TypeError: expected string or buffer
    """
    # test if candidate is a keyword
    is_not_keyword = candidate not in keyword.kwlist
    # create a pattern based on tokenize.Name
    pattern_text = '^{tokenize.Name}$'.format(**globals())
    # compile the pattern
    pattern = re.compile(pattern_text)
    # test whether the pattern matches
    matches_pattern = bool(pattern.match(candidate))
    # return true only if the candidate is not a keyword and the pattern matches
    return is_not_keyword and matches_pattern

def test():
    import unittest
    import doctest
    suite = unittest.TestSuite()
    suite.addTest(doctest.DocTestSuite())
    runner = unittest.TextTestRunner()
    runner.run(suite)

if __name__ == '__main__':
    test()

【讨论】：

【解决方案7】：

目前提出的所有解决方案都不支持 Unicode 或允许在第一个字符中包含数字如果在 Python 3 上运行。

编辑：建议的解决方案只能在 Python 2 上使用，而在 Python3 上应该使用 isidentifier。这是一个适用于任何地方的解决方案：

re.match(r'^\w+$', name, re.UNICODE) and not name[0].isdigit()

基本上，它会测试某个内容是否由（至少 1 个）字符（包括数字）组成，然后检查第一个字符是否不是数字。

【讨论】：

【解决方案8】：

标识符验证无效

此线程中的所有答案似乎都在重复验证中的错误，该错误允许无效标识符的字符串被匹配。

其他答案中建议的正则表达式模式是从 tokenize.Name 构建的，它包含以下正则表达式模式 [a-zA-Z_]\w*（运行 python 2.7.15）和“$”正则表达式锚。

请参考official python 3 description of the identifiers and keywords（其中也包含与python 2相关的一段）。

在 ASCII 范围内 (U+0001..U+007F)，标识符的有效字符与 Python 2.x 中的相同：大写和小写字母 A 到 Z、下划线 _ 和，除了第一个字符，数字 0 到 9。

因此 'foo\n' 不应被视为有效标识符。

虽然有人可能会争辩说这段代码是功能性的：

>>>  class Foo():
>>>     pass
>>> f = Foo()
>>> setattr(f, 'foo\n', 'bar')
>>> dir(f)
['__doc__', '__module__', 'foo\n']
>>> print getattr(f, 'foo\n')
bar

由于换行符确实是一个有效的 ASCII 字符，它不被认为是一个字母。此外，以换行符结尾的标识符显然没有实际用途

>>> f.foo\n
SyntaxError: unexpected character after line continuation character

str.isidentifier 函数还确认这是一个无效标识符：

python3 解释器：

>>> print('foo\n'.isidentifier())
False

`$` 锚与 `\Z` 锚

引用official python2 Regular Expression syntax：

$

匹配字符串的结尾或字符串末尾的换行符之前，并且在 MULTILINE 模式下也匹配换行符之前。 foo 匹配“foo”和“foobar”，而正则表达式 foo$ 只匹配“foo”。更有趣的是，在 'foo1\nfoo2\n' 中搜索 foo.$ 通常匹配 'foo2'，但在 MULTILINE 模式下搜索 'foo1'；在 'foo\n' 中搜索单个 $ 将找到两个（空）匹配项：一个在换行符之前，一个在字符串末尾。

这会产生一个以换行符结尾的字符串以匹配为有效标识符：

>>> import tokenize
>>> import re
>>> re.match(tokenize.Name + '$', 'foo\n')
<_sre.SRE_Match at 0x3eac8e0>
>>> print m.group()
'foo'

正则表达式模式不应使用$ 锚点，而应使用\Z 锚点。再次引用：

\Z

只匹配字符串的末尾。

现在正则表达式是有效的：

>>> re.match(tokenize.Name + r'\Z', 'foo\n') is None
True

危险的影响

请参阅Luke's answer 了解另一个示例，这种弱正则表达式匹配在其他情况下可能会产生更危险的影响。

进一步阅读

Python 3 添加了对非 ascii 标识符的支持，请参阅 PEP-3131。

【讨论】：

标识符验证无效

$ 锚与 \Z 锚

危险的影响

进一步阅读

`$` 锚与 `\Z` 锚