Python Basics: Regular Expressions

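Regular Expression Syntax

A regular expression (or RE) specifies a set of strings that matches it; the functions in this module let you check whether a particular string matches a given regular expression (or whether a given regular expression matches a particular string, which comes down to the same thing).

Regular expressions can be concatenated to form new regular expressions; if A and B are both regular expressions, then AB is also a regular expression.

This module provides regular expression matching operations similar to those found in Perl.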

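Both patterns and strings to be searched can be Unicode strings as well as 8-bit strings, but the two types cannot be mixed. Similarly, when asking for a substitution, the replacement string must be of the same type as both the pattern and the search string.

Regular expressions use the backslash character ('\') to indicate special forms or to allow special characters to be used without invoking their special meaning. This collides with Python's use of the same character in string literals: to match a literal backslash, the regular expression must be \\, and each of those backslashes must in turn be written as \\ inside a regular Python string literal.

The solution is raw string notation: backslashes are not handled in any special way in a string literal prefixed with 'r'. Patterns are usually expressed in Python code using this raw string notation.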
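
As a small illustration of the raw string notation, the sketch below spells the same pattern both ways. The variable names and the sample path are made up for this example.

```python
import re

# Without a raw string, a literal backslash needs four backslashes:
# doubled once for the Python string literal, once for the regex engine.
plain_pattern = "\\\\"   # the two-character regex \\
raw_pattern = r"\\"      # the same regex written as a raw string

print(plain_pattern == raw_pattern)   # True

path = "C:\\temp"        # the 7-character string C:\temp
backslash_match = re.search(raw_pattern, path)
print(backslash_match.start())        # 2
```

The raw form is easier to read and to check against the regular expression it is meant to represent.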

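Most regular expression operations are available both as module-level functions and as methods on compiled regular expressions. The functions are shortcuts that don't require you to compile a regular expression object first, but they miss some fine-tuning parameters.

Regular expressions can contain both special and ordinary characters. Most ordinary characters are the simplest regular expressions; they simply match themselves. (In the rest of this section, REs are written without quotes, and strings to be matched are written 'in single quotes'.)

Some characters, like '|' or '(', are special. The null byte can be specified in a pattern with a notation such as '\x00'.

The special characters are: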

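'.'

(Dot.) In the default mode, this matches any character except a newline. If the DOTALL flag has been specified, this matches any character including a newline.

'^'

(Caret.) Matches the start of the string, and in MULTILINE mode also matches immediately after each newline.

'$'

Matches the end of the string or just before the newline at the end of the string, and in MULTILINE mode also matches before a newline. Searching for a single $ in 'foo\n' will find two (empty) matches: one just before the newline, and one at the end of the string.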

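'*'

Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible. ab* will match 'a', 'ab', or 'a' followed by any number of 'b's.

'+'

Causes the resulting RE to match 1 or more repetitions of the preceding RE. ab+ will match 'a' followed by any non-zero number of 'b's; it will not match just 'a'.

'?'

Causes the resulting RE to match 0 or 1 repetitions of the preceding RE. ab? will match either 'a' or 'ab'.

*?, +?, ??

The '*', '+', and '?' qualifiers are all greedy; they match as much text as possible. Adding '?' after the qualifier makes it perform the match in a non-greedy or minimal fashion, so that as few characters as possible are matched. If the RE <.*> is matched against '<a> b <c>', it will match the entire string, whereas <.*?> will match only '<a>'.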
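
The difference between the greedy and non-greedy qualifiers can be checked directly. This is a minimal sketch; the variable names are chosen for the example only.

```python
import re

html = "<a> b <c>"

greedy = re.match(r"<.*>", html)    # '*' takes as much text as possible
lazy = re.match(r"<.*?>", html)     # '*?' takes as little text as possible

print(greedy.group(0))   # <a> b <c>
print(lazy.group(0))     # <a>
```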

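{m}

Specifies that exactly m copies of the previous RE should be matched; fewer matches cause the entire RE not to match. For example, a{6} will match exactly six 'a' characters, but not five.

{m,n}

Causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as many repetitions as possible. Omitting m specifies a lower bound of zero, and omitting n specifies an infinite upper bound. The comma may not be omitted, or the modifier would be confused with the previously described form.

{m,n}?

Causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as few repetitions as possible. This is the non-greedy version of the previous qualifier. For example, on the 6-character string 'aaaaaa', a{3,5} will match 5 'a' characters, while a{3,5}? will only match 3 characters.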
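
The counted qualifiers described above can be tried on the same 6-character string. This is a short sketch with made-up variable names.

```python
import re

text = "aaaaaa"

greedy_run = re.match(r"a{3,5}", text).group(0)    # as many as allowed
lazy_run = re.match(r"a{3,5}?", text).group(0)     # as few as allowed

print(greedy_run)   # aaaaa
print(lazy_run)     # aaa

# {m} demands exactly m copies.
print(re.fullmatch(r"a{6}", text) is not None)      # True
print(re.fullmatch(r"a{6}", "aaaaa") is not None)   # False
```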

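'\'

Either escapes special characters (permitting you to match characters like '*', '?', and so forth), or signals a special sequence; special sequences are discussed below.

If you're not using a raw string to express the pattern, remember that Python also uses the backslash as an escape sequence in string literals. This is complicated and hard to understand, so it's highly recommended that you use raw strings for all but the simplest expressions.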

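[]

Used to indicate a set of characters. In a set:

Characters can be listed individually, e.g. [amk] will match 'a', 'm', or 'k'.
Ranges of characters can be indicated by giving two characters and separating them by a '-', for example [a-z]. If - is escaped (e.g. [a\-z]) or placed as the first or last character (e.g. [a-]), it will match a literal '-'.
Special characters lose their special meaning inside sets. For example, [(+*)] will match any of the literal characters '(', '+', '*', or ')'.
Character classes such as \w or \S are also accepted inside a set, although the characters they match depend on whether ASCII or LOCALE mode is in force.
Characters that are not within a range can be matched by complementing the set: if the first character of the set is '^', all the characters that are not in the set will be matched. ^ has no special meaning if it is not the first character in the set.
To match a literal ']' inside a set, precede it with a backslash, or place it at the beginning of the set. For example, both [()[\]{}] and []()[{}] will match a parenthesis.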
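
The rules for sets listed above can be exercised one by one. The sample strings below are invented for illustration.

```python
import re

# Characters listed individually.
listed = re.findall(r"[amk]", "make")
print(listed)        # ['m', 'a', 'k']

# Special characters are literal inside a set.
literals = re.findall(r"[(+*)]", "(1+2)*3")
print(literals)      # ['(', '+', ')', '*']

# A leading '^' complements the set.
non_digits = re.findall(r"[^0-9]", "a1b2")
print(non_digits)    # ['a', 'b']

# ']' is literal when escaped or when placed first in the set.
escaped = re.findall(r"[()[\]{}]", "f(x)[0]{}")
leading = re.findall(r"[]()[{}]", "f(x)[0]{}")
print(escaped == leading)   # True
```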

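'|'

A|B, where A and B can be arbitrary REs, creates a regular expression that will match either A or B. To match a literal '|', use \|, or enclose it inside a character class, as in [|].

(...)

Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group; the contents of a group can be retrieved after a match has been performed, and can be matched later in the string with the \number special sequence, described below. To match the literals '(' or ')', use \( or \), or enclose them inside a character class: [(] [)].

(?...)

This is an extension notation (a '?' following a '(' is not meaningful otherwise). The first character after the '?' determines the meaning and further syntax of the construct. The following are the currently supported extensions.

(?aiLmsux)

(One or more letters from the set 'a', 'i', 'L', 'm', 's', 'u', 'x'.) The group matches the empty string; the letters set the corresponding flags for the entire regular expression. This is useful if you wish to include the flags as part of the regular expression, instead of passing a flag argument to the re.compile() function.

Note that the (?x) flag changes how the expression is parsed. It should be used first in the expression string, or after one or more whitespace characters. If there are non-whitespace characters before the flag, the results are undefined.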

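(?:...)

A non-capturing version of regular parentheses. Matches whatever regular expression is inside the parentheses, but the substring matched by the group cannot be retrieved after performing a match or referenced later in the pattern.

(?P<name>...)

Similar to regular parentheses, but the substring matched by the group is accessible via the symbolic group name name. A symbolic group is also a numbered group, just as if the group were not named (that is, the name is an extra alias in addition to the group's number).

Named groups can be referenced in three contexts. If the pattern is (?P<quote>['"]).*?(?P=quote) (i.e. matching a string quoted with either single or double quotes):

Context of reference to group "quote"	Ways to reference it
in the same pattern itself	(?P=quote) (as shown), \1
when processing match object m	m.group('quote') (etc.)
in a string passed to the repl argument of re.sub()	\g<quote>, \g<1>, \1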
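
A brief sketch of the three contexts, using the quoted-string pattern from the text. The sample sentences and variable names are assumptions made for this example.

```python
import re

quoted = re.compile(r"""(?P<quote>['"]).*?(?P=quote)""")

# Context 1: (?P=quote) inside the pattern forces the closing quote
# to be the same character as the opening one.
m = quoted.search('say "hi" now')
print(m.group(0))         # "hi"

# Context 2: on the match object, by name or by number.
print(m.group("quote") == m.group(1))   # True

# Context 3: in a replacement string passed to sub().
swapped = quoted.sub(r"\g<quote>...\g<quote>", "say 'hi' now")
print(swapped)            # say '...' now
```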

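(?P=name)

A backreference to a named group; it matches whatever text was matched by the earlier group named name.

(?#...)

A comment; the contents of the parentheses are simply ignored.

(?=...)

Matches if ... matches next, but doesn't consume any of the string. This is called a lookahead assertion. For example, Isaac (?=Asimov) will match 'Isaac ' only if it's followed by 'Asimov'.

(?!...)

Matches if ... doesn't match next. This is a negative lookahead assertion. For example, Isaac (?!Asimov) will match 'Isaac ' only if it's not followed by 'Asimov'.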
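
Both lookahead assertions can be checked with the example from the text. The second surname is made up for the comparison.

```python
import re

followed = re.compile(r"Isaac (?=Asimov)")
not_followed = re.compile(r"Isaac (?!Asimov)")

# The assertion is checked but not consumed: only 'Isaac ' is matched.
print(repr(followed.search("Isaac Asimov").group(0)))      # 'Isaac '
print(followed.search("Isaac Newton"))                     # None

print(repr(not_followed.search("Isaac Newton").group(0)))  # 'Isaac '
print(not_followed.search("Isaac Asimov"))                 # None
```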

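(?<=...)

Matches if the current position in the string is preceded by a match for ... that ends at the current position. This is called a positive lookbehind assertion. (?<=abc)def will find a match in 'abcdef', since the lookbehind will back up 3 characters and check if the contained pattern matches. The contained pattern must only match strings of some fixed length, meaning that abc or a|b are allowed, but a* and a{3,4} are not. Note that patterns which start with positive lookbehind assertions will not match at the beginning of the string being searched; you will most likely want to use the search() function rather than the match() function: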

>>> import re
>>> m = re.search('(?<=abc)def', 'abcdef')
>>> m.group(0)
'def'

This example looks for a word following a hyphen:

>>> m = re.search(r'(?<=-)\w+', 'spam-egg')
>>> m.group(0)
'egg'

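Changed in version 3.5: Added support for group references of fixed length.

(?<!...)

Matches if the current position in the string is not preceded by a match for .... This is called a negative lookbehind assertion. As with positive lookbehind assertions, the contained pattern must only match strings of some fixed length. Patterns which start with negative lookbehind assertions may match at the beginning of the string being searched.

(?(id/name)yes-pattern|no-pattern)

Will try to match with yes-pattern if the group with the given id or name exists, and with no-pattern if it doesn't. no-pattern is optional and can be omitted. For example, (<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$) is a poor email matching pattern, which will match with '<user@host.com>' as well as 'user@host.com', but not with '<user@host.com' nor 'user@host.com>'.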
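
The conditional pattern from the text can be run against the four address forms it mentions. This is a sketch only; the list and variable names are chosen for the example.

```python
import re

# Group 1 is the optional '<'. If it took part in the match, a closing
# '>' is required; otherwise the address must run to the end of the string.
email = re.compile(r"(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)")

candidates = ["<user@host.com>", "user@host.com",
              "<user@host.com", "user@host.com>"]
verdicts = [bool(email.match(c)) for c in candidates]
print(verdicts)   # [True, True, False, False]
```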

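The special sequences consist of '\' and a character from the list below. If the ordinary character is not on the list, then the resulting RE will match the second character. For example, \$ matches the character '$'.

\number

Matches the contents of the group of the same number. Groups are numbered starting from 1. For example, (.+) \1 matches 'the the' or '55 55', but not 'thethe' (note the space after the group). Inside the '[' and ']' of a character class, all numeric escapes are treated as characters.

\A

Matches only at the start of the string.

\b

Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of word characters, so r'\bfoo\b' matches 'foo', 'foo.', '(foo)', 'bar foo baz' but not 'foobar' or 'foo3'.

Inside a character range, \b represents the backspace character, for compatibility with Python's string literals.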
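
The word-boundary behaviour, and the different meaning of \b inside a character class, can be seen in a few lines. The sample list reuses the strings named in the text.

```python
import re

word = re.compile(r"\bfoo\b")
samples = ["foo", "foo.", "(foo)", "bar foo baz", "foobar", "foo3"]
results = [bool(word.search(s)) for s in samples]
print(results)   # [True, True, True, True, False, False]

# Inside a character class, \b stands for the backspace character.
backspace = re.search(r"[\b]", "a\bc")
print(backspace.group(0) == "\x08")   # True
```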

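\B

Matches the empty string, but only when it is not at the beginning or end of a word. \B is just the opposite of \b, so word characters are Unicode alphanumerics or the underscore, although this can be changed by using the ASCII flag.

\d

For Unicode (str) patterns: Matches any Unicode decimal digit. This includes [0-9], and also many other digit characters. If the ASCII flag is used, only [0-9] is matched (but the flag affects the entire regular expression, so in such cases using an explicit [0-9] may be a better choice).
For 8-bit (bytes) patterns: Matches any decimal digit; this is equivalent to [0-9].

\D

Matches any character which is not a Unicode decimal digit. This is the opposite of \d. If the ASCII flag is used this becomes the equivalent of [^0-9] (but the flag affects the entire regular expression, so in such cases using an explicit [^0-9] may be a better choice).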
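
To make the difference between Unicode and ASCII matching concrete, the sketch below applies \d to a string holding one ASCII digit and one non-ASCII digit. The sample string is an assumption chosen for the example.

```python
import re

sample = "1\u0662"   # ASCII digit one, then ARABIC-INDIC DIGIT TWO

unicode_digits = re.findall(r"\d", sample)           # Unicode-aware by default
ascii_digits = re.findall(r"\d", sample, re.ASCII)   # \d restricted to [0-9]
explicit_digits = re.findall(r"[0-9]", sample)       # explicit ASCII range

print(len(unicode_digits))   # 2
print(ascii_digits)          # ['1']
print(explicit_digits)       # ['1']
```

The explicit range affects only this one part of the pattern, whereas the ASCII flag changes \w, \s, \b and the other classes in the same expression as well.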

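\s

For Unicode (str) patterns: Matches Unicode whitespace characters (which includes [ \t\n\r\f\v], and also many other characters). If the ASCII flag is used, only [ \t\n\r\f\v] is matched.
For 8-bit (bytes) patterns: Matches characters considered whitespace in the ASCII character set; this is equivalent to [ \t\n\r\f\v].

\S

Matches any character which is not a Unicode whitespace character. This is the opposite of \s. If the ASCII flag is used this becomes the equivalent of [^ \t\n\r\f\v].

\w

For Unicode (str) patterns: Matches Unicode word characters; this includes most characters that can be part of a word in any language, as well as numbers and the underscore. If the ASCII flag is used, only [a-zA-Z0-9_] is matched (but the flag affects the entire regular expression, so in such cases using an explicit [a-zA-Z0-9_] may be a better choice).
For 8-bit (bytes) patterns: Matches characters considered alphanumeric in the ASCII character set; this is equivalent to [a-zA-Z0-9_].

\W

Matches any character which is not a Unicode word character. This is the opposite of \w. If the ASCII flag is used this becomes the equivalent of [^a-zA-Z0-9_] (but the flag affects the entire regular expression, so in such cases using an explicit [^a-zA-Z0-9_] may be a better choice).

\Z

Matches only at the end of the string.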

Most of the standard escapes supported by Python string literals are also accepted by the regular expression parser:

\a      \b      \f      \n
\r      \t      \u      \U
\v      \x      \\

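(Note that \b is used to represent word boundaries, and means "backspace" only inside character classes.)

'\u' and '\U' escape sequences are only recognized in Unicode patterns. In bytes patterns they are not treated specially.

Octal escapes are included in a limited form. If the first digit is a 0, or if there are three octal digits, it is considered an octal escape. Otherwise, it is a group reference. As for string literals, octal escapes are always at most three digits in length.

Changed in version 3.3: The '\u' and '\U' escape sequences have been added.

Deprecated since version 3.5: Unknown escapes consisting of '\' and an ASCII letter now raise a deprecation warning and will be forbidden in Python 3.6.

See also
Mastering Regular Expressions
Book on regular expressions by Jeffrey Friedl, published by O'Reilly. The second edition of the book no longer covers Python at all, but the first edition covered writing good regular expression patterns in great detail.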

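Module Contents

The module defines several functions, constants, and an exception. Some of the functions are simplified versions of the full featured methods for compiled regular expressions. Most non-trivial applications always use the compiled form.

re.compile(pattern, flags=0)

Compile a regular expression pattern into a regular expression object, which can be used for matching using its match() and search() methods, described below.

The expression's behaviour can be modified by specifying a flags value. Values can be combined using bitwise OR (the | operator).

The sequence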

prog = re.compile(pattern)
result = prog.match(string)

is equivalent to

result = re.match(pattern, string)

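but using re.compile() and saving the resulting regular expression object for reuse is more efficient when the expression will be used several times in a single program.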
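
A minimal sketch of the compile-once, reuse-many-times pattern, which also shows flags combined with the | operator. The pattern, the sample lines and the variable names are all invented for illustration.

```python
import re

# Compile once, then reuse the object for every line.
number = re.compile(r"\d+")
lines = ["order 12", "no digits", "item 7 of 9"]

first_numbers = []
for line in lines:
    m = number.search(line)
    first_numbers.append(m.group(0) if m else None)
print(first_numbers)   # ['12', None, '7']

# Flags are combined with the bitwise OR operator.
item_start = re.compile(r"^item", re.IGNORECASE | re.MULTILINE)
starts = item_start.findall("Item 1\nitem 2")
print(starts)          # ['Item', 'item']
```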

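re.search(pattern, string, flags=0): Scan through string looking for the first location where the regular expression pattern produces a match, and return a corresponding match object. Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.

re.match(pattern, string, flags=0)

If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding match object. Return None if the string does not match the pattern; note that this is different from a zero-length match.

Note that even in MULTILINE mode, re.match() will only match at the beginning of the string and not at the beginning of each line.

If you want to locate a match anywhere in string, use search() instead (see also search() vs. match()).

re.fullmatch(pattern, string, flags=0): If the whole string matches the regular expression pattern, return a corresponding match object. Return None if the string does not match the pattern; note that this is different from a zero-length match.

New in version 3.4.

re.split(pattern, string, maxsplit=0, flags=0): Split string by the occurrences of pattern. If capturing parentheses are used in pattern, then the text of all groups in the pattern is also returned as part of the resulting list. If maxsplit is nonzero, at most maxsplit splits occur, and the remainder of the string is returned as the final element of the list.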

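>>> re.split(r'\W+', 'Words, words, words.')
['Words', 'words', 'words', '']
>>> re.split(r'(\W+)', 'Words, words, words.')
['Words', ', ', 'words', ', ', 'words', '.', '']
>>> re.split(r'\W+', 'Words, words, words.', 1)
['Words', 'words, words.']
>>> re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE)
['0', '3', '9']

If there are capturing groups in the separator and it matches at the start of the string, the result will start with an empty string. The same holds for the end of the string:

>>> re.split(r'(\W+)', '...words, words...')
['', '...', 'words', ', ', 'words', '...', '']

That way, separator components are always found at the same relative indices within the result list.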

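Note

split() doesn't currently split a string on an empty pattern match. For example:

>>> re.split('x*', 'axbc')
['a', 'bc']

Even though 'x*' also matches 0 'x' before 'a', between 'b' and 'c', and after 'c', these matches are currently ignored. (This is the Python 3.5 behaviour that this text describes; since Python 3.7, split() also splits on empty matches.)

Patterns that can only match empty strings never split the string, and in Python 3.5 they are rejected with a ValueError: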

>>> re.split("^$", "foo\n\nbar\n", flags=re.M)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  ...
ValueError: split() requires a non-empty pattern match.

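Changed in version 3.1: Added the optional flags argument.

Changed in version 3.5: Patterns that can only match empty strings are now rejected.

re.findall(pattern, string, flags=0): Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.

re.finditer(pattern, string, flags=0): Return an iterator yielding match objects over all non-overlapping matches for the RE pattern in string. The string is scanned left-to-right, and matches are returned in the order found. Empty matches are included in the result unless they touch the beginning of another match.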
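
How the shape of the findall() result depends on the number of groups, and what finditer() adds, can be sketched as follows. The sample text and names are assumptions for this example.

```python
import re

text = "x=1, y=22"

whole = re.findall(r"\w=\d+", text)        # no groups: whole matches
one = re.findall(r"\w=(\d+)", text)        # one group: that group's text
pairs = re.findall(r"(\w)=(\d+)", text)    # several groups: tuples

print(whole)   # ['x=1', 'y=22']
print(one)     # ['1', '22']
print(pairs)   # [('x', '1'), ('y', '22')]

# finditer() yields match objects, so positions are available as well.
spans = [m.span() for m in re.finditer(r"\w=\d+", text)]
print(spans)   # [(0, 3), (5, 9)]
```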

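re.sub(pattern, repl, string, count=0, flags=0)

Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn't found, string is returned unchanged. repl can be a string or a function; if it is a string, any backslash escapes in it are processed, and backreferences such as \1 are replaced with the substring matched by the corresponding group. For example: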

>>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
...        r'static PyObject*\npy_\1(void)\n{',
...        'def myfunc():')
'static PyObject*\npy_myfunc(void)\n{'

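If repl is a function, it is called for every non-overlapping occurrence of pattern. The function takes a single match object argument, and returns the replacement string. For example: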

>>> def dashrepl(matchobj):
...     if matchobj.group(0) == '-': return ' '
...     else: return '-'
>>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
'pro--gram files'
>>> re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam', flags=re.IGNORECASE)
'Baked Beans & Spam'

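The pattern may be a string or an RE object.

The optional argument count is the maximum number of pattern occurrences to be replaced; count must be a non-negative integer. If omitted or zero, all occurrences will be replaced. Empty matches for the pattern are replaced only when not adjacent to a previous match, so sub('x*', '-', 'abc') returns '-a-b-c-'.

In string-type repl arguments, in addition to the character escapes and backreferences described above, \g<name> will use the substring matched by the group named name, as defined by the (?P<name>...) syntax. \g<number> uses the corresponding group number; \g<2> is therefore equivalent to \2, but isn't ambiguous in a replacement such as \g<2>0. The backreference \g<0> substitutes in the entire substring matched by the RE.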
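
The \g<...> forms can be tried on a small pattern with two named groups. The date pattern and the variable names are hypothetical and serve only as an illustration.

```python
import re

date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")

by_name = date.sub(r"\g<month>/\g<year>", "2016-05")
print(by_name)       # 05/2016

# \g<2>0 means group 2 followed by a literal zero.
by_number = date.sub(r"\g<2>0", "2016-05")
print(by_number)     # 050

# \g<0> stands for the entire match.
bracketed = date.sub(r"[\g<0>]", "on 2016-05")
print(bracketed)     # on [2016-05]

# Empty matches are replaced only when not adjacent to a previous match.
dashed = re.sub("x*", "-", "abc")
print(dashed)        # -a-b-c-
```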

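Changed in version 3.1: Added the optional flags argument.

Changed in version 3.5: Unmatched groups are replaced with an empty string.

Deprecated since version 3.5: Unknown escapes consisting of '\' and an ASCII letter now raise a deprecation warning and will be forbidden in Python 3.6.

re.subn(pattern, repl, string, count=0, flags=0): Perform the same operation as sub(), but return a tuple (new_string, number_of_subs_made).

Changed in version 3.1: Added the optional flags argument.

Changed in version 3.5: Unmatched groups are replaced with an empty string.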
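
A short sketch of subn(), which is convenient when the number of replacements matters. The input string is made up.

```python
import re

# Collapse every run of whitespace into a single space.
new_text, count = re.subn(r"\s+", " ", "a  b \t c")

print(new_text)   # a b c
print(count)      # 2
```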

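re.escape(pattern): Escape all the characters in pattern except ASCII letters, numbers and '_'. This is useful if you want to match an arbitrary literal string that may have regular expression metacharacters in it.

Changed in version 3.3: The '_' character is no longer escaped.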
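
The following sketch shows why escaping matters when a literal string contains metacharacters. The sample string is an assumption; the exact escaped text is not printed because the set of characters that get escaped differs between Python versions.

```python
import re

literal = "1+1=2 (really?)"

# Unescaped, '+' is a quantifier, so the pattern looks for "11", "111", ...
unescaped_hit = re.search(r"1+1", literal)
print(unescaped_hit)   # None

# Escaped, every metacharacter is matched literally.
escaped_hit = re.fullmatch(re.escape(literal), literal)
print(escaped_hit is not None)   # True
```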

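re.purge(): Clear the regular expression cache.

exception re.error(msg, pattern=None, pos=None)

Exception raised when a string passed to one of the functions here is not a valid regular expression (for example, it might contain unmatched parentheses) or when some other error occurs during compilation or matching. It is never an error if a string contains no match for a pattern. The error instance has the following additional attributes:

msg: The unformatted error message.

pattern: The regular expression pattern.

pos: The index in pattern where compilation failed.

lineno: The line corresponding to pos.

colno: The column corresponding to pos.

Changed in version 3.5: Added additional attributes.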
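
These attributes can be inspected by compiling a deliberately broken pattern. This is a sketch for Python 3.5 or later; the pattern and the variable name caught are chosen for the example.

```python
import re

caught = None
try:
    re.compile(r"a(b")          # the parenthesis is never closed
except re.error as exc:
    caught = exc                # keep a reference for use after the block

print(caught.msg)               # the unformatted message
print(caught.pattern)           # a(b
print(caught.pos)               # 1, the index of the offending '('
print(caught.lineno, caught.colno)
```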

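Regular Expression Objects

Compiled regular expression objects support the following methods and attributes:

regex.search(string[, pos[, endpos]])

Scan through string looking for a location where this regular expression produces a match, and return a corresponding match object. Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.

The optional second parameter pos gives an index in the string where the search is to start; it defaults to 0. This is not completely equivalent to slicing the string; the '^' pattern character matches at the real beginning of the string and at positions just after a newline, but not necessarily at the index where the search is to start.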

>>> pattern = re.compile("d")
>>> pattern.search("dog")     # Match at index 0
<_sre.SRE_Match object; span=(0, 1), match='d'>
>>> pattern.search("dog", 1)  # No match; search doesn't include the "d"

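regex.match(string[, pos[, endpos]])

If zero or more characters at the beginning of string match this regular expression, return a corresponding match object. Return None if the string does not match the pattern; note that this is different from a zero-length match.

The optional pos and endpos parameters have the same meaning as for the search() method.

>>> pattern = re.compile("o")
>>> pattern.match("dog")      # No match as "o" is not at the start of "dog".
>>> pattern.match("dog", 1)   # Match as "o" is the 2nd character of "dog".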
<_sre.SRE_Match object; span=(1, 2), match='o'>

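If you want to locate a match anywhere in string, use search() instead (see also search() vs. match()).

regex.fullmatch(string[, pos[, endpos]])

If the whole string matches this regular expression, return a corresponding match object. Return None if the string does not match the pattern; note that this is different from a zero-length match.

The optional pos and endpos parameters have the same meaning as for the search() method.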

>>> pattern = re.compile("o[gh]")
>>> pattern.fullmatch("dog")      # No match as "o" is not at the start of "dog".
>>> pattern.fullmatch("ogre")     # No match as not the full string matches.
>>> pattern.fullmatch("doggie", 1, 3)   # Matches within given limits.
<_sre.SRE_Match object; span=(1, 3), match='og'>

New in version 3.4.

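regex.split(string, maxsplit=0): Identical to the split() function, using the compiled pattern.

regex.findall(string[, pos[, endpos]]): Similar to the findall() function, using the compiled pattern, but also accepts optional pos and endpos parameters that limit the search region like for match().

regex.finditer(string[, pos[, endpos]]): Similar to the finditer() function, using the compiled pattern, but also accepts optional pos and endpos parameters that limit the search region like for match().

regex.sub(repl, string, count=0): Identical to the sub() function, using the compiled pattern.

regex.subn(repl, string, count=0): Identical to the subn() function, using the compiled pattern.

flags: The regex matching flags. This is a combination of the flags given to compile(), any (?...) inline flags in the pattern, and implicit flags such as UNICODE if the pattern is a Unicode string.

groups: The number of capturing groups in the pattern.

groupindex: A dictionary mapping any symbolic group names defined by (?P<id>) to group numbers. The dictionary is empty if no symbolic groups were used in the pattern.

pattern: The pattern string from which the RE object was compiled.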
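
The attributes can be read directly off a compiled object. The pattern and the name rx below are invented for this sketch.

```python
import re

rx = re.compile(r"(?P<first>\w+) (?P<last>\w+)", re.IGNORECASE)

print(rx.groups)                          # 2
print(dict(rx.groupindex))                # {'first': 1, 'last': 2}
print(rx.pattern)                         # the original pattern string

# flags holds the explicit flag plus the implicit UNICODE flag,
# because the pattern is a str.
print(bool(rx.flags & re.IGNORECASE))     # True
print(bool(rx.flags & re.UNICODE))        # True
```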

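Match Objects

Match objects always have a boolean value of True. Since match() and search() return None when there is no match, you can test whether there was a match with a simple if statement:

match = re.search(pattern, string)
if match:
    process(match)

Match objects support the following methods and attributes:

match.expand(template): Return the string obtained by doing backslash substitution on the template string template, as done by the sub() method. Escapes such as \n are converted to the appropriate characters, and numeric backreferences (\1, \2) and named backreferences (\g<1>, \g<name>) are replaced by the contents of the corresponding group.

Changed in version 3.5: Unmatched groups are replaced with an empty string.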
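
A minimal sketch of expand() with one named and one numbered backreference. The input string and the template are made up.

```python
import re

m = re.match(r"(?P<user>\w+)@(\w+)", "alice@example")

# \g<user> refers to the named group, \2 to the second group.
expanded = m.expand(r"\g<user> at \2")
print(expanded)   # alice at example
```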

Regular Expression Examples

Checking for a Pair

In this example, we'll use the following helper function to display match objects a little more gracefully:

def displaymatch(match):
    if match is None:
        return None
    return '<Match: %r, groups=%r>' % (match.group(), match.groups())

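Suppose you are writing a poker program where a player's hand is represented as a 5-character string with each character representing a card, "a" for ace, "k" for king, "q" for queen, "j" for jack, "t" for 10, and "2" through "9" representing the card with that value.

To see if a given string is a valid hand, one could do the following: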

>>> valid = re.compile(r"^[a2-9tjqk]{5}$")
>>> displaymatch(valid.match("akt5q"))  # Valid.
"<Match: 'akt5q', groups=()>"
>>> displaymatch(valid.match("akt5e"))  # Invalid.
>>> displaymatch(valid.match("akt"))    # Invalid.
>>> displaymatch(valid.match("727ak"))  # Valid.
"<Match: '727ak', groups=()>"

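That last hand, "727ak", contained a pair, or two of the same valued cards. To match this with a regular expression, one could use backreferences as such: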

>>> pair = re.compile(r".*(.).*\1")
>>> displaymatch(pair.match("717ak"))     # Pair of 7s.
"<Match: '717', groups=('7',)>"
>>> displaymatch(pair.match("718ak"))     # No pairs.
>>> displaymatch(pair.match("354aa"))     # Pair of aces.
"<Match: '354aa', groups=('a',)>"

To find out what card the pair consists of, one could use the group() method of the match object in the following manner:

>>> pair.match("717ak").group(1)
'7'

# Error because re.match() returns None, which doesn't have a group() method:
>>> pair.match("718ak").group(1)
Traceback (most recent call last):
  File "<pyshell#23>", line 1, in <module>
    pair.match("718ak").group(1)
AttributeError: 'NoneType' object has no attribute 'group'

>>> pair.match("354aa").group(1)
'a'

Simulating scanf()

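Python does not currently have an equivalent to scanf(). Regular expressions are generally more powerful, though also more verbose, than scanf() format strings. The table below offers some more-or-less equivalent mappings between scanf() format tokens and regular expressions.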

Token	Regular Expression
%c	.
%5c	.{5}
%d	[-+]?\d+
%g	[-+]?(\d+(\.\d*)?|\.\d+)([eE][-+]?\d+)?
%i	[-+]?(0[xX][\dA-Fa-f]+|0[0-7]*|\d+)
%o	[-+]?[0-7]+
%s	\S+
%u	\d+
%X	[-+]?(0[xX])?[\dA-Fa-f]+

To extract the filename and numbers from a string like

/usr/sbin/sendmail - 0 errors, 4 warnings

you would use a scanf() format like

%s - %d errors, %d warnings

The equivalent regular expression would be

(\S+) - (\d+) errors, (\d+) warnings
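
Applied to the sample line, the equivalent regular expression can be used as sketched below. Unlike scanf(), the groups come back as strings, so the numeric fields are converted explicitly; the variable names are assumptions for this example.

```python
import re

line = "/usr/sbin/sendmail - 0 errors, 4 warnings"
m = re.match(r"(\S+) - (\d+) errors, (\d+) warnings", line)

filename = m.group(1)
n_errors = int(m.group(2))      # groups are strings; convert as needed
n_warnings = int(m.group(3))

print(filename, n_errors, n_warnings)   # /usr/sbin/sendmail 0 4
```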

search() vs. match()

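Python offers two different primitive operations based on regular expressions: re.match() checks for a match only at the beginning of the string, while re.search() checks for a match anywhere in the string (this is what Perl does by default).

For example: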

>>> re.match("c", "abcdef")    # No match
>>> re.search("c", "abcdef")   # Match
<_sre.SRE_Match object; span=(2, 3), match='c'>

Regular expressions beginning with '^' can be used with search() to restrict the match at the beginning of the string:

>>> re.match("c", "abcdef")    # No match
>>> re.search("^c", "abcdef")  # No match
>>> re.search("^a", "abcdef")  # Match
<_sre.SRE_Match object; span=(0, 1), match='a'>

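Note however that in MULTILINE mode match() only matches at the beginning of the string, whereas using search() with a regular expression beginning with '^' will match at the beginning of each line.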

>>> re.match('X', 'A\nB\nX', re.MULTILINE)  # No match
>>> re.search('^X', 'A\nB\nX', re.MULTILINE)  # Match
<_sre.SRE_Match object; span=(4, 5), match='X'>

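Making a Phonebook

split() splits a string into a list delimited by the passed pattern. The method is invaluable for converting textual data into data structures that can be easily read and modified by Python, as demonstrated in the following example that creates a phonebook.

First, here is the input. Normally it may come from a file; here we are using triple-quoted string syntax: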

>>> text = """Ross McFluff: 834.345.1254 155 Elm Street
...
... Ronald Heathmore: 892.345.3428 436 Finley Avenue
... Frank Burger: 925.541.7625 662 South Dogwood Way
...
...
... Heather Albrecht: 548.326.4584 919 Park Place"""

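The entries are separated by one or more newlines. Now we convert the string into a list with each nonempty line having its own entry: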

>>> entries = re.split("\n+", text)
>>> entries
['Ross McFluff: 834.345.1254 155 Elm Street',
'Ronald Heathmore: 892.345.3428 436 Finley Avenue',
'Frank Burger: 925.541.7625 662 South Dogwood Way',
'Heather Albrecht: 548.326.4584 919 Park Place']

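Finally, split each entry into a list with first name, last name, telephone number, and address. We use the maxsplit parameter of split() because the address has spaces, our splitting pattern, in it: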

>>> [re.split(":? ", entry, 3) for entry in entries]
[['Ross', 'McFluff', '834.345.1254', '155 Elm Street'],
['Ronald', 'Heathmore', '892.345.3428', '436 Finley Avenue'],
['Frank', 'Burger', '925.541.7625', '662 South Dogwood Way'],
['Heather', 'Albrecht', '548.326.4584', '919 Park Place']]

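The :? pattern matches the colon after the last name, so that it does not occur in the result list. With a maxsplit of 4, we could separate the house number from the street name: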

>>> [re.split(":? ", entry, 4) for entry in entries]
[['Ross', 'McFluff', '834.345.1254', '155', 'Elm Street'],
['Ronald', 'Heathmore', '892.345.3428', '436', 'Finley Avenue'],
['Frank', 'Burger', '925.541.7625', '662', 'South Dogwood Way'],
['Heather', 'Albrecht', '548.326.4584', '919', 'Park Place']]

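Text Munging

sub() replaces every occurrence of a pattern with a string or the result of a function. This example demonstrates using sub() with a function to "munge" text, or randomize the order of all the characters in each word of a sentence except for the first and last characters:

>>> import random
>>> def repl(m):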
...     inner_word = list(m.group(2))
...     random.shuffle(inner_word)
...     return m.group(1) + "".join(inner_word) + m.group(3)
>>> text = "Professor Abdolmalek, please report your absences promptly."
>>> re.sub(r"(\w)(\w+)(\w)", repl, text)
'Poefsrosr Aealmlobdk, pslaee reorpt your abnseces plmrptoy.'
>>> re.sub(r"(\w)(\w+)(\w)", repl, text)
'Pofsroser Aodlambelk, plasee reoprt yuor asnebces potlmrpy.'

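Finding all Adverbs

findall() matches all occurrences of a pattern, not just the first one as search() does. For example, if a writer wanted to find all of the adverbs in some text, they might use findall() in the following manner: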

>>> text = "He was carefully disguised but captured quickly by police."
>>> re.findall(r"\w+ly", text)
['carefully', 'quickly']

Finding all Adverbs and their Positions

>>> text = "He was carefully disguised but captured quickly by police."
>>> for m in re.finditer(r"\w+ly", text):
...     print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))
07-16: carefully
40-47: quickly