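Regular expressions:
re module
import re
re.match(pattern, string): matches only at the start of the string; returns a match object, or None if the pattern does not match there
re.search(pattern, string): scans the string from left to right and stops at the first match; returns None if nothing matches
re.findall(pattern, string): scans from left to right and returns a list of all non-overlapping matches (if the pattern contains groups, the list holds the groups instead of the whole match)
re.sub(pattern, repl, string): replaces every match with repl, which may be a string or a function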
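To make the difference between the four functions concrete, here is a minimal sketch that runs the same pattern through each of them. The sample string and variable names are made up for illustration.

```python
import re

text = "id=42, id=7"

# match only succeeds if the pattern matches at position 0.
at_start = re.match(r"\d+", text)        # None: the string starts with "i"

# search scans forward and stops at the first hit.
first = re.search(r"\d+", text)          # match object for "42"

# findall collects every non-overlapping hit as a list of strings.
all_numbers = re.findall(r"\d+", text)   # ['42', '7']

# sub replaces every hit with the replacement string.
masked = re.sub(r"\d+", "N", text)       # 'id=N, id=N'

print(at_start, first.group(), all_numbers, masked)
```

Because match and search return None on failure, check the result before calling group() on it.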
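Basics:
[]: character set, matches one character from the listed set or range
.: any character except a newline
|: alternation (or)
(): a group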
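The four building blocks above can be tried one at a time. This is a small sketch with made-up sample strings, not an exhaustive tour.

```python
import re

# [] : one character from the set
vowels = re.findall(r"[aeiou]", "regex")            # ['e', 'e']

# .  : any single character except a newline
dots = re.findall(r"a.c", "abc a-c a\nc")           # ['abc', 'a-c']

# |  : alternation, either side may match
either = re.findall(r"cat|dog", "cat, dog, cow")    # ['cat', 'dog']

# () : a group, here repeated as a unit by the + that follows it
grouped = re.match(r"(ab)+", "ababab!").group()     # 'ababab'

print(vowels, dots, either, grouped)
```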
Quantifiers:
*: 0 or more times
+: 1 or more times
?: 0 or 1 time
{m}: exactly m times
{m,}: at least m times
{m,n}: from m to n times, inclusive
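One way to see what each quantifier accepts is to test it against a fixed set of strings. The helper name `matching` and the sample list are hypothetical, chosen only for this sketch.

```python
import re

samples = ["ac", "abc", "abbc", "abbbc"]

def matching(pattern):
    """Return the samples that the pattern matches in full."""
    return [s for s in samples if re.fullmatch(pattern, s)]

star = matching(r"ab*c")            # ['ac', 'abc', 'abbc', 'abbbc']
plus = matching(r"ab+c")            # ['abc', 'abbc', 'abbbc']
optional = matching(r"ab?c")        # ['ac', 'abc']
exactly_two = matching(r"ab{2}c")   # ['abbc']
two_or_more = matching(r"ab{2,}c")  # ['abbc', 'abbbc']
one_to_two = matching(r"ab{1,2}c")  # ['abc', 'abbc']

print(star, plus, optional, exactly_two, two_or_more, one_to_two)
```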
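Predefined character classes:
\s whitespace
\S non-whitespace
\d digit
\D non-digit
\w word character, [0-9a-zA-Z_] (for str patterns in Python 3 it also matches Unicode letters and digits unless re.ASCII is set)
\W non-word character, [^0-9a-zA-Z_]
\b word boundary
\B not a word boundary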
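The boundary assertions \b and \B are the least obvious entries in the list above, so the sketch below contrasts them with the plain character classes. The sample text is invented for illustration.

```python
import re

text = "cat concat cats 42"

digits = re.findall(r"\d+", text)          # ['42']
words = re.findall(r"\w+", text)           # ['cat', 'concat', 'cats', '42']
spaces = re.findall(r"\s", "a b\tc")       # [' ', '\t']

# \b requires a word boundary on both sides, so only the standalone word matches.
whole_cat = re.findall(r"\bcat\b", text)   # ['cat']

# \B requires that there is NO boundary, so only the "cat" inside "concat" matches.
inner_starts = [m.start() for m in re.finditer(r"\Bcat", text)]   # [7]

# With re.ASCII, \w is limited to [0-9a-zA-Z_].
ascii_words = re.findall(r"\w+", "名字 name", flags=re.ASCII)     # ['name']

print(digits, words, spaces, whole_cat, inner_starts, ascii_words)
```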
Groups:
() ----> group(1)
By number:
(\w+)(\d) ----> group(1) group(2)
Back-references:
(\w+)(\d) \1 \2: \1 and \2 match the same text that groups 1 and 2 captured earlier
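A short sketch of numbered groups and a numbered back-reference follows. The input strings are assumptions made for the example. Note how the greedy \w+ gives back exactly one character so that (\d) can still match.

```python
import re

m = re.match(r"(\w+)(\d)", "abc123")
# \w+ is greedy but must leave one digit for (\d).
first, second = m.group(1), m.group(2)   # 'abc12', '3'
both = m.groups()                        # ('abc12', '3')

# \1 must repeat whatever group 1 captured, so this finds a doubled digit.
doubled = re.search(r"(\d)\1", "12334").group()   # '33'

print(first, second, both, doubled)
```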
By name:
(?P<name>\w+) (?P=name): (?P<name>...) defines a named group, and (?P=name) matches the same text again
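Named groups can be read back by name as well as by number. The following sketch, using an invented sentence and the hypothetical group name `word`, finds an accidentally repeated word.

```python
import re

# (?P<word>...) captures a word; (?P=word) requires the same word to follow.
pattern = r"\b(?P<word>\w+) (?P=word)\b"
m = re.search(pattern, "it was was late")

repeated = m.group()        # 'was was'
word = m.group("word")      # 'was'
by_number = m.group(1)      # 'was', named groups are still numbered
named = m.groupdict()       # {'word': 'was'}

print(repeated, word, by_number, named)
```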
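Greedy matching:
Quantifiers in Python are greedy by default (a few languages may default to non-greedy): they always try to match as many characters as possible;
non-greedy quantifiers do the opposite and always try to match as few characters as possible.
Append ? to "*", "?", "+", or "{m,n}" to turn a greedy quantifier into a non-greedy one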
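The effect is easiest to see when a delimiter occurs more than once. In this sketch (sample markup invented for illustration) the greedy version runs to the last closing tag, while the non-greedy version stops at the first.

```python
import re

html = "<b>one</b><b>two</b>"

# Greedy: .* grabs as much as possible, so it spans both elements.
greedy = re.findall(r"<b>(.*)</b>", html)    # ['one</b><b>two']

# Non-greedy: .*? stops at the first closing tag it can reach.
lazy = re.findall(r"<b>(.*?)</b>", html)     # ['one', 'two']

print(greedy, lazy)
```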
# Uppercase letters: [A-Z]
msg = 'FKRITOFLSDKFWWPGVL'
result = re.match(r'[A-Z]+', msg)
print(result)
# Lowercase letters: [a-z]
msg = 'sdfwsdfsfsf'
result = re.match(r'[a-z]+', msg)
print(result)
# Digits: [0-9] or \d
msg = '334322341098'
result = re.match(r'\d+', msg)
print(result)
# Landline number with area code: the number has 5 to 11 digits and must not start with 0
msg = '020-43948574'
result = re.match(r'(\d{3}|\d{4})-([1-9]\d{4,10})', msg)
print(result)
area_num = result.group(1)
phone_num = result.group(2)
print('Area code: {}, phone: {}'.format(area_num, phone_num))
# Mobile number: 11 digits, starts with 1, second digit is 3, 5, 7, or 8
msg = '18665028070'
result = re.match(r'1[3578]\d{9}$', msg)
print(result)
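One caveat when validating whole strings with match and a trailing $: in Python, $ also matches just before a newline at the end of the string. A possible stricter alternative is re.fullmatch, sketched here with an assumed input that carries a trailing newline.

```python
import re

number_with_newline = "18665028070\n"

# "$" also matches just before a trailing newline, so this still succeeds.
loose = re.match(r"1[3578]\d{9}$", number_with_newline)

# fullmatch requires the pattern to consume the entire string.
strict = re.fullmatch(r"1[3578]\d{9}", number_with_newline)

print(loose is not None, strict is None)   # True True
```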
# Email: qq, 126, 163, or 139 domains, e.g. 4lkjl2lj234l@qq.com
msg = '4223lsds2l_42@139.cn'
result = re.match(r'\w{5,15}@(qq|126|163|139)\.(com|cn)', msg)
print(result)
# HTML tags
# Named groups: (?P<name>...) defines a group, (?P=name) refers back to it
msg = '<html><div><a>Just Baidu it</a></div></html>'
result = re.match(r'(<(?P<tag1>[0-9a-zA-Z]+)>(.*)</(?P=tag1)>)', msg)
print(result)
print(result.group())
print('0---', result.group(0))
print('1---', result.group(1))
print('2---', result.group(2))
print('3---', result.group(3))
# sub: add 1 to every score
msg = '001:91,002:99,003:95'
def func(m):
    # m is the match object for one ":score" chunk
    chunk = m.group(1)
    score = m.group(2)
    new_score = int(score) + 1
    return chunk.replace(score, str(new_score))
result = re.sub(r'(:(\d+),?)', func, msg)
print(result)
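The same replacement can be written more compactly. This is an alternative sketch, not the approach used above: a lambda stands in for the named function, and a second call shows \1 and \2 used inside a replacement string.

```python
import re

msg = "001:91,002:99,003:95"

# A lambda receives each match object and returns its replacement text.
bumped = re.sub(r":(\d+)", lambda m: ":" + str(int(m.group(1)) + 1), msg)

# \1 and \2 in the replacement string refer to the captured groups.
swapped = re.sub(r"(\d+):(\d+)", r"\2:\1", msg)

print(bumped)    # 001:92,002:100,003:96
print(swapped)   # 91:001,99:002,95:003
```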
# split: break the string on ":" or ","
msg = '001:91,002:99,003:95'
result = re.split(r'[:,]', msg)
print(result)
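re.split has two options worth knowing: maxsplit limits the number of cuts, and a capturing group in the pattern keeps the separators in the result. A brief sketch on shortened sample data:

```python
import re

msg = "001:91,002:99,003:95"

# maxsplit=1 cuts only at the first separator.
head_and_rest = re.split(r"[:,]", msg, maxsplit=1)   # ['001', '91,002:99,003:95']

# A capturing group makes split return the separators too.
with_separators = re.split(r"([:,])", "001:91,002")  # ['001', ':', '91', ',', '002']

print(head_and_rest, with_separators)
```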
# Greedy vs. non-greedy
msg = 'abc1234abc'
result = re.match(r'abc(\d+)', msg)  # greedy: group(1) is '1234'
result2 = re.match(r'abc(\d+?)', msg)  # non-greedy: group(1) is '1'
print(result)
print(result2)