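This regular expression will do what you want:
r'http://download\d+\.mysite\.com/\w+/\w+/upload\.rar'
\d matches a digit and \w matches an alphanumeric character (including underscore); + means one or more of the preceding pattern. Each literal . (before mysite, com, and rar) is escaped with \ so that it is interpreted literally rather than as the regex wildcard.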
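To see why the escaping matters, here is a small sketch comparing the pattern above with a variant that leaves the dots unescaped. The `bogus` URL and the `unescaped` pattern are made up for illustration; they are not part of the original problem.

```python
import re

escaped = re.compile(r'http://download\d+\.mysite\.com/\w+/\w+/upload\.rar')
# Same pattern with the backslashes dropped: each . now matches any character.
unescaped = re.compile(r'http://download\d+.mysite.com/\w+/\w+/upload.rar')

# Hypothetical string in which every literal dot was replaced by something else.
bogus = 'http://download1xmysite-com/a/b/uploadxrar'

print(escaped.match(bogus) is not None)    # False
print(unescaped.match(bogus) is not None)  # True
```

The unescaped version accepts the bogus string because . stands for any single character, which is rarely what you want in a URL pattern.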
Test
import re

p = re.compile(r'http://download\d+\.mysite\.com/\w+/\w+/upload\.rar')

table = [
    'http://download2142.mysite.com/d0kz4p5p3uog/api60w0g1o1jil1/upload.rar',
    'http://download2142.mysite.com/d0kz4p5p3uog/api60w0g1o1jil1/upload.raw',
    'http://download123.mysite.com/456/789/upload.rar',
    'http://downloadabc.mysite.com/def/ghi/upload.rar',
    'http://download1234.mysite.com/def/ghi/upload.rar',
    'http://download1234.mysite.org/def/ghi/upload.rar',
]

# Print each URL followed by whether the pattern matches it.
for s in table:
    m = p.match(s)
    print(s, m is not None)
Output
http://download2142.mysite.com/d0kz4p5p3uog/api60w0g1o1jil1/upload.rar True
http://download2142.mysite.com/d0kz4p5p3uog/api60w0g1o1jil1/upload.raw False
http://download123.mysite.com/456/789/upload.rar True
http://downloadabc.mysite.com/def/ghi/upload.rar False
http://download1234.mysite.com/def/ghi/upload.rar True
http://download1234.mysite.org/def/ghi/upload.rar False
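One case the table does not exercise: match() anchors only at the start of the string, so trailing text after upload.rar is still accepted. If that matters for your input, a possible tightening is fullmatch() (available in Python 3.4 and later). The URL below is a made-up example.

```python
import re

p = re.compile(r'http://download\d+\.mysite\.com/\w+/\w+/upload\.rar')

# Hypothetical URL with extra text after the part the pattern describes.
s = 'http://download1234.mysite.com/def/ghi/upload.rar.exe'

print(p.match(s) is not None)      # True: match() only anchors at the start
print(p.fullmatch(s) is not None)  # False: fullmatch() needs the whole string to match
```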
If the actual file name varies, you can use
r'http://download\d+\.mysite\.com/\w+/\w+/\w+\.rar'
or
r'http://download\d+\.mysite\.com/\w+/\w+/[a-z]+\.rar'
if the name always consists of lowercase letters.
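A quick sketch of how the two variants differ. The file names and the `base` URL are invented for the comparison.

```python
import re

any_word = re.compile(r'http://download\d+\.mysite\.com/\w+/\w+/\w+\.rar')
lower_only = re.compile(r'http://download\d+\.mysite\.com/\w+/\w+/[a-z]+\.rar')

# Hypothetical base URL and file names, used only to compare the two patterns.
base = 'http://download1234.mysite.com/def/ghi/'
results = {}
for name in ['upload.rar', 'backup.rar', 'Backup_2.rar']:
    url = base + name
    results[name] = (any_word.match(url) is not None,
                     lower_only.match(url) is not None)
    print(name, *results[name])
```

The \w+ version accepts all three names, while the [a-z]+ version rejects Backup_2.rar because of the uppercase letter, the underscore, and the digit.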
By the way, it's generally not a good idea to parse HTML with regex, but if the page format is fixed and fairly simple, you may be able to get away with it.
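If the page turns out to be less regular than hoped, one alternative is to let an HTML parser pull out the links and apply the regex only to the extracted URLs. Below is a minimal sketch using the standard library's html.parser; the `LinkCollector` class and the sample `html` string are assumptions made up for this example, not taken from the actual page.

```python
import re
from html.parser import HTMLParser

p = re.compile(r'http://download\d+\.mysite\.com/\w+/\w+/upload\.rar')

class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value is not None:
                    self.links.append(value)

# Hypothetical page content, invented for illustration.
html = '''
<p><a href="http://download2142.mysite.com/d0kz4p5p3uog/api60w0g1o1jil1/upload.rar">file</a>
<a href="http://www.mysite.com/about.html">about</a></p>
'''

collector = LinkCollector()
collector.feed(html)

# Keep only the links that match the download pattern.
matches = [url for url in collector.links if p.match(url)]
print(matches)
```

This way the regex only has to describe the URL itself, not the markup around it.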