Apply regexp to urlopen request

【Question title】: Apply regexp to urlopen request
【Posted】: 2021-04-01 03:17:49
【Question】:

I am trying to apply a regex filter to the page returned by urlopen(req):

from urllib.request import urlopen, Request
import re
from contextlib import closing

req = Request('https://yts-subs.com/movie-imdb/tt1483013')
req.add_header('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36')
webpage = urlopen(req)
encoding = webpage.headers.get_content_charset('charset')

# page = str(webpage.read(), encoding)
page = webpage.read().decode('utf-8')

pattern = re.compile(r'<tr data-id=".*?"(?: class="((?:high|low)-rating)")?>\s*<td class="rating-cell">\s*.*</span>\n\s*</td>\n\s*<td class.*\n\s*<span.*>.*</span>\n\s*<span class="sub-lang">(.*)?</span>\n\s*</td>\n\s*<td>\n\s*<a href="(.*)?">'
                     ,re.UNICODE)
print(pattern.findall(page))

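But for some reason it does not match anything. The pattern should be fine, since I tested it separately, and the page is read successfully. Suspecting an encoding problem, I tried str() and decode(), without much success. What puzzles me is that if I write the page to an intermediate file and read it back, it works...

Adding this before the pattern makes it work: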

with open('temp.data', 'w') as data:
  data.write(page)
page = ''
with open('temp.data','r') as data:
  page=''.join(data.readlines())

Clearly I am doing something wrong; any help would be appreciated!
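
One possible reason the file round trip changes the outcome: when a file is opened in text mode with the default `newline=None`, Python translates `\r\n` to `\n` on read. The sketch below assumes the served page uses CRLF line endings (the thread does not confirm this), and the `page` string is a hypothetical stand-in for the decoded response.

```python
import os
import tempfile

# Hypothetical stand-in for the decoded page; CRLF endings are an assumption.
page = '<td>\r\n  <span class="sub-lang">English</span>\r\n</td>'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'temp.data')
    # Same round trip as in the question: text-mode write, text-mode read.
    with open(path, 'w') as data:
        data.write(page)
    with open(path, 'r') as data:
        round_tripped = data.read()

# Universal-newlines mode on read turns every '\r\n' (and lone '\r') into '\n'.
print(repr(page))
print(repr(round_tripped))
```

If that assumption holds, the string read back from the file is not the same string that came from `webpage.read().decode('utf-8')`, which would explain why a pattern containing literal `\n` behaves differently on each.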

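【Comments】:

You may need to add the multiline flag, see docs.python.org/3/library/re.html#re.MULTILINE
See stackoverflow.com/a/1732454/4046632
Thank you both for your replies, but as I said, the regex works as expected when the string it is applied to is read from a file. I tried the re.MULTILINE flag, but it made little difference.
Try an HTML parser such as BeautifulSoup instead of a regex...
Thanks @JosefZ, I'll give it a try!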
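
For reference, the parser suggestion could look roughly like the sketch below. It uses the standard library's `html.parser` rather than BeautifulSoup so that it runs without extra packages. The `SubtitleRowParser` class and the `sample` markup are hypothetical: the markup is only guessed from the regex in the question, not taken from the real page.

```python
from html.parser import HTMLParser


class SubtitleRowParser(HTMLParser):
    """Collect (language, href) pairs, one per table row (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_lang = False   # True while inside <span class="sub-lang">
        self._lang = None       # language text seen in the current row

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'tr':
            self._lang = None
        elif tag == 'span' and attrs.get('class') == 'sub-lang':
            self._in_lang = True
        elif tag == 'a' and self._lang is not None and 'href' in attrs:
            # First link after the language cell is taken as the subtitle link.
            self.rows.append((self._lang, attrs['href']))
            self._lang = None

    def handle_endtag(self, tag):
        if tag == 'span':
            self._in_lang = False

    def handle_data(self, data):
        if self._in_lang:
            self._lang = data.strip()


# Assumed markup, reconstructed from the pattern in the question.
sample = (
    '<tr data-id="1" class="high-rating">\r\n'
    '  <td class="rating-cell"><span>5</span>\r\n  </td>\r\n'
    '  <td class="flag-cell">\r\n'
    '    <span class="flag"></span>\r\n'
    '    <span class="sub-lang">English</span>\r\n'
    '  </td>\r\n'
    '  <td>\r\n'
    '    <a href="/subtitles/example">Example</a>\r\n'
    '  </td>\r\n'
    '</tr>'
)

parser = SubtitleRowParser()
parser.feed(sample)
parser.close()
print(parser.rows)
```

Because the parser works on tags rather than on the exact characters between them, line endings and indentation in the page do not affect the result.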

Tags: python encoding re urlopen python-3.9

【Solution 1】:

OK, it turns out my regex pattern was the problem. Rewriting it more precisely made it work. Here is the working pattern:

pattern = re.compile(r'<tr data-id=".*?"(?: class="((?:high|low)-rating)")?>\s*<td class="rating-cell">\s*.*</span>\s*</td>\s*<td class.*\s*<span.*>.*</span>\s*<span class="sub-lang">(.*)?</span>\s*</td>\s*<td>\s*<a href="([^">]*)?')

Compared with the wrong one:

pattern = re.compile(r'<tr data-id=".*?"(?: class="((?:high|low)-rating)")?>\s*<td class="rating-cell">\s*.*</span>\n\s*</td>\n\s*<td class.*\n\s*<span.*>.*</span>\n\s*<span class="sub-lang">(.*)?</span>\n\s*</td>\n\s*<td>\n\s*<a href="(.*)?">'
                     ,re.UNICODE)
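
The difference between the two patterns can be reproduced on a small string. The fragment below is hypothetical and assumes the page uses CRLF line endings, which the thread does not confirm; the `strict` and `tolerant` patterns are shortened versions of the tails of the two patterns above.

```python
import re

# Hypothetical fragment; CRLF line endings are an assumption about the page.
crlf_page = (
    '<span class="sub-lang">English</span>\r\n'
    '  </td>\r\n'
    '  <td>\r\n'
    '    <a href="/subtitles/example">'
)
lf_page = crlf_page.replace('\r\n', '\n')

# Tail of the failing pattern: requires '\n' directly after each closing tag.
strict = re.compile(
    r'<span class="sub-lang">(.*)?</span>\n\s*</td>\n\s*<td>\n\s*<a href="(.*)?">'
)
# Tail of the working pattern: '\s*' absorbs '\r', '\n' and indentation alike.
tolerant = re.compile(
    r'<span class="sub-lang">(.*)?</span>\s*</td>\s*<td>\s*<a href="([^">]*)?'
)

strict_on_crlf = strict.findall(crlf_page)
strict_on_lf = strict.findall(lf_page)
tolerant_on_crlf = tolerant.findall(crlf_page)

print(strict_on_crlf)
print(strict_on_lf)
print(tolerant_on_crlf)
```

Under that assumption, the literal `\n` fails because a `\r` sits between `</span>` and the newline, while `\s*` matches any run of whitespace and so works on both forms.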

Thanks for your help, I will look into the alternatives that were suggested!
