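How can I use regex to help scrape web data? Answers

【Question title】: How can I use regex to help scrape web data?
【Posted】: 2021-07-27 00:22:30
【Question description】:

I am trying to get the URLs on a single YouTube video page. youtube-dl can do this, but I only need the URLs, so I want to learn how to do it myself.

This is my code for getting the page source: source = requests.get("https://www.youtube.com/watch?v=zXif_9RVadI")

The line I am searching is line 21 of the source: source_line_21 = source.text.split("\n")[20]

I want all URLs that start with https://r[0-9], include googlevideo.com/videoplayback, and end with ",".

I have tried a lot of code, but I always get 0 or 1 matches, even though there are 15-20 matches in the line.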

re.match(r'https:\/\/.*googlevideo.com/videoplayback.*mimeType', source_line_21)

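I am not good at regex and I struggle to learn it. Thank you all.

The text I am searching is the output of print(source_line_21[:32600]). It is too long to include here, so I pasted it externally.

【Comments】:

Please include a minimal reproducible example. Extract the snippet of raw HTML that contains the links. Hard-code it as a variable. Then use that variable instead of source.text and create an example that people can copy into their environment and run to reproduce your problem.
re.match() only looks for the pattern at the start of the string. Use re.search().
@PranavHosangadi source.text is too long and I don't want to paste it here. If people want to try it in their environment, I think using my code is the better and faster way. I tried re.search(), but the result is the same: 0 matches. Thanks for your comment.
"source.text is too long" I know, which is why I asked for the snippet that is relevant here.
I think you want something like regex101.com/r/I4qd8t/2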
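To illustrate the comment above about re.match() versus re.search(), here is a minimal sketch. The sample string and the pattern are made up for illustration and are not taken from the real page source. It also shows re.findall(), which is the call that returns every match rather than only the first one.

```python
import re

# Hypothetical stand-in for one long line of page source (not real YouTube data).
sample = (
    '{"url":"https://r1---sn-abc.googlevideo.com/videoplayback?id=1","mimeType":"video/mp4"},'
    '{"url":"https://r2---sn-abc.googlevideo.com/videoplayback?id=2","mimeType":"audio/mp4"}'
)

# [^"]* stops at the closing quote, so each URL is matched separately.
pattern = r'https://r[0-9][^"]*googlevideo\.com/videoplayback[^"]*'

# re.match only tries position 0; the sample starts with '{', so it fails.
at_start = re.match(pattern, sample)

# re.search scans the whole string but reports only the first hit.
first_hit = re.search(pattern, sample)

# re.findall returns every non-overlapping hit as a list of strings.
all_hits = re.findall(pattern, sample)

print(at_start)
print(first_hit.group(0))
print(all_hits)
```

A greedy .* in place of [^"]* would run to the last possible position on the line, which is one likely reason a long single-line source yields at most one match.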

Tags: python python-3.x regex re

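【Solution 1】:

What you are trying to do is slightly more involved, but it can be simplified by using the few tools listed below.

I used urllib in the example because my requests call returned Google's "Before you continue to YouTube" cookie consent page, whereas urllib let me bypass that junk.

Tools:

urllib (or) requests
BeautifulSoup - via the bs4 library
Regex - via the re library
JSON - via the json library

Logic:

Fetch the site data
Parse the HTML with BeautifulSoup
Extract the tags of interest
Iterate through the tags and use regex to find the JavaScript variable of interest
Walk the variable's content (as JSON) to get the URLs

Code: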

import json
import re
import urllib.request

from bs4 import BeautifulSoup as bs

# Using urllib to read site content.
source = urllib.request.urlopen("https://www.youtube.com/watch?v=zXif_9RVadI").read().decode()
# Parse HTML using BeautifulSoup.
soup = bs(source, features='html.parser')
# Extract all <script> tags.
scripts = soup.find_all('script')
# Build regex pattern to extract the <script> tag's content.
exp = re.compile(r'^var\sytInitialPlayerResponse\s=\s(?P<content>.*\})')

# Iterate through all scripts to find the one with video content.
data = None
for s in scripts:
    if s.string:
        m = exp.match(s.string)
        if m:
            data = m.group('content')
            break

# Fail with a clear message if the variable was not found on the page.
if data is None:
    raise ValueError('ytInitialPlayerResponse not found in any <script> tag')

# Parse the extracted JavaScript object literal as JSON.
content = json.loads(data)

# Collect all URIs into a list.
urls = []
for fmt in ['formats', 'adaptiveFormats']:
    for ele in content['streamingData'][fmt]:
        urls.append(ele['url'])

Confirm the URIs:

# Print the detected URIs:
for i, url in enumerate(urls, 1):
    print(i, url[:75])

1 https://r2---sn-8pgbpohxqp5-cimd.googlevideo.com/videoplayback?expire=16202
2 https://r2---sn-8pgbpohxqp5-cimd.googlevideo.com/videoplayback?expire=16202
3 https://r2---sn-8pgbpohxqp5-cimd.googlevideo.com/videoplayback?expire=16202
4 https://r2---sn-8pgbpohxqp5-cimd.googlevideo.com/videoplayback?expire=16202
5 https://r2---sn-8pgbpohxqp5-cimd.googlevideo.com/videoplayback?expire=16202
6 https://r2---sn-8pgbpohxqp5-cimd.googlevideo.com/videoplayback?expire=16202
7 https://r2---sn-8pgbpohxqp5-cimd.googlevideo.com/videoplayback?expire=16202
8 https://r2---sn-8pgbpohxqp5-cimd.googlevideo.com/videoplayback?expire=16202
9 https://r2---sn-8pgbpohxqp5-cimd.googlevideo.com/videoplayback?expire=16202
10 https://r2---sn-8pgbpohxqp5-cimd.googlevideo.com/videoplayback?expire=16202
11 https://r2---sn-8pgbpohxqp5-cimd.googlevideo.com/videoplayback?expire=16202
12 https://r2---sn-8pgbpohxqp5-cimd.googlevideo.com/videoplayback?expire=16202
13 https://r2---sn-8pgbpohxqp5-cimd.googlevideo.com/videoplayback?expire=16202
14 https://r2---sn-8pgbpohxqp5-cimd.googlevideo.com/videoplayback?expire=16202
15 https://r2---sn-8pgbpohxqp5-cimd.googlevideo.com/videoplayback?expire=16202
16 https://r2---sn-8pgbpohxqp5-cimd.googlevideo.com/videoplayback?expire=16202
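
The regex and JSON steps can also be tried offline, without a network request or BeautifulSoup. The sketch below assumes a heavily shortened, made-up script body; the real ytInitialPlayerResponse value is far larger and its layout may change.

```python
import json
import re

# Hypothetical, shortened <script> body (made-up URLs, not real page data).
script_text = (
    'var ytInitialPlayerResponse = {"streamingData": {'
    '"formats": [{"url": "https://r1---sn-example.googlevideo.com/videoplayback?itag=18"}], '
    '"adaptiveFormats": [{"url": "https://r1---sn-example.googlevideo.com/videoplayback?itag=140"}]'
    '}};'
)

# Same pattern as in the answer: capture everything up to the last '}'.
exp = re.compile(r'^var\sytInitialPlayerResponse\s=\s(?P<content>.*\})')
m = exp.match(script_text)

# The captured object literal is valid JSON, so json.loads can parse it.
content = json.loads(m.group('content'))

# Collect the URLs, skipping missing lists and entries without a 'url' key.
urls = [
    ele['url']
    for fmt in ('formats', 'adaptiveFormats')
    for ele in content['streamingData'].get(fmt, [])
    if 'url' in ele
]
print(urls)
```

Using .get(fmt, []) and the 'url' check is a defensive choice for this sketch, in case a format list or a direct URL is absent from a given response.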

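【Comments】:

Thank you very much, it works, and I am trying to understand your code :D
This line, for ele in content['streamingData'][fmt]:, gives JSON output, and the data there is much better organized.
My pleasure. Glad to hear you found it helpful.

【Solution 2】:

Use this: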
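
Because the parsed JSON is organized per format, filtering can be done on fields rather than on raw text. The following is a rough sketch only: the sample dictionary is made up, and the presence of a mimeType key on each entry is an assumption based on the mimeType text seen in the page source.

```python
# Hypothetical data shaped like the parsed JSON; all values are made up.
content = {
    "streamingData": {
        "adaptiveFormats": [
            {"url": "https://r1---sn-example.googlevideo.com/videoplayback?itag=137",
             "mimeType": "video/mp4"},
            {"url": "https://r1---sn-example.googlevideo.com/videoplayback?itag=140",
             "mimeType": "audio/mp4"},
            {"url": "https://r1---sn-example.googlevideo.com/videoplayback?itag=251",
             "mimeType": "audio/webm"},
        ]
    }
}

# Keep only entries whose (assumed) mimeType field starts with "audio/".
audio_urls = [
    ele["url"]
    for ele in content["streamingData"]["adaptiveFormats"]
    if ele.get("mimeType", "").startswith("audio/") and "url" in ele
]
print(audio_urls)
```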

re.match(r'https:\/\/r[0-9][\w\-@%]*googlevideo.com/videoplayback","$', source_line_21)

【Comments】:

【Solution 3】:

I found a solution, but it is not really what I wanted.

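import re

import requests
from urlextract import URLExtract

source = requests.get("https://www.youtube.com/watch?v=zXif_9RVadI")

# Line 21 of the page source holds the player data.
source_line_21 = source.text.split("\n")[20]

# Raw string so that \/ and \S reach the regex engine unchanged.
sonuc = re.findall(r'https:\/\/r[0-9].*\SmimeType', source_line_21)

# Pull the individual URLs out of the first match and keep the audio ones.
extractor = URLExtract()
aa = [x for x in extractor.find_urls(sonuc[0]) if "mime=audio" in x]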

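This code gives me all the URLs with mime=audio. I used the URLExtract module, which is external rather than built in, so I am still looking for a better way to solve my problem.

【Comments】:

【Solution 4】:

You can use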

re.findall(r'https://r[0-9][^"]*', text)
re.findall(r'https://r[0-9][^"]*', text, re.I)  # case insensitive

See the regex demo.

Details

https:// - the literal string https:// (to also match http://, add ? after the s: https?://)
r - an r character
[0-9] - one digit
[^"]* - zero or more characters other than the " character.

【Comments】:
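
As a quick check of the pattern parts described above, here is a small sketch run against a made-up string. The sample text and its URLs are hypothetical and only meant to show how the re.I flag and the https? variant change the result.

```python
import re

# Hypothetical text with three quoted URLs (not real page data).
text = (
    '"url":"https://r3---sn-example.googlevideo.com/videoplayback?mime=audio",'
    '"url":"HTTPS://R4---sn-example.googlevideo.com/videoplayback?mime=video",'
    '"url":"http://r5---sn-example.googlevideo.com/videoplayback?mime=video"'
)

# Case-sensitive and https only: matches the first URL.
strict = re.findall(r'https://r[0-9][^"]*', text)

# Case-insensitive: also matches the upper-case HTTPS://R4 URL.
relaxed = re.findall(r'https://r[0-9][^"]*', text, re.I)

# Optional "s": matches the https and the plain http URL.
either = re.findall(r'https?://r[0-9][^"]*', text)

print(len(strict), len(relaxed), len(either))
```

Because [^"]* stops at the first double quote, each match ends exactly where the quoted URL ends, so no extra module is needed to split the URLs apart.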