Python scraping too slow on YouTube title from URL - html-render

【Question title】: Python scraping too slow on YouTube title from URL - html-render
【Posted】: 2021-08-23 12:45:45
【Question】:

Hi, I have Excel files containing lists of YouTube URLs and I'm trying to get their titles. It's a full list of 1000 URLs with 3 Excel files, so I tried Python, but it is too slow because I had to add a sleep to the HTML render call. The code looks like this:

import xlrd
from bs4 import BeautifulSoup
from xlutils.copy import copy
from requests_html import HTMLSession

loc = "testt.xls"

# Open the workbook for reading and make a writable copy of it.
wb = xlrd.open_workbook(loc)
sheet = wb.sheet_by_index(0)
wb2 = copy(wb)

# The URLs are read from the first column, starting at row index 3.
for i in range(3, sheet.nrows):
    ytlink = sheet.cell_value(i, 0)
    # A new session is created and the page is rendered for every single row.
    session = HTMLSession()
    response = session.get(ytlink)
    response.html.render(sleep=3)
    print(ytlink)
    # Parse the rendered HTML and read the title heading.
    element = BeautifulSoup(response.html.html, "lxml")
    media = element.select_one('#container > h1').text
    print(media)
    # Overwrite the URL cell with the title and save after each row.
    s2 = wb2.get_sheet(0)
    s2.write(i, 0, media)
    wb2.save("testt.xls")

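Is there any way to make this faster? I tried Selenium, but I think it is even slower. With html.render I seem to need the sleep timer, otherwise it gives me an error. I tried lower sleep values, but with those it starts failing after a while. Any help is appreciated, thanks :)

PS: the prints are only there to check the output; they are not important.

【Comments】:

Tags: python python-3.x beautifulsoup python-requests python-requests-html

【Answer 1】:

With your current approach (and with Selenium) you are rendering the actual web page, which you don't need to do. I recommend using a Python library that can handle this for you. Here is an example with YoutubeDL: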

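# Assumption: YoutubeDL comes from the youtube_dl package (yt_dlp exposes a class with the same name).
from youtube_dl import YoutubeDL

with YoutubeDL() as ydl:
    # download=False fetches only the metadata, not the video itself.
    title = ydl.extract_info("https://www.youtube.com/watch?v=jNQXAC9IVRw", download=False).get("title", None)
    print(title)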

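Note that executing 1000 such requests will still be slow under the rate limits YouTube imposes. If you plan on making possibly thousands of requests in the future, I'd recommend looking into getting an API key.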
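To give a rough idea of what the API-key route could look like, here is a minimal sketch that only builds the request URLs for the YouTube Data API v3 `videos` endpoint; it makes no network calls. The helper names `video_id` and `batched_request_urls`, the `"KEY"` placeholder, and the cap of 50 IDs per request are assumptions of this sketch, so check the cap against the API documentation.

```python
from urllib.parse import urlparse, parse_qs, urlencode

API_ENDPOINT = "https://www.googleapis.com/youtube/v3/videos"
MAX_IDS_PER_REQUEST = 50  # assumed cap on the `id` parameter; verify in the API docs


def video_id(url):
    """Return the video ID from a watch or youtu.be URL, or None if absent."""
    parsed = urlparse(url)
    if parsed.netloc.endswith("youtu.be"):
        return parsed.path.lstrip("/") or None
    ids = parse_qs(parsed.query).get("v")
    return ids[0] if ids else None


def batched_request_urls(urls, api_key):
    """Yield one request URL per chunk of up to MAX_IDS_PER_REQUEST video IDs."""
    ids = [vid for vid in map(video_id, urls) if vid]
    for start in range(0, len(ids), MAX_IDS_PER_REQUEST):
        chunk = ids[start:start + MAX_IDS_PER_REQUEST]
        query = urlencode(
            {"part": "snippet", "id": ",".join(chunk), "key": api_key},
            safe=",",  # keep the commas between IDs readable
        )
        yield f"{API_ENDPOINT}?{query}"
```

Under these assumptions, 1000 URLs collapse into 20 requests. Each response is JSON, and the title of each video should be found under `items[i].snippet.title`.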

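【Comments】:

Oh I see, but I thought it could only be used with an API token key. I guess the rate limit may slow it down, but it should still be faster than the previous approach, right? I'm fine with it being a bit slow as long as it doesn't time out or hit something halfway through the Excel file that makes me start over. It seems there's no other way to go without the API or render. Thanks for your help, I'll try it :)

【Answer 2】:

You can do 1000 requests in under a minute with async requests-html, like this: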

import random
from time import perf_counter
from requests_html import AsyncHTMLSession

urls = ['https://www.youtube.com/watch?v=z9eoubnO-pE'] * 1000

asession = AsyncHTMLSession()
start = perf_counter()

async def fetch(url):
    r = await asession.get(url, cookies={'CONSENT': 'YES+cb.20210328-17-p0.en-GB+FX+{}'.format(random.randint(100, 999))})
    return r

all_responses = asession.run(*[lambda url=url: fetch(url) for url in urls])
all_titles = [r.html.find('title', first=True).text for r in all_responses]

print(all_titles)
print(perf_counter() - start)

This finished in 55 seconds on my laptop.

Note that you need to pass cookies={'CONSENT': 'YES+cb.20210328-17-p0.en-GB+FX+{}'.format(random.randint(100, 999))} to the request to avoid this issue.
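The code above starts all 1000 requests at once. If that triggers rate limiting or timeouts, one option is to cap how many requests are in flight. The sketch below uses plain `asyncio` with a stand-in coroutine in place of the real request; `fetch_title`, `fetch_all`, `run_demo`, and the limit of 20 are hypothetical names and values chosen for illustration, not part of requests-html.

```python
import asyncio


async def fetch_title(url):
    """Stand-in for a real request; swap in the `asession.get` call here."""
    await asyncio.sleep(0.01)  # simulates network latency
    return f"title for {url}"


async def fetch_all(urls, limit=20):
    """Fetch every URL with at most `limit` requests running concurrently."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(url):
        async with semaphore:
            return await fetch_title(url)

    # gather returns results in the same order as the input URLs.
    return await asyncio.gather(*(bounded(u) for u in urls))


def run_demo():
    urls = [f"https://www.youtube.com/watch?v=video{i}" for i in range(100)]
    return asyncio.run(fetch_all(urls))
```

`asyncio.gather` returns results in input order, which is convenient when each title has to go back into the row its URL came from.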

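【Comments】:

Tried it, and it works faster than before inside a for loop that takes each URL from the Excel rows of that column and replaces it with the title of that YouTube URL, thanks :) The code is below, for fellow newbies like me: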