html-requests: skip if TimeoutError when rendering HTML

【Question title】: html-requests, skip if TimeoutError when rendering HTML
【Posted】: 2021-07-16 03:02:02
【Question】:

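I am writing a web scraping script using requests-html. I scrape URLs, then loop over them and commit the results to a database. I have been able to scrape the links and have written a for loop that renders each page and then scrapes specific product information. This works for most links, but for some the page does not render and I get a pyppeteer.errors.TimeoutError. I am fine with skipping a few links, since most of the site's information still gets scraped. I tried using try, as shown below: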

    session = HTMLSession()
    for link in productlinks2:
        r = session.get(link)
        try:
            r.html.render(sleep=3, timeout=30)
        except TimeoutError:
            pass

But this still produces:

pyppeteer.errors.TimeoutError: Navigation Timeout Exceeded: 30000 ms exceeded.

Is it possible to skip the links that do not render in time? Any help would be appreciated.

【Comments】:

Shouldn't the .get call be inside the try as well?

Tags: python html web-scraping rendering python-requests-html

【Solution 1】:

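Did you import the error? Without `from pyppeteer.errors import TimeoutError`, the name `TimeoutError` in your `except` clause refers to Python's built-in exception, which is not necessarily the class that pyppeteer raises, so the clause may not match.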
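To illustrate why the import matters, here is a small stdlib-only sketch. It assumes that `pyppeteer.errors.TimeoutError` is a subclass of `asyncio.TimeoutError`; `FakeRenderTimeout` and `caught_by` are hypothetical stand-ins written for this example, not part of pyppeteer or requests-html.

```python
import asyncio


# Stand-in for pyppeteer.errors.TimeoutError (assumption: the real class
# subclasses asyncio.TimeoutError).
class FakeRenderTimeout(asyncio.TimeoutError):
    pass


def caught_by(exc_type):
    """Return True if `except exc_type` catches a FakeRenderTimeout."""
    try:
        raise FakeRenderTimeout("Navigation Timeout Exceeded: 30000 ms exceeded.")
    except exc_type:
        return True
    except Exception:
        return False


print("built-in TimeoutError catches it:", caught_by(TimeoutError))
print("imported class catches it:", caught_by(FakeRenderTimeout))
```

The first result depends on the Python version: before 3.11, `asyncio.TimeoutError` is a separate class from the built-in `TimeoutError`, while from 3.11 onward it is an alias of it. Importing the exception class explicitly avoids relying on that detail.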

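You also need to set a timeout for your session.get() call.

It depends on which error you hit: with a bad URL, for example, the error comes from session.get(), before the page is ever rendered. The example below shows the different errors you can catch: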

    from requests_html import HTMLSession
    from requests.exceptions import ConnectionError, InvalidSchema, ReadTimeout
    from pyppeteer.errors import TimeoutError

    session = HTMLSession()

    links = [
        'https://www.google.com/',
        'h**ps://www.google.com/',
        'https://deelay.me/4000/https://www.google.com/',  # 4s of delay to get the page
        'https://www.baaaadurl.com/',
        'https://www.youtube.com/',
        'https://www.google.com/',
    ]

    for url in links:
        try:
            r = session.get(url, timeout=3)
            r.html.render(timeout=1)  # timeout long enough to render google but not youtube
            print(r.html.find('title', first=True).text, '\n')
        except InvalidSchema as e:
            # error for 'h**ps://www.google.com/'
            print(f'For the url "{url}" the error is: {e} \n')
        except ReadTimeout as e:
            # error due to too much delay for
            # 'https://deelay.me/4000/https://www.google.com/'
            print(f'For the url "{url}" the error is: {e} \n')
        except ConnectionError as e:
            # error for 'https://www.baaaadurl.com/'
            print(f'For the url "{url}" the error is: {e} \n')
        except TimeoutError as e:
            # error if rendering the page times out,
            # as for 'https://www.youtube.com/'
            print(f'For the url "{url}" the error is: {e} \n')

Printed output:

Google 

For the url "h**ps://www.google.com/" the error is: No connection adapters were found for 'h**ps://www.google.com/' 

For the url "https://deelay.me/4000/https://www.google.com/" the error is: HTTPSConnectionPool(host='deelay.me', port=443): Read timed out. (read timeout=3) 

For the url "https://www.baaaadurl.com/" the error is: HTTPSConnectionPool(host='www.baaaadurl.com', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f2596ba6460>: Failed to establish a new connection: [Errno -2] Name or service not known')) 

For the url "https://www.youtube.com/" the error is: Navigation Timeout Exceeded: 1000 ms exceeded. 

Google

This way you can catch the errors and carry on with the loop.
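If you also want to keep track of which links were skipped, the same try/except pattern could be wrapped in a small helper. This is only a sketch under stated assumptions: `skip_failures`, `process`, and `run_with_requests_html` are hypothetical names introduced here, and the timeout values are taken from the question and the example above.

```python
def skip_failures(links, process, skip_on=(Exception,)):
    """Call process(link) for each link, skipping links that raise skip_on.

    Returns (results, skipped): a dict of link -> result, and a list of
    (link, error message) pairs for the links that were skipped.
    """
    results = {}
    skipped = []
    for link in links:
        try:
            results[link] = process(link)
        except skip_on as exc:
            # Record the failure and move on to the next link.
            skipped.append((link, str(exc)))
    return results, skipped


def run_with_requests_html(links):
    """Hypothetical wiring for requests-html; imports are local so that
    skip_failures itself can be used without these packages installed."""
    from requests_html import HTMLSession
    from requests.exceptions import RequestException
    from pyppeteer.errors import TimeoutError as RenderTimeout

    session = HTMLSession()

    def render_title(link):
        r = session.get(link, timeout=3)        # may raise a RequestException
        r.html.render(sleep=3, timeout=30)      # may raise RenderTimeout
        title = r.html.find('title', first=True)
        return title.text if title is not None else None

    return skip_failures(links, render_title,
                         skip_on=(RequestException, RenderTimeout))
```

Calling `results, skipped = run_with_requests_html(productlinks2)` would then give you the rendered titles together with a list of the links that failed, which you could retry later or log. Exceptions that are not listed in `skip_on` still propagate, so unexpected bugs are not silently hidden.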

【Comments】: