【Question title】: html-requests, skip if TimeoutError when rendering HTML
【Posted】: 2021-07-16 03:02:02
【Question】:

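I'm writing a web-scraping script using requests-html. I collect URLs, then process each one and commit the results to a database. I've been able to scrape the links and wrote a for loop that renders each page and then scrapes specific product information. This works for most links, but for some the page doesn't render and I get a pyppeteer.errors.TimeoutError. I'm fine with skipping a few links, since most of the site's information still gets scraped. I've tried using try, as shown below: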

    session = HTMLSession()
    for link in productlinks2:
        r = session.get(link)
        try:
            r.html.render(sleep=3, timeout=30)
        except TimeoutError:
            pass

But this still produces:

pyppeteer.errors.TimeoutError: Navigation Timeout Exceeded: 30000 ms exceeded.

Is it possible to skip links that fail to render in time? Any help would be appreciated.

【Comments】:

  • Shouldn't the .get be inside the try as well?

Tags: python html web-scraping rendering python-requests-html


【Solution 1】:

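Did you import the error class? The traceback names pyppeteer.errors.TimeoutError, so that is the class the except clause has to reference.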
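
To illustrate why the import matters, here is a minimal sketch that needs no network access. `LibraryTimeoutError` is a hypothetical stand-in for a third-party exception class, and the sketch assumes that class does not inherit from Python's built-in `TimeoutError`. Whether that holds for pyppeteer can depend on your Python and pyppeteer versions, so treat this as an illustration of the mechanism rather than a statement about the library.

```python
class LibraryTimeoutError(Exception):
    """Hypothetical stand-in for a third-party TimeoutError class."""


def render():
    # Simulates a render call that times out inside the library.
    raise LibraryTimeoutError("Navigation Timeout Exceeded: 30000 ms exceeded.")


def try_render(exception_type):
    """Return True if `exception_type` catches the error raised by render()."""
    try:
        render()
    except exception_type:
        return True
    except Exception:
        return False


# The built-in class is unrelated to the stand-in, so it does not match.
builtin_catches = try_render(TimeoutError)
# The library's own class matches.
library_catches = try_render(LibraryTimeoutError)
print(builtin_catches, library_catches)
```

An except clause matches on the class hierarchy, not on the class name, which is why two classes that are both called TimeoutError can behave differently here.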

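You also need to set a timeout on your session.get() call.

Which exception you get depends on what went wrong. If you have a bad URL, for instance, session.get() raises an error before the page is ever rendered. The following example shows the different errors you can catch: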

from requests_html import HTMLSession
from requests.exceptions import ConnectionError, InvalidSchema, ReadTimeout
# pyppeteer's TimeoutError; in this module the name now shadows the built-in TimeoutError
from pyppeteer.errors import TimeoutError

session = HTMLSession()

links = [
    'https://www.google.com/',
    'h**ps://www.google.com/',
    'https://deelay.me/4000/https://www.google.com/', # 4s of delay to get the page
    'https://www.baaaadurl.com/', 
    'https://www.youtube.com/', 
    'https://www.google.com/',

]

for url in links:
    try:
        r = session.get(url, timeout=3)
        r.html.render(timeout=1)  # timeout short enough to render google but not youtube
        print(r.html.find('title', first=True).text, '\n')
    except InvalidSchema as e:
        # raised for 'h**ps://www.google.com/'
        print(f'For the url "{url}" the error is: {e} \n')
    except ReadTimeout as e:
        # raised because of the long delay for
        # 'https://deelay.me/4000/https://www.google.com/'
        print(f'For the url "{url}" the error is: {e} \n')
    except ConnectionError as e:
        # raised for 'https://www.baaaadurl.com/'
        print(f'For the url "{url}" the error is: {e} \n')
    except TimeoutError as e:
        # raised when rendering times out,
        # as for the page 'https://www.youtube.com/'
        print(f'For the url "{url}" the error is: {e} \n')

Printed output:

Google 

For the url "h**ps://www.google.com/" the error is: No connection adapters were found for 'h**ps://www.google.com/' 

For the url "https://deelay.me/4000/https://www.google.com/" the error is: HTTPSConnectionPool(host='deelay.me', port=443): Read timed out. (read timeout=3) 

For the url "https://www.baaaadurl.com/" the error is: HTTPSConnectionPool(host='www.baaaadurl.com', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f2596ba6460>: Failed to establish a new connection: [Errno -2] Name or service not known')) 

For the url "https://www.youtube.com/" the error is: Navigation Timeout Exceeded: 1000 ms exceeded. 

Google 

This way you can catch the errors and continue the loop.
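
If you would rather keep the skip logic in one place, the catch-and-continue pattern can be sketched as a small helper. Everything here is hypothetical and not part of requests-html: `scrape_all`, `fake_fetch`, and `FakeTimeout` are names made up for illustration, and the fetch function is faked so the sketch runs without a network connection. With requests-html, `fetch` would presumably wrap the `session.get(...)` and `r.html.render(...)` calls, and `skip_on` would list the exception classes imported in the answer's example.

```python
def scrape_all(links, fetch, skip_on):
    """Call fetch(link) for every link, skipping links that raise an error in skip_on."""
    results = {}
    skipped = []
    for link in links:
        try:
            results[link] = fetch(link)
        except skip_on as exc:
            # Record the failure and move on to the next link.
            skipped.append((link, str(exc)))
    return results, skipped


class FakeTimeout(Exception):
    """Hypothetical stand-in for a render timeout error."""


def fake_fetch(link):
    # Pretend that any link containing "slow" fails to render in time.
    if "slow" in link:
        raise FakeTimeout("Navigation Timeout Exceeded")
    return f"title of {link}"


results, skipped = scrape_all(["a", "slow-b", "c"], fake_fetch, (FakeTimeout,))
print(results)
print(skipped)
```

Passing the exception classes as a tuple keeps the list of skippable errors explicit, so an unexpected error still stops the run instead of being silently swallowed.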

【Comments】:
