Python requests-html session GET correct usage

【Question title】: Python requests-html session GET correct usage
【Posted】: 2020-11-02 19:52:49
【Question description】:

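I am developing a web scraper that needs to open thousands of pages and extract some data. Because one of the most important fields I need is only populated after all of the site's JavaScript has loaded, I use requests-html to render each page and then extract the data I need.

I would like to know what the best approach is:

1- Open one session at the start of the script, do the entire scrape, and close the session only when the script finishes, after thousands of "hits" and several hours?

2- Or should I open a session for each link, render the page, get the data, and close the session, repeating that cycle n times?

Currently I am doing the second option, but I am running into a problem. This is the code I am using: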

def getSellerName(listingItems):
    for item in listingItems:
        builtURL = item['href']
        try:
            session = HTMLSession()
            r = session.get(builtURL,timeout=5)
            r.html.render()
            sleep(1)
            sellerInfo = r.html.search("<ul class=\"seller_name\"></ul></div><a href=\"{user}\" target=")["user"]
            ##
            ##Do some stuff with sellerinfo
            ##
            session.close()
        except requests.exceptions.Timeout:
            log.exception("TimeOut Ex: ")
            continue
        except:
            log.exception("Gen Ex")
            continue
        finally:    
            session.close()
        break

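This works well and is fast. However, after about 1.5 to 2 hours, I start getting an OS exception like this:

OSError: [Errno 24] Too many open files

And that is it: from then on I get this exception over and over until I kill the script.

I am guessing I need to close something else after each get and render, but I am not sure what, or whether I am doing this correctly.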
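
One rough way to check whether something is being left open is to watch the process's open file descriptor count while the scrape runs. The sketch below assumes Linux (the question is tagged ubuntu), where /proc/self/fd lists the descriptors of the current process. The helper names open_fd_count and fd_soft_limit are hypothetical and are not part of requests-html.

```python
import os
import resource  # Unix-only standard library module


def open_fd_count():
    """Return how many file descriptors this process has open (Linux only).

    Listing the directory briefly opens one extra descriptor, so treat the
    value as a trend indicator rather than an exact figure.
    """
    return len(os.listdir("/proc/self/fd"))


def fd_soft_limit():
    """Return the soft limit on open files that triggers Errno 24."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft
```

Logging open_fd_count() every few hundred pages, next to fd_soft_limit(), should show whether the count climbs steadily toward the limit between iterations.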

Any help and/or suggestions?

Thanks!

【Question discussion】:

Tags: python ubuntu web-scraping python-requests python-requests-html

【Solution 1】:

You should create the session object once, outside the loop:

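from requests_html import HTMLSession

def getSellerName(listingItems):
    session = HTMLSession()  # create the session once, outside the loop
    for item in listingItems:
        # ... same per-item code as before, reusing `session` ...
        pass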
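
The advice above can be sketched a little more fully. This is a minimal illustration, not a confirmed fix for the Errno 24 error: it assumes the session object exposes get() and close() as HTMLSession does, and the names scrape_with_one_session, session_factory and handle_response are hypothetical. The session is created once, shared by every request, and closed exactly once even if an item fails.

```python
import logging
from contextlib import closing

log = logging.getLogger(__name__)


def scrape_with_one_session(urls, session_factory, handle_response):
    """Fetch every URL through one shared session and close it exactly once.

    session_factory: callable returning an object with get() and close(),
    for example requests_html.HTMLSession (assumption).
    handle_response: callable(url, response) that extracts the data.
    Returns the number of URLs processed without an exception.
    """
    processed = 0
    # closing() calls session.close() when the block exits, even on errors.
    with closing(session_factory()) as session:
        for url in urls:
            try:
                r = session.get(url, timeout=5)
                r.html.render()  # requests-html: execute the page's JavaScript
                handle_response(url, r)
                processed += 1
            except Exception:
                # Log the failure and move on to the next URL.
                log.exception("Failed on %s", url)
                continue
    return processed


# Intended usage with requests-html (not executed here):
#   from requests_html import HTMLSession
#   scrape_with_one_session(urls, HTMLSession, handle_response)
```

Passing the session class in as a parameter keeps the loop easy to test with a stand-in object, and the single with block replaces the paired session.close() calls in the try and finally branches of the original code.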

【Discussion】: