Python webscraping Javascript with Await: answers

【Question title】: Python webscraping Javascript with Await
【Posted】: 2022-02-03 12:01:53
【Question description】:

I have a question about web scraping with Python. I am trying to get the data from the first table on https://www.nyse.com/ipo-center/filings using from requests_html import AsyncHTMLSession.

My code is here:

from bs4 import BeautifulSoup
from requests_html import AsyncHTMLSession

#first define the URL and start the session
url = 'http://www.nyse.com/ipo-center/filings'
session = AsyncHTMLSession()

#then get the URL content, and load the html content after parsing through the javascript
r = await session.get(url)
await r.html.arender()

#then we create a beautifulsoup object based on the rendered html
soup = BeautifulSoup(r.html.html, "lxml")

#then we find the first datatable, which is the one that contains upcoming IPO data
table1 = soup.find('table', class_='table table-data table-condensed spacer-lg')

Now I have two problems:

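1. The website often does not return any useful content for table1, so I cannot get at the underlying data in the table. So far I have worked around this by simply waiting a few seconds and running the loop again until the dataframe is loaded. That is probably not the best option, though.
2. The code does work in Jupyter Notebook, but as soon as I upload it to my server as a .py file, I get the error message SyntaxError: 'await' outside async function.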

Does anyone have a solution for the two problems above?

【Question discussion】:

stackoverflow.com/questions/59130200/…
Not sure about #1, but for #2 you can fix this by putting the logic into an async function.

Tags: python web-scraping async-await

【Solution 1】:

Since you are using coroutines, you need to wrap them in an async function. See the example below.

from bs4 import BeautifulSoup
from requests_html import AsyncHTMLSession

#first define the URL and start the session
url = 'http://www.nyse.com/ipo-center/filings'
session = AsyncHTMLSession()

#then get the URL content, and load the html content after parsing through the javascript
async def get_page():
    r = await session.get(url)
    await r.html.arender(timeout=20)
    # return the rendered HTML; r.text is the raw response from before the JavaScript ran
    return r.html.html

data = session.run(get_page)

#then we create a beautifulsoup object based on the rendered html
soup = BeautifulSoup(data[0], "lxml")

#then we find the first datatable, which is the one that contains upcoming IPO data
table1 = soup.find_all('table', class_='table table-data table-condensed spacer-lg')
print(table1)
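
The example above only covers the second problem. For the first one (the table sometimes comes back empty), the "wait a few seconds and try again" workaround from the question can be written as a small polling helper with a fixed number of attempts. This is a minimal sketch, not part of the original answer: poll_until, make_fake_fetch and has_rows are hypothetical names, and the fake fetch stands in for a real coroutine that would call session.get and arender. The sleep and wait arguments of arender may also be worth trying before resorting to a retry loop.

```python
import asyncio


async def poll_until(fetch, is_ready, attempts=5, delay=2.0):
    """Call the coroutine function `fetch` until `is_ready(result)` is true.

    Raises TimeoutError if no attempt produces a usable result.
    """
    for attempt in range(attempts):
        result = await fetch()
        if is_ready(result):
            return result
        # Do not sleep after the final attempt.
        if attempt < attempts - 1:
            await asyncio.sleep(delay)
    raise TimeoutError(f"no usable result after {attempts} attempts")


def make_fake_fetch():
    """Hypothetical stand-in for the real page fetch: empty twice, then filled."""
    pages = iter([
        "<table></table>",
        "<table></table>",
        "<table><tr><td>ACME</td></tr></table>",
    ])

    async def fetch():
        return next(pages)

    return fetch


def has_rows(html):
    """Assumed readiness check: the table counts as loaded once it has a row."""
    return "<tr>" in html


if __name__ == "__main__":
    html = asyncio.run(poll_until(make_fake_fetch(), has_rows, delay=0))
    print(html)
```

In a real script, fetch would be something like get_page from the answer, and is_ready could parse the HTML with BeautifulSoup and check that the table contains rows.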

【Discussion】:

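Thanks for your input! The script runs on my server running Ubuntu, where the file is a .py, but it does not work in Jupyter Notebook on my local laptop. It fails with the error "This event loop is already running". Do you know why?
This is a well-known issue, and it has to do with how Jupyter is set up. If you do some digging, there are several questions on Stack Overflow that explain it. It is best to run async code in an IDE other than Jupyter.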
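
To illustrate the difference between the two environments: as far as I understand it, a Jupyter kernel already has an asyncio event loop running, which is why a bare await works in a notebook cell and why starting the loop a second time fails there. A plain .py script has no running loop, so it needs an async function plus an explicit entry point. The sketch below uses only the standard library. fetch_rendered_html is a hypothetical stand-in for the real coroutine, and loop_is_running and run_from_script are illustrative names, not part of requests_html.

```python
import asyncio


async def fetch_rendered_html():
    # Hypothetical stand-in for the real work (session.get followed by arender).
    await asyncio.sleep(0)
    return "<html><table></table></html>"


def loop_is_running():
    """Return True when called from inside a running asyncio event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        return False
    return True


def run_from_script():
    """Entry point for a plain .py file, where no loop is running yet."""
    if loop_is_running():
        # This is the situation inside a notebook cell: await the coroutine directly.
        raise RuntimeError("a loop is already running; use 'await fetch_rendered_html()'")
    return asyncio.run(fetch_rendered_html())


if __name__ == "__main__":
    html = run_from_script()
    print(len(html))
```

In a notebook cell the equivalent would simply be html = await fetch_rendered_html(), with no call to asyncio.run or session.run.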