【Question Title】: How to render an asynchronous page with requests-html in a multithreaded environment?
【Posted】: 2019-02-19 14:45:13
【Question Description】:

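To build a scraper for pages whose content is loaded dynamically, requests-html provides a way to get the page as rendered after its JavaScript has run. However, when I use AsyncHTMLSession and call the arender() method in a multithreaded implementation, the HTML I get back does not change.

For example, at the URL used in the source code below, the table cells are empty by default. After the page's scripts run (which arender() is supposed to emulate), the values should be inserted into the markup, yet I see no change in the returned source.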

from pprint import pprint

#from bs4 import BeautifulSoup
import asyncio
from timeit import default_timer
from concurrent.futures import ThreadPoolExecutor

from requests_html import AsyncHTMLSession, HTML

async def fetch(session, url):
    r = await session.get(url)
    await r.html.arender()
    return r.content

def parseWebpage(page):
    print(page)

async def get_data_asynchronous():  
    urls = [
        'http://www.fpb.pt/fpb2014/!site.go?s=1&show=jog&id=258215'
    ]  

    with ThreadPoolExecutor(max_workers=20) as executor:
        with AsyncHTMLSession() as session:
            # Set any session parameters here before calling `fetch` 

            # Initialize the event loop        
            loop = asyncio.get_event_loop()

            # Use a list comprehension to build the list of tasks.
            # The executor calls `fetch` for each url in the `urls` list;
            # since `fetch` is a coroutine function, each call only creates
            # a coroutine object, which gather() below actually runs.
            tasks = [
                await loop.run_in_executor(
                    executor,
                    fetch,
                    *(session, url) # Allows us to pass in multiple arguments to `fetch`
                )
                for url in urls
            ]

            # Initializes the tasks to run and awaits their results
            for response in await asyncio.gather(*tasks):
                parseWebpage(response)

def main():
    loop = asyncio.get_event_loop()
    future = asyncio.ensure_future(get_data_asynchronous())
    loop.run_until_complete(future)

main()

【Discussion】:

    Tags: python multithreading web-scraping python-requests-html


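    【Solution 1】:

    After the render method has run, the rendered source is not available under the response's content attribute, which still holds the bytes as originally downloaded. It is stored under raw_html on the HTML object. In this case, the value returned from fetch should be r.html.raw_html.

    【Comments】: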
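The fix described above can be sketched as follows. This is a minimal sketch rather than a definitive implementation: the helper names `fetch_rendered`, `fetch_all`, and `main` are made up for illustration, the thread pool from the question is dropped on the assumption that the session's own event loop is enough to run the coroutines concurrently, and `main` assumes that `AsyncHTMLSession` can be created inside a running event loop and that its `close()` is awaitable.

```python
import asyncio


async def fetch_rendered(session, url):
    """Fetch `url`, run its JavaScript, and return the rendered HTML bytes."""
    r = await session.get(url)
    await r.html.arender()
    # r.content still holds the bytes as downloaded; the rendered markup
    # is stored on the HTML object, so it is read from r.html.raw_html.
    return r.html.raw_html


async def fetch_all(session, urls):
    """Render every URL concurrently; results keep the order of `urls`."""
    return await asyncio.gather(*(fetch_rendered(session, u) for u in urls))


async def main():
    # Imported here so the helpers above do not need the library at
    # import time (a choice made for this sketch only).
    from requests_html import AsyncHTMLSession

    urls = ['http://www.fpb.pt/fpb2014/!site.go?s=1&show=jog&id=258215']
    session = AsyncHTMLSession()  # assumed to pick up the running loop
    try:
        for page in await fetch_all(session, urls):
            print(page[:200])
    finally:
        await session.close()  # assumed awaitable; also closes the browser
```

To try it, call `asyncio.run(main())`. The first call to arender() may take a while, since the library downloads Chromium on first use.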
