Answers: Scraping dynamic pages with node.js and a headless browser

【Question title】: Scraping dynamic pages with node.js and headless browser
【Posted】: 2023-04-01 17:38:02
【Question description】:

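I am trying to scrape data from a dynamically loaded page. To do this, I am using the headless browser puppeteer.

In the code, Puppeteer appears as headlessBrowserClient.

The main challenge is to close the browser gracefully as soon as the required data has been received. However, if you close it before evaluateCustomCode has finished executing, the progress of evaluateCustomCode is lost.

evaluateCustomCode is a function that can be called as if it were being run in Chrome DevTools.

To control the network requests and the asynchronous flow of the puppeteer API, I used an async generator that encapsulates all of the logic described above.

The problem is that I feel the code smells, but I can't find a better solution.

Any ideas?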

module.exports = function buildClient (headlessBrowserClient) {
  const getPageContent = async (pageUrl, evaluateCustomCode) => {
    const request = sendRequest(pageUrl)
    const { value: page } = await request.next()

    if (page) {
      const pageContent = await page.evaluate(evaluateCustomCode)
      request.next()

      return pageContent
    }
  }

  async function * sendRequest (url) {
    const browser = await headlessBrowserClient.launch()
    const page = await browser.newPage()

    const state = {
      req: { url },
    }

    try {
      await page.goto(url)
      yield page
    } catch (error) {
      throw new APIError(error, state)
    } finally {
      yield browser.close()
    }
  }

  return {
    getPageContent,
  }
}

【Question discussion】:

Tags: javascript node.js web-scraping puppeteer headless-browser

【Solution 1】:

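You can use waitForFunction or waitFor (waitForSelector in current Puppeteer releases, where waitFor is no longer available) together with evaluate and Promise.all. No matter how dynamic the site is, you wait for some final condition to become true and close the browser when that happens.

Since I don't have access to your dynamic URL, I will use a made-up variable and a delay as an example. It resolves as soon as the variable returns a truthy value.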

await page.waitForFunction(() => !!someVariableThatShouldBeTrue);
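
If the flag might never be set, the wait can be bounded with a timeout. The following is a minimal sketch, assuming `page` is a Puppeteer Page and that a timed-out wait rejects with an error whose `name` is `TimeoutError`; the helper name `waitForFlag` and the 10-second default are hypothetical, not part of the original answer.

```javascript
// Hypothetical helper: wait until the page sets a global flag, but give up after timeoutMs.
// `page` is assumed to be a Puppeteer Page; the flag name comes from the example above.
async function waitForFlag(page, timeoutMs = 10000) {
  try {
    // The arrow function is executed inside the page, not in Node.
    await page.waitForFunction(
      () => !!window.someVariableThatShouldBeTrue,
      { timeout: timeoutMs }
    );
    return true;
  } catch (error) {
    // Assumption: a wait that times out rejects with an error named "TimeoutError".
    if (error && error.name === 'TimeoutError') return false;
    // Anything else (navigation failure, closed page, ...) is passed on to the caller.
    throw error;
  }
}
```

A caller could then decide whether to scrape or bail out, for example `if (await waitForFlag(page, 5000)) { ... }`.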

Or what if your dynamic page actually creates a selector somewhere after you evaluate the code? In that case,

await page.waitForSelector('someSelector')

Now back to your customCode; let me rename it slightly for you,

await page.evaluate(customCode)

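customCode is whatever sets the variable someVariableThatShouldBeTrue to true somewhere. Honestly, it could be anything: a request, a string, or anything else. The possibilities are endless.

You can also return a promise from page.evaluate; recent Chromium versions support them very well. So the following will work too, resolving as soon as the function or data has loaded. Make sure customCode is an async function or returns a promise.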

const pageContent = await page.evaluate(CustomCode);
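
As an illustration of a promise-returning function that could be passed to page.evaluate, here is a minimal sketch. It assumes the page eventually sets `someVariableThatShouldBeTrue` and stores its result in a global called `pageData`; that global and the 50 ms polling interval are hypothetical.

```javascript
// Hypothetical page function for page.evaluate(CustomCode).
// It runs in the browser and resolves once the page has set the flag.
function CustomCode() {
  return new Promise((resolve) => {
    const timer = setInterval(() => {
      if (globalThis.someVariableThatShouldBeTrue) {
        clearInterval(timer);
        // Only serializable values survive the trip back to Node.
        resolve(globalThis.pageData);
      }
    }, 50);
  });
}
```

Because page.evaluate waits for the returned promise, the browser can be closed right after the `await` without losing the result.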

OK, now we have all the pieces we need. I modified the code a bit so that it no longer smells to me :D,

module.exports = function buildClient(headlessBrowserClient) {
  return {
    getPageContent: async (url, CustomCode) => {
      const state = {
        req: { url },
      };
      // so that we can call them on "finally" block
      let browser, page;
      try {
        // launch browser
        browser = await headlessBrowserClient.launch()
        page = await browser.newPage()
        await page.goto(url)

        // evaluate and wait for something to happen
        // the first element is the pageContent; Promise.all resolves once both promises have resolved
        // (no inner await here, so evaluate and waitForFunction run concurrently)
        const [pageContent] = await Promise.all([
          page.evaluate(CustomCode),
          page.waitForFunction(() => !!someVariableThatShouldBeTrue)
        ])

        // Alternatively, return a promise from CustomCode; recent Chromium supports them very well
        // const pageContent = await page.evaluate(CustomCode)

        return pageContent;
      } catch (error) {
        throw new APIError(error, state)
      } finally {
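        // NOTE: Maybe we can move them on a different function
        // guard against launch() or newPage() having failed before these were assigned
        if (page) await page.close()
        if (browser) await browser.close()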
      }
    }
  }
}
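
The NOTE in the finally block hints at moving the cleanup into its own function. One way that could look is sketched below; `withPage` and the stand-in `fakeClient` are hypothetical names, and the fake client exists only so the sketch runs without Puppeteer installed. It relies on `browser.close()` also closing the browser's pages.

```javascript
// Hypothetical helper: open a page, run `work(page)`, and always release the browser.
async function withPage(headlessBrowserClient, url, work) {
  let browser;
  try {
    browser = await headlessBrowserClient.launch();
    const page = await browser.newPage();
    await page.goto(url);
    // `return await` keeps the browser open until `work` has finished.
    return await work(page);
  } finally {
    // browser is still undefined if launch() itself failed
    if (browser) await browser.close();
  }
}

// Stand-in client that records calls; replace with the real headlessBrowserClient.
const calls = [];
const fakeClient = {
  launch: async () => ({
    newPage: async () => ({
      goto: async (url) => { calls.push(`goto ${url}`); },
      evaluate: async (fn) => { calls.push('evaluate'); return fn(); },
    }),
    close: async () => { calls.push('close'); },
  }),
};

const demo = withPage(fakeClient, 'https://example.com', (page) =>
  page.evaluate(() => 'page content')
);
demo.then((content) => console.log(content, calls));
```

With a helper like this, getPageContent reduces to a single call such as `withPage(headlessBrowserClient, url, (page) => page.evaluate(CustomCode))`, with the error wrapping kept in the caller.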

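You can make more changes and adjustments as needed. I did not test the final code (since I don't have APIError, evaluateCustomCode, and so on), but it should work.

It has none of those generators and the like. Promises: that is how you handle dynamic pages :D.

PS: IMO, questions like this are a better fit for Code Review.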

【Discussion】: