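【Question title】: How to get all HTML data after all scripts and page loading is done? (Puppeteer)
【Posted】: 2019-06-30 23:36:39
【Question】:

I finally figured out how to use Node.js and installed all the libraries and extensions. Puppeteer is working, but just as before with XMLHttpRequest, it only gets the template/body of the page, without the information I need. All the scripts on the page start a few seconds after it has been opened in a browser (a web app?). I need to get the information inside certain tags after the whole page has loaded. I would also like to know whether this is possible with plain JavaScript, because I do not use jQuery-like code, so that doubles the difficulty for me.

Here is what I have so far.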

const puppeteer = require('puppeteer');
const $ = require('cheerio');
let browser;
let page;

const url = "really long link with latitude and attitude";

(async () => puppeteer
  .launch()
  .then(await function(browser) {
    return browser.newPage();
})
  .then(await function(page) {
    return page.goto(url).then(function() {
      return page.content();
    });
  })
  .then(await function(html) {
    $('strong', html).each(function() {
      console.log($(this).text());
    });
  })
  .catch(function(err) {
    //handle error
  }))();

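I only get the default template body elements inside the strong tags, but the page should contain much more data than 10 items.

【Comments】:

  • Mixing async/await with then() is a bit odd. Usually it is const browser = await puppeteer.launch(); const page = await browser.newPage(); ... and so on.

Tags: javascript node.js parsing web-scraping puppeteer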
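
To illustrate why the mix in the question is odd, here is a minimal sketch in plain JavaScript (no Puppeteer involved; `demo` and `handler` are hypothetical names). Applying `await` to a function expression does not wait for anything: it simply yields the function itself, so `.then(await function ...)` behaves the same as `.then(function ...)`.

```javascript
// `await` applied to a non-promise value just yields that value,
// so `await function () {}` evaluates to the function itself.
async function demo() {
  const handler = await function (value) { return value * 2; };
  // handler is a plain function; then() receives it unchanged.
  return Promise.resolve(21).then(handler);
}

demo().then((result) => console.log(result)); // prints 42
```

The `await` keywords inside the `then()` calls in the question therefore add nothing; the waiting is done entirely by the promise chain.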


【Solution 1】:

let bodyHTML = await page.evaluate(() => document.documentElement.outerHTML);

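Use this.

【Comments】:

  • While this may answer the question, please add a few words describing your solution.
【Solution 2】:

If you want the same full HTML that you see in the inspector, here it is: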

    const puppeteer = require('puppeteer');

    (async function main() {
      try {
        const browser = await puppeteer.launch();
        const [page] = await browser.pages();

        await page.goto('https://example.org/', { waitUntil: 'networkidle0' });
        const data = await page.evaluate(() => document.querySelector('*').outerHTML);

        console.log(data);

        await browser.close();
      } catch (err) {
        console.error(err);
      }
    })();

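【Comments】:

  • How is this different from await page.content()?
【Solution 3】:

Some notes:

  1. You don't need cheerio with puppeteer, and you don't need to re-parse page.content(): you already have the full DOM with all scripts run. Using page.evaluate(), you can evaluate any code in the window context, just as in a browser, and transfer serializable data between the Web API context and the Node.js API context.

  2. Try to use only async/await; it will simplify your code and its flow.

  3. If you need to wait until all scripts and other dependencies have loaded, use waitUntil: 'networkidle0' in page.goto().

  4. If you suspect that the document's scripts need some time to reach the required state, use one of the waiting functions such as page.waitForSelector(), or fall back to page.waitFor(milliseconds) (note that page.waitFor() was deprecated in later Puppeteer releases).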
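
As a rough sketch of how notes 1, 3, and 4 can fit together, the hypothetical helper below takes a Puppeteer page, navigates with waitUntil: 'networkidle0', waits for a selector, and then collects text inside the page context with page.evaluate(). The name `getTextsWhenReady`, its parameters, and the 10-second default timeout are assumptions made for illustration, not part of the original answer.

```javascript
// Hypothetical helper: `page` is a Puppeteer Page created elsewhere,
// e.g. with `const page = await browser.newPage();`.
// The selector and the timeout default are illustrative assumptions.
async function getTextsWhenReady(page, url, selector, timeoutMs = 10000) {
  // Note 3: resolve only after the network has gone idle.
  await page.goto(url, { waitUntil: 'networkidle0' });

  // Note 4: also wait for an element that the page scripts create.
  await page.waitForSelector(selector, { timeout: timeoutMs });

  // Note 1: query the live DOM in the browser context; only an array
  // of plain strings is serialized back to Node.js.
  return page.evaluate(
    (sel) => Array.from(document.querySelectorAll(sel), (el) => el.textContent.trim()),
    selector
  );
}
```

For the question's case, a call such as `await getTextsWhenReady(page, url, 'strong')` would stand in for the cheerio step, provided the strong elements only appear once the scripts have run.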

Here is a simple script that outputs the names of all the tags in the page.

'use strict';

const puppeteer = require('puppeteer');

(async function main() {
  try {
    const browser = await puppeteer.launch();
    const [page] = await browser.pages();

    await page.goto('https://example.org/', { waitUntil: 'networkidle0' });

    const data = await page.evaluate(
      () =>  Array.from(document.querySelectorAll('*'))
                  .map(elem => elem.tagName)
    );

    console.log(data);

    await browser.close();
  } catch (err) {
    console.error(err);
  }
})();

If you describe your task in more detail, we can try to write something more appropriate.


A script for www.bezrealitky.cz (the task from the comments below):

'use strict';

const fs = require('fs');
const puppeteer = require('puppeteer');

(async function main() {
  try {
    const browser = await puppeteer.launch();
    const [page] = await browser.pages();
    page.setDefaultTimeout(0);

    await page.goto('https://www.bezrealitky.cz/vyhledat?offerType=pronajem&estateType=byt&disposition=&ownership=&construction=&equipped=&balcony=&order=timeOrder_desc&boundary=%5B%5B%7B%22lat%22%3A50.171436864513%2C%22lng%22%3A14.506905276796942%7D%2C%7B%22lat%22%3A50.154133576294%2C%22lng%22%3A14.599004629591036%7D%2C%7B%22lat%22%3A50.14524430128%2C%22lng%22%3A14.58773054712799%7D%2C%7B%22lat%22%3A50.129307131988%2C%22lng%22%3A14.60087568578706%7D%2C%7B%22lat%22%3A50.122604734575%2C%22lng%22%3A14.659116306376973%7D%2C%7B%22lat%22%3A50.106512499343%2C%22lng%22%3A14.657434650206028%7D%2C%7B%22lat%22%3A50.090685542974%2C%22lng%22%3A14.705099547441932%7D%2C%7B%22lat%22%3A50.072175921973%2C%22lng%22%3A14.700004206235008%7D%2C%7B%22lat%22%3A50.056898491904%2C%22lng%22%3A14.640206899053055%7D%2C%7B%22lat%22%3A50.038528576841%2C%22lng%22%3A14.666852728301023%7D%2C%7B%22lat%22%3A50.030955909657%2C%22lng%22%3A14.656128752460972%7D%2C%7B%22lat%22%3A50.013435368522%2C%22lng%22%3A14.66854956530301%7D%2C%7B%22lat%22%3A49.99444182116%2C%22lng%22%3A14.640153080292066%7D%2C%7B%22lat%22%3A50.010839032542%2C%22lng%22%3A14.527474219359988%7D%2C%7B%22lat%22%3A49.970771602447%2C%22lng%22%3A14.46224174052395%7D%2C%7B%22lat%22%3A49.970669964027%2C%22lng%22%3A14.400648545303966%7D%2C%7B%22lat%22%3A49.941901176098%2C%22lng%22%3A14.395563234671044%7D%2C%7B%22lat%22%3A49.948384148423%2C%22lng%22%3A14.337635637038034%7D%2C%7B%22lat%22%3A49.958376114735%2C%22lng%22%3A14.324977842107955%7D%2C%7B%22lat%22%3A49.9676286223%2C%22lng%22%3A14.34491711110104%7D%2C%7B%22lat%22%3A49.971859099005%2C%22lng%22%3A14.326815050839059%7D%2C%7B%22lat%22%3A49.990608728081%2C%22lng%22%3A14.342731259186962%7D%2C%7B%22lat%22%3A50.002211140429%2C%22lng%22%3A14.29483886971002%7D%2C%7B%22lat%22%3A50.023596577558%2C%22lng%22%3A14.315872285282012%7D%2C%7B%22lat%22%3A50.058309376419%2C%22lng%22%3A14.248086830069042%7D%2C%7B%22lat%22%3A50.073179111%2C%22lng%22%3A14.290193274400963%7D%2C%7B%22lat%22%3A50.102973823639%2C%22lng%22%3A14.224439442359994%7D%2C%7B%22lat%22%3A50.130060800171%2C%22lng%22%3A14.302396419107936%7D%2C%7B%22lat%22%3A50.116019827009%2C%22lng%22%3A14.360785349547996%7D%2C%7B%22lat%22%3A50.148005694843%2C%22lng%22%3A14.365662825877052%7D%2C%7B%22lat%22%3A50.14142969454%2C%22lng%22%3A14.394903042943952%7D%2C%7B%22lat%22%3A50.171436864513%2C%22lng%22%3A14.506905276796942%7D%2C%7B%22lat%22%3A50.171436864513%2C%22lng%22%3A14.506905276796942%7D%5D%5D&hasDrawnBoundary=1&mapBounds=%5B%5B%7B%22lat%22%3A50.289447077141126%2C%22lng%22%3A14.68724263943227%7D%2C%7B%22lat%22%3A50.289447077141126%2C%22lng%22%3A14.087801111111958%7D%2C%7B%22lat%22%3A50.039169221047985%2C%22lng%22%3A14.087801111111958%7D%2C%7B%22lat%22%3A50.039169221047985%2C%22lng%22%3A14.68724263943227%7D%2C%7B%22lat%22%3A50.289447077141126%2C%22lng%22%3A14.68724263943227%7D%5D%5D&center=%7B%22lat%22%3A50.16447196305031%2C%22lng%22%3A14.387521875272125%7D&zoom=11&locationInput=praha&limit=15');

    await page.waitForSelector('#search-content button.btn-icon');

    while (await page.$('#search-content button.btn-icon') !== null) {
      const articlesForNow = (await page.$$('#search-content article')).length;
      console.log(`Articles for now: ${articlesForNow}. Getting more...`);

      await Promise.all([
        page.evaluate(
          () => { document.querySelector('#search-content button.btn-icon').click(); }
        ),
        page.waitForFunction(
          old => document.querySelectorAll('#search-content article').length > old,
          {},
          articlesForNow
        ),
      ]);
    }

    const articlesAll = (await page.$$('#search-content article')).length;
    console.log(`All articles: ${articlesAll}.`);

    fs.writeFileSync('full.html', await page.content());
    fs.writeFileSync('articles.html', await page.evaluate(
      () => document.querySelector('#search-content div.b-filter__inner').outerHTML
    ));
    fs.writeFileSync('articles.txt', await page.evaluate(
      () => [...document.querySelectorAll('#search-content article')]
              .map(({ innerText }) => innerText)
              .join(`\n${'-'.repeat(50)}\n`)
    ));
    console.log('Saved.');

    await browser.close();
  } catch (err) {
    console.error(err);
  }
})();
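
The click-and-wait loop in the script above could also be factored into a reusable function. The following is only a sketch under assumptions: `loadAll`, its parameter names, and the `maxClicks` cap of 50 are hypothetical and were added so that the loop still terminates if the list turns out to be very long.

```javascript
// Hypothetical helper: clicks a "load more" button until it disappears
// or until maxClicks is reached. `page` is a Puppeteer Page created elsewhere.
async function loadAll(page, buttonSelector, itemSelector, maxClicks = 50) {
  let clicks = 0;

  while (clicks < maxClicks && (await page.$(buttonSelector)) !== null) {
    // Count the items currently in the DOM.
    const before = await page.$$eval(itemSelector, (els) => els.length);

    await Promise.all([
      // Resolves once more items exist than before the click.
      page.waitForFunction(
        (sel, old) => document.querySelectorAll(sel).length > old,
        {},
        itemSelector,
        before
      ),
      // Click inside the page context, as the script above does.
      page.evaluate((sel) => document.querySelector(sel).click(), buttonSelector),
    ]);

    clicks += 1;
  }

  // Final number of items after the last click.
  return page.$$eval(itemSelector, (els) => els.length);
}
```

With the selectors from the script above, a call might look like `await loadAll(page, '#search-content button.btn-icon', '#search-content article')`. Unlike the original loop, this version stops after `maxClicks` clicks even if the button is still present.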

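【Comments】:

  • Thanks, this works, but I have another question. There is a button on the page that I need to press to get more items. How can I do that? Also, if possible, I would like to get the HTML with all the data and parse it myself with querySelector, which would be much easier for me.
  • That depends on what the button click does: whether it starts a navigation, sends a fetch or XHR request, or just performs some dynamic DOM manipulation. As for the second question, I am not sure I understand it. Maybe you can provide the URL and describe what you need to achieve?
  • tinyurl.com/y9vgf2h7 Below all the apartment offers there is a button that loads more. I want to get the HTML of this page with all the apartment offers and parse it later with querySelector.
  • Do you mean the "Zobrazit dalších 15 nabídek" button? Do you want to click it until all the offers are shown? I have clicked it several times and the list keeps growing. Is there a limit to how far this list grows?
  • Yes, that button. I think it does end :). At least I remember that it did.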