Puppeteer-cluster using tab and taking screenshot

【Question title】: Puppeteer-cluster using tab and taking screenshot
【Posted】: 2019-06-25 20:52:17
【Question description】:

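I am using puppeteer-cluster together with the ImageMagick (convert) / xwd commands to capture a screenshot of the entire desktop.

I need the browser showing the visible part of the page, along with the browser's navigation buttons and the URL. Most of the time I get the screenshot, but sometimes it fails.

The error message indicates that the tab had already been closed when the screenshot was taken. Please suggest what I am doing wrong.

The code runs on Linux, with X running on DISPLAY:0.3. I can see

Below is the code, in which I tried blockingWait: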

const {
  Cluster
} = require('puppeteer-cluster');
const execSync = require('child_process').execSync;

process.env['DISPLAY'] = ':0.3';
let i = 0;

function wait(time) {
  return new Promise((resolve) => setTimeout(resolve, time));
}

function blockingWait(seconds) {
  //simple blocking technique (wait...)
  var waitTill = new Date(new Date().getTime() + seconds * 1000);
  while (waitTill > new Date()) {}
}

function getscreenshot(url, page) {
  page.bringToFront(); // Get the tab to focus 
  wait(200);
  i = i + 1; // For now get screenshot as number will add image named based on URL 
  path = i + '.jpg';
  var r = execSync('import -window root ' + path);
  console.log('Taken screenshot: ' + path);
  console.log(url);
  blockingWait(1);
}

(async () => {
  // Create a cluster with 6 workers or 6 tabs which loads all the url
  const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_PAGE,
    maxConcurrency: 6,
    timeout: 120000,
    puppeteerOptions: {
      executablePath: 'google-chrome-stable',
      args: [
        '--ignore-certificate-errors',
        '--no-sandbox',
        '--incognito',
        '--disable-infobars',
        '--disable-setuid-sandbox',
        '--window-size=1600,1200',
        '--start-maximized',
        '--disable-gpu'
      ],
      headless: false, //headless:false so we can watch the browser as it works
    },
  });
  console.log('cluster launched');

  // We don't define a task and instead use own functions
  const screenshot = async ({
    page,
    data: url
  }) => {
    console.log('screenshot entered ');
    await page.setExtraHTTPHeaders({
      'CUSTOMER-ID': "66840"
    }, ); // use same customer id as header
    await page.setViewport({
      width: 1600,
      height: 1200
    });
    await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3419.0 Safari/537.36');
    await page.goto(url, {
      waitUntil: 'domcontentloaded'
    }, {
      waitUntil: 'networkidle0'
    }, {
      waitUntil: 'load'
    });
    // Since we wait the page to fully load

    await page.waitForResponse(response => response.ok()) // ok page is ready .. will deal here for other HTTP error beside 200, 404,500 etc 

    await page.waitForNavigation({
      waitUntil: 'domcontentloaded'
    }, {
      waitUntil: 'networkidle0'
    }, ); // Wait for page to load before screenshot
    await page.bringToFront(); // Get the tab to focus 
    wait(100); // Blocking wait
    console.log('Waiting 5 sec');
    blockingWait(5); // different kind of wait
    getscreenshot(url, page);
    console.log('screenshot exited');
  };

  const extractTitle = async ({
    page,
    data: url
  }) => {
    console.log('scrapelinks entered');
    await page.setExtraHTTPHeaders({
      'CUSTOMER-ID': "66840"
    }, );
    await page.setViewport({
      width: 1600,
      height: 1200
    });
    await page.goto(url);
    const pageTitle = await page.evaluate(() => document.title); // will later used to confirm the page matches with client details.
    // get all Links on the page
    const hrefs = await page.$$eval('a', hrefs => hrefs.map((a) => {
      return {
        href: a.href,
        text: a.textContent,
      };
    }));
    // get 1st links matching text or link value having bioanalyzer-systems/instrument-2100.xhtml
    for (let postUrl of hrefs) {
      if (postUrl.text.indexOf("Client-s") > -1) {
        cluster.execute(postUrl.href, screenshot); // add this link also to queue
      } else if (postUrl.href.indexOf("bioanalyzer-systems/instrument-2100.xhtml") > -1) {
        cluster.execute(postUrl.href, screenshot); // add this url to queue
        break;
      }
    }
    console.log('scrapelinks exited');
  };

  // Make screenshots
  cluster.execute('http://www.internal-site.int/en/product/66840?product=NEW&CodeList=bio&Id=66840', screenshot);
  cluster.execute('http://www.internal-site.int/en/product/66840?product=USED&CodeList=nonbio&Id=66840', screenshot);

  // But also do some other stuff
  cluster.execute('http://www.internal-site.int/en/product/66840?product=NEW&CodeList=bio&Id=66840', extractTitle);
  cluster.execute('http://www.internal-site.int/en/product/66840?product=USED&CodeList=nonbio&Id=66840', extractTitle);

  await cluster.idle();
  await cluster.close();
})();

I expect the screenshot to be taken once the page (tab) has finished loading.

【Question discussion】:

Tags: node.js puppeteer puppeteer-cluster

【Solution 1】:

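The page is closed as soon as the task function finishes (or its Promise resolves). You are not using await to wait for the asynchronous operations to complete.

For example, your screenshot function contains the following code: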

wait(100);
console.log('Waiting 5 sec');
blockingWait(5);
getscreenshot(url, page);
console.log('screenshot exited');

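The first line calls the wait function (which is asynchronous: it returns a Promise), but because you do not await it, the timer runs in the background and Node.js carries on executing your script immediately.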
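
To make the difference concrete, here is a minimal sketch that reuses the wait helper from the question. The two wrapper functions, withoutAwait and withAwait, are hypothetical names introduced only for this illustration; each returns how many milliseconds elapsed around the call.

```javascript
// Promise-based delay, the same helper as in the question.
function wait(time) {
  return new Promise((resolve) => setTimeout(resolve, time));
}

// Hypothetical helper: calls wait() without await.
// The Promise is created, but nothing pauses for it.
async function withoutAwait() {
  const start = Date.now();
  wait(200);
  return Date.now() - start; // close to 0 ms
}

// Hypothetical helper: calls wait() with await.
// Execution of this function pauses until the timer fires.
async function withAwait() {
  const start = Date.now();
  await wait(200);
  return Date.now() - start; // roughly 200 ms
}
```

In the screenshot task, the unawaited wait(100) behaves like withoutAwait: it does not delay anything that follows it.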

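blockingWait is not an idiomatic way to write JavaScript. It blocks execution entirely: while the loop spins, no timers or other callbacks can run.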
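
The effect can be observed with a small experiment, sketched below under the assumption that it runs in plain Node.js. The blockingWait here takes milliseconds instead of seconds, and measureTimerDelay is a hypothetical helper added for the demonstration: it schedules a 10 ms timer, then busy-waits, and reports when the timer callback actually ran.

```javascript
// Busy-wait variant of the question's blockingWait (milliseconds here).
function blockingWait(ms) {
  const waitTill = Date.now() + ms;
  while (Date.now() < waitTill) {} // spins and keeps the event loop busy
}

// Hypothetical helper: resolves with the real delay (in ms) after which
// a timer scheduled for 10 ms was able to fire.
function measureTimerDelay(blockMs) {
  return new Promise((resolve) => {
    const start = Date.now();
    setTimeout(() => resolve(Date.now() - start), 10);
    blockingWait(blockMs); // the timer callback cannot run until this returns
  });
}
```

Calling measureTimerDelay(300) resolves with a value of at least 300 rather than about 10, because the timer callback is held back until the busy loop ends. Anything else that depends on the event loop is held back in the same way.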

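The getscreenshot function should likewise be async so that you can await it. In addition, some of the Puppeteer calls need an await in front of them (for example page.bringToFront) so that they finish before the next step runs.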
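
One possible shape for such a function is sketched below. It is not a verified fix for the original setup. It assumes ImageMagick's import command is on the PATH (as in the question) and that page is a Puppeteer Page. The names getScreenshot and captureDesktop, and the optional capture parameter, are hypothetical and exist only to keep the sketch small and testable. The 200 ms pause is carried over from the question, not a known-good value.

```javascript
const { execFile } = require('child_process');
const { promisify } = require('util');

const execFileAsync = promisify(execFile);

// Promise-based delay, as in the question.
function wait(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Hypothetical helper: runs `import -window root <path>` without
// blocking the event loop (unlike execSync).
async function captureDesktop(path) {
  await execFileAsync('import', ['-window', 'root', path]);
}

// Hypothetical async replacement for getscreenshot: every step is awaited,
// so the returned Promise resolves only after the capture has finished.
async function getScreenshot(page, path, capture = captureDesktop) {
  await page.bringToFront(); // focus the tab and wait until that is done
  await wait(200);           // give the window a moment to repaint
  await capture(path);       // take the desktop screenshot
  return path;
}
```

Inside the task function this would be called as await getScreenshot(page, i + '.jpg'), so that the task does not finish, and the page is not closed, before the capture completes. Because the tabs share one desktop, several tasks running at once may still bring their own tabs to the front while another capture is in progress; whether that matters depends on the setup.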

In general, you should read up on the concepts of async/await and Promises to understand where and why to use these keywords.

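【Comments】:

I am not sure how to fix this. I have used a Promise as suggested in another post, but it does not work. I call await page.bringToFront(); and then take the screenshot, but that does not help. Something like ``` await Promise.all([ page.bringToFront(), execSync('import -window root ' + path +'.jpg'), screenshot(page,url), console.log(`closing page: ${url}`), page.close(), ]); ``` .. Any suggestions on how to bring the page to the front, wait for it to be fully displayed, and then take the screenshot using import (the ImageMagick utility)?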