【Question Title】: Looping through an array of links gives navigation timeout error - puppeteer
【Posted】: 2020-08-19 02:24:57
【Question】:

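I have an array of button elements. I want to click them one by one and, for each new tab that opens:

  1. Scrape some information and store it in an array called "providers"
  2. Close that tab

Although I can do this, I keep getting timeout errors because of the navigation component I use before browser.pages(). If I remove that component, I get a different timeout error. Also, every time I run the program, the timeout error occurs after a different number of iterations over the buttons array. Here is my code: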

const puppeteer = require("puppeteer");

(async () => {
  try {
    const browser = await puppeteer.launch({
      headless: false,
    });
    const page = await browser.newPage();

    //google.com
    await page.setExtraHTTPHeaders({ "Accept-Language": "en-US" });
    await page.goto("https://google.com");
    await page.type("input.gLFyf.gsfi", "hotels in london");
    await page.keyboard.press("Enter");

    //search results
    await page.waitForXPath('//span[contains(text(),"View ")]');
    const btn1 = await page.$x('//span[contains(text(),"View ")]');
    await btn1[0].click();

    //list of hotels
    await page.waitForXPath('//span[contains(text(),"Learn more")]');

    let hotels = [];
    
    //buttons array that contains a list of buttons
    let buttons = await page.$x("//button[contains(., 'View prices')]");
 
    //prints a different value each time the program is run
    console.log(buttons.length);
 
    //looping through buttons array
    for (var i = 0; i < buttons.length; i++) {

      //i = 1 or 0 when program hangs 
      console.log("got here " + i);

      //*******************************cause of timeout error******************************************

      await page.setDefaultNavigationTimeout(0);
      await Promise.all([
        page.waitForNavigation({ waitUntil: "load", timeout: 0 }),
        buttons[i].click(),
      ]);

      //***********************************************************************************************

      //getting all open tabs in an array
      const pages = await browser.pages();
      const page2 = pages[pages.length - 1];
      console.log(pages.length);

      //newly opened tab, sometimes program hangs before opening a new tab
      await page2
        .waitForSelector(
          "#prices > c-wiz > div > div.G86l0b > div > div > div > div > div > section > div.Hkwcrd.q9W60.A5WLXb.fLClSe > c-wiz > div > div > span > div > div > div > div > div > a > div > div.cFdfnb > div > span.mK0tQb > span",
          { timeout: 30000 }
        )
        .catch(() => console.log("Class doesn't exist!"));

      /*-----------------scraping information on new tab ----------------------------------*/

      console.log("going to start collecting providers");
      let providers = await page2.evaluate(() => {
        let data = [];
        let elements = document.querySelectorAll(
          "#prices > c-wiz > div > div.G86l0b > div > div > div > div > div > section > div.Hkwcrd.q9W60.A5WLXb.fLClSe > c-wiz > div > div > span > div > div > div > div > div > a > div > div.cFdfnb > div > span.mK0tQb > span"
        );
        for (var element of elements) data.push(element.textContent);
        return data;
      });
      console.log(providers.length);
      console.log("all done");
      console.log(providers);
      hotels.push(providers);

      //closing the new tab
      await page2.close();
    }
    
    await browser.close();
    return hotels;
  } catch (err) {
    console.error(err);
  }
})()
  .then((resolvedValue) => {
    console.log(resolvedValue);
  })
  .catch((rejectedValue) => {
    console.log(rejectedValue);
  });


To get rid of the error I used timeout: 0 and setDefaultNavigationTimeout(0), but now the program just freezes. This is the error I was getting before I disabled the timeout:

TimeoutError: Navigation timeout of 30000 ms exceeded
    at C:\Users\Me\Desktop\web_scraping_practice\node_modules\puppeteer\lib\LifecycleWatcher.js:100:111
    at async FrameManager.waitForFrameNavigation (C:\Users\Me\Desktop\web_scraping_practice\node_modules\puppeteer\lib\FrameManager.js:107:23)
    at async Frame.waitForNavigation (C:\Users\Me\Desktop\web_scraping_practice\node_modules\puppeteer\lib\FrameManager.js:298:16)
    at async Page.waitForNavigation (C:\Users\Me\Desktop\web_scraping_practice\node_modules\puppeteer\lib\Page.js:560:16)
    at async Promise.all (index 0)
    at async C:\Users\Me\Desktop\web_scraping_practice\backend.js:41:7
  -- ASYNC --
    at Frame.<anonymous> (C:\Users\Me\Desktop\web_scraping_practice\node_modules\puppeteer\lib\helper.js:116:19)
    at Page.waitForNavigation (C:\Users\Me\Desktop\web_scraping_practice\node_modules\puppeteer\lib\Page.js:560:53)
    at Page.<anonymous> (C:\Users\Me\Desktop\web_scraping_practice\node_modules\puppeteer\lib\helper.js:117:27)
    at C:\Users\Me\Desktop\web_scraping_practice\backend.js:42:14
    at processTicksAndRejections (internal/process/task_queues.js:97:5) {
  name: 'TimeoutError'
}
undefined

Thanks

【Comments】:

    Tags: navigation timeout puppeteer


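    【Solution 1】:

    I tried running your code. If you search for spans by their text content, it seems wise to hardcode the Chromium locale, because in my browser the labels are not in English. With a small tweak I did manage to open a tab with the hotel details. The problem is this selector: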

    $("#prices > c-wiz > div > div.G86l0b > div > div > div > div > div > section > div.Hkwcrd.q9W60.A5WLXb.fLClSe > c-wiz > div > div > span > div > div > div > div > div > a > div > div.cFdfnb > div > span.mK0tQb > span");
    

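    Unfortunately, this returns null. I believe classes such as div.Hkwcrd.q9W60.A5WLXb.fLClSe are generated dynamically. I'm not sure which information you actually want to extract, but I would try locating the DOM elements through the data-click-type attribute instead. In my case it yields: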

    document.querySelectorAll("div[data-click-type='283']");
    NodeList(18) [div.YPrvOd, div.YPrvOd, div.YPrvOd, div.YPrvOd, div.YPrvOd, div.YPrvOd, div.YPrvOd, div.YPrvOd, div.YPrvOd, div.YPrvOd, div.YPrvOd, div.YPrvOd, div.YPrvOd, div.YPrvOd, div.YPrvOd, div.YPrvOd, div.YPrvOd, div.YPrvOd]
    

    These appear to be the room types (superior double room, etc.). The "268" click type appears to be the sites that list the hotel (Booking, Hotels.com, etc.).
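A minimal sketch of the attribute-based lookup described above. The helper names `collectTexts` and `clickTypeSelector` are hypothetical, and the click-type values ("268", "283") are only what was observed in this answer, so they may differ or change on the live page. `page` is assumed to be a Puppeteer `Page`.

```javascript
// Hypothetical helper: build an attribute selector such as
// span[data-click-type='268'] from a tag name and a click-type value.
function clickTypeSelector(tag, clickType) {
  return `${tag}[data-click-type='${clickType}']`;
}

// Sketch: return the trimmed text of every element matching `selector`.
// `page` is assumed to be a Puppeteer Page; page.$$eval runs the callback
// in the browser context with an array of all matching elements.
async function collectTexts(page, selector) {
  return page.$$eval(selector, (elements) =>
    elements.map((el) => el.textContent.trim())
  );
}

// Possible usage inside an async function, after the details tab has loaded:
//   const providers = await collectTexts(page2, clickTypeSelector("span", "268"));
```

Keying on a data attribute rather than a long chain of generated class names should make the lookup less sensitive to markup changes, although the attribute values themselves are not guaranteed to stay stable.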

    The following code:

    const puppeteer = require("puppeteer");
    
    (async () => {
      try {
        const browser = await puppeteer.launch({
          headless: false,
        });
        const page = await browser.newPage();
    
        //google.com
        await page.setExtraHTTPHeaders({ "Accept-Language": "en-US" });
        await page.goto("https://google.com");
        await page.type("input.gLFyf.gsfi", "hotels in london");
        await page.keyboard.press("Enter");
    
        //search results
        await page.waitForXPath('//span[contains(text(),"View ")]');
        const btn1 = await page.$x('//span[contains(text(),"View ")]');
        await btn1[0].click();
    
        //list of hotels
        await page.waitForXPath('//span[contains(text(),"Learn more")]');
    
        let hotels = [];
    
        //buttons array that contains a list of buttons
        let buttons = await page.$x("//button[contains(., 'View prices')]");
    
        //prints a different value each time the program is run
        console.log(buttons.length);
    
        //looping through buttons array
        for (var i = 0; i < buttons.length; i++) {
    
          //i = 1 or 0 when program hangs
          console.log("got here " + i);
    
          //*******************************cause of timeout error******************************************
    
          await page.setDefaultNavigationTimeout(0);
          await Promise.all([
            page.waitForNavigation({ waitUntil: "load", timeout: 0 }),
            buttons[i].click(),
          ]);
    
          //***********************************************************************************************
    
          //getting all open tabs in an array
          const pages = await browser.pages();
          const page2 = pages[pages.length - 1];
          console.log(pages.length);
    
          //newly opened tab, sometimes program hangs before opening a new tab
          await page2
            .waitForSelector(
              "span[data-click-type='268']",
              { timeout: 30000 }
            )
            .catch(() => console.log("Class doesn't exist!"));
    
          /*-----------------scraping information on new tab ----------------------------------*/
    
          console.log("going to start collecting providers");
          let providers = await page2.evaluate(() => {
            let data = [];
            let elements = document.querySelectorAll(
              "span[data-click-type='268']"
            );
            for (var element of elements) data.push(element.textContent);
            return data;
          });
          console.log(providers.length);
          console.log("all done");
          console.log(providers);
          hotels.push(providers);
    
          //closing the new tab
          await page2.close();
        }
    
        await browser.close();
        return hotels;
      } catch (err) {
        console.error(err);
      }
    })()
      .then((resolvedValue) => {
        console.log(resolvedValue);
      })
      .catch((rejectedValue) => {
        console.log(rejectedValue);
      });
    

    produces the following output in my case:

    (node:16816) ExperimentalWarning: The fs.promises API is experimental
    12
    got here 0
    3
    going to start collecting providers
    16
    all done
    [ 'Booking.com',
      'Tripadvisor.com',
      'Agoda',
      'Hotels.com',
      'Booking.com',
      'Tripadvisor.com',
      'Agoda',
      'Hotels.com',
      'Expedia.com',
      'Destinia',
      'Stayforlong.com',
      'Trip.com',
      'ebookers.ie',
      'Etrip',
      'ZenHotels.com',
      'Nustay.com' ]
    got here 1
    

    I believe this is the list of providers. Note the selector used: span[data-click-type='268']

    【Discussion】:
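The output above repeats several names (Booking.com, Tripadvisor.com, Agoda and Hotels.com each appear twice). If one entry per provider is wanted, which is an assumption about the goal rather than something stated in the question, a small helper could deduplicate the list while keeping first-seen order. The name `uniqueProviders` is hypothetical.

```javascript
// Sketch: drop repeated provider names while keeping first-seen order.
// A Set preserves insertion order, so spreading it back into an array
// yields each name once, in the order it first appeared.
function uniqueProviders(names) {
  return [...new Set(names.map((name) => name.trim()))];
}

// Possible usage in the loop: hotels.push(uniqueProviders(providers));
```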

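    • I tried changing the selector, but the program still times out. Basically, I want the list of providers for every hotel shown for that city. Currently the program prints the provider list for only one or two hotels and then crashes. Do you know how I can fix this?
    • On the page where the program times out, have you tried opening the console in Chromium and looking for the selector manually?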
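One possible cause of the remaining timeout, not confirmed anywhere in this thread: `page.waitForNavigation` watches the original page, but each click opens the hotel details in a new tab, so the original page may never navigate and, with `timeout: 0`, the wait may never settle. A sketch of an alternative is to wait for the new tab itself with `browser.waitForTarget`. The helper name `clickAndGetNewTab` is hypothetical; `browser` and `button` are assumed to be a Puppeteer `Browser` and `ElementHandle`.

```javascript
// Sketch: click `button` and resolve with the Page of the tab it opens.
async function clickAndGetNewTab(browser, button, timeout = 30000) {
  // Remember the targets that exist before the click, so a tab left over
  // from an earlier iteration is never mistaken for the new one.
  const known = new Set(browser.targets());

  // Start waiting before clicking, so the new target cannot be missed.
  const [target] = await Promise.all([
    browser.waitForTarget(
      (t) => !known.has(t) && t.type() === "page",
      { timeout }
    ),
    button.click(),
  ]);
  return target.page();
}

// Possible usage inside the loop, replacing waitForNavigation and
// browser.pages():
//   const page2 = await clickAndGetNewTab(browser, buttons[i]);
//   ...scrape page2...
//   await page2.close();
```

Keeping a finite timeout here means a click that opens no tab surfaces as a TimeoutError instead of freezing the program.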