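【Question Title】: How to scrape a new page using puppeteer?
【Posted】: 2020-07-02 06:46:20
【Question】:

I am trying to scrape Reddit using puppeteer and Node.js. Here is my code, in which I:

  1. Open a page for Reddit's home page.
  2. Get all posts.
  3. For each post, get the link to its content page.
  4. Open a new page for each content page.
  5. Scrape each content page.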
const puppeteer = require("puppeteer");

const self = {
  browser: null,
  page: null,

  initialize: async () => {
    browser = await puppeteer.launch({
      headless: false,
    });
    page = await browser.newPage();

    // Go to the index page of Reddit
    await page.goto("https://old.reddit.com/", { waitUntil: "networkidle0" });
  },

  getResults: async () => {
    let platform = "Reddit";

    // Get all posts on the main page of Reddit.
    let mentions = await page.$$('#siteTable > div[class *= "thing"]');
    let results = [];

    // For each post:
    for (let mention of mentions) {
      let content = "";

      // I get the link to its content page.
      let content_URL = await mention.$eval(
        'p[class="title"] > a[class*="title"]',
        (node) => node.getAttribute("href").trim()
      );

      // If it is an internal link:
      if (content_URL.substr(0, 3) === "/r/") {

        // Create a new page to open that content page. 
        let contentPage = await browser.newPage();
        await contentPage.goto("https://old.reddit.com" + content_URL, {
          waitUntil: "networkidle0",
        });

        // Get the first paragraph of this content page.
        content = await contentPage.evaluate((contentPage) => {
          
          // Here is where the error occurred: 
          // Error: Evaluation failed: TypeError: Cannot read property 'querySelector' of undefined
          let firstParagraph = contentPage.querySelector(
            'div[class*="usertext-body"] > p'
          );

          if (firstParagraph != null) {
            return firstParagraph.innerText.trim();
          } else {
            return null;
          }
        });
      }

      results.push({
        title,
        content,
        image,
        date,
        popularity,
        platform,
      });
    }

    return results;
  },
};

module.exports = self;

But an error occurs: Error: Evaluation failed: TypeError: Cannot read property 'querySelector' of undefined

Can anyone point out what I am doing wrong?

Thanks!

【Question Comments】:

  • contentPage is undefined.
  • @RobertHarvey But I did define it in let contentPage = await browser.newPage();
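
Both comments are right in a sense. The outer contentPage is defined, but the callback passed to evaluate declares its own parameter with the same name, and nothing is passed for it, so inside the callback it is undefined. A minimal sketch of that shadowing in plain JavaScript, with no Puppeteer involved; fakeEvaluate is a hypothetical stand-in that only mimics how the callback is invoked:

```javascript
// Hypothetical stand-in for page.evaluate(fn, ...args): it calls fn with
// whatever extra arguments were given. The question's code gives none.
function fakeEvaluate(fn, ...args) {
  return fn(...args);
}

const contentPage = { name: "outer contentPage" };

// The parameter named contentPage shadows the outer constant.
// No argument is supplied for it, so inside the callback it is undefined.
const seenInside = fakeEvaluate((contentPage) => typeof contentPage);

// Without the parameter, plain JavaScript sees the outer variable.
// (In real Puppeteer the callback runs in the browser page, so even this
// closure over a Node.js variable would not be available there.)
const seenWithoutParam = fakeEvaluate(() => typeof contentPage);

console.log(seenInside);       // prints: undefined
console.log(seenWithoutParam); // prints: object
```
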

Tags: javascript node.js web-scraping puppeteer


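【Solution 1】:

page.evaluate essentially executes code in the context of the browser, i.e. the same thing you would type into the browser's developer console to get the same result. The callback's contentPage parameter is undefined there because no argument is passed to evaluate for it, and it shadows the Node.js variable of the same name. So in this case you probably want to use document.querySelector() instead of the reference to the undefined contentPage: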

let firstParagraph = document.querySelector(
  'div[class*="usertext-body"] > p'
);
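
Put together, the fix might look like the sketch below. It assumes browser is the Puppeteer Browser from the question. The helper name scrapeFirstParagraph, passing the selector as an argument to evaluate, and closing the tab afterwards are illustrative additions, not part of the original code.

```javascript
// Sketch: scrape the first paragraph of one content page.
// `browser` is expected to be a Puppeteer Browser, e.g. from puppeteer.launch().
// scrapeFirstParagraph is a hypothetical helper name used for illustration.
async function scrapeFirstParagraph(browser, url) {
  const contentPage = await browser.newPage();
  try {
    await contentPage.goto(url, { waitUntil: "networkidle0" });

    // The callback runs inside the browser page, so it uses `document`.
    // Values from Node.js have to be passed as extra arguments to evaluate();
    // here the selector string is passed that way.
    return await contentPage.evaluate((selector) => {
      const firstParagraph = document.querySelector(selector);
      return firstParagraph ? firstParagraph.innerText.trim() : null;
    }, 'div[class*="usertext-body"] > p');
  } finally {
    // Assumption: close the tab so that one tab per post does not pile up.
    await contentPage.close();
  }
}
```

Inside the loop from the question, the body of the if branch could then become content = await scrapeFirstParagraph(browser, "https://old.reddit.com" + content_URL);. Taking browser as a parameter also makes it possible to try the function with a stub object instead of a real browser.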

【Discussion】:
