Puppeteer error: Cannot read property 'getProperty' of undefined while scraping white pages (answers)

【Question title】: Puppeteer Error, Cannot read property 'getProperty' of undefined while scraping white pages
【Posted】: 2020-01-18 08:17:22
【Question description】:

I'm trying to scrape an address from whitepages.com, but my scraper throws this error every time I run it.

(node:11389) UnhandledPromiseRejectionWarning: TypeError: Cannot read property 'getProperty' of undefined

Here is my code:

const puppeteer = require('puppeteer')

async function scrapeAddress(url){
    const browser = await puppeteer.launch();

    const page = await browser.newPage();
    await page.goto(url,{timeout: 0, waitUntil: 'networkidle0'});

    const [el]= await page.$x('//*[@id="left"]/div/div[4]/div[3]/div[2]/a/h3/span[1]');
    // console.log(el)
    const txt = await el.getProperty('textContent');
    const rawTxt = await txt.jsonValue(); 

    console.log({rawTxt}); 

    browser.close();

}

scrapeAddress('https://www.whitepages.com/business/CA/San-Diego/Cvs-Health/b-1ahg5bs')

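After some investigation, I realized that the el variable comes back as undefined, and I don't know why. I've used the same code to get elements from other sites, and I only get this error on this particular site.

I've tried both the full and the short XPath, as well as other nearby elements, but everything on this site triggers the error.

Why is this happening, and is there a way to fix it?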

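【Question discussion】:

Tags: javascript web-scraping puppeteer

【Solution 1】:

You can try wrapping everything in a try/catch block, or otherwise use then() to unwrap the promise, so that the rejection is handled instead of surfacing as an UnhandledPromiseRejectionWarning.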

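const puppeteer = require('puppeteer');

// URL taken from the question
const url = 'https://www.whitepages.com/business/CA/San-Diego/Cvs-Health/b-1ahg5bs';

(async () => {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url, {timeout: 0, waitUntil: 'networkidle0'});

    const [el] = await page.$x('//*[@id="left"]/div/div[4]/div[3]/div[2]/a/h3/span[1]');
    // Throws a TypeError if the XPath matched nothing and el is undefined
    const txt = await el.getProperty('textContent');
    const rawTxt = await txt.jsonValue();

    console.log({rawTxt});

  } catch (err) {
    // Report the failure instead of leaving the rejection unhandled
    console.error(err.message);
  } finally {
    // Always close the browser, even after an error
    await browser.close();
  }
})();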
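
Note that try/catch only turns the crash into a logged message: el is still undefined whenever page.$x matches nothing, because destructuring an empty array yields undefined. A minimal sketch of an explicit guard follows. The helper name readTextByXPath is hypothetical, and the sketch assumes a Puppeteer version in which page.$x is available.

```javascript
// Hypothetical helper: read the text of the first node matching an XPath,
// failing with a descriptive error when nothing matches.
async function readTextByXPath(page, xpath) {
  // page.$x resolves to an array of element handles; it is empty on no match.
  const [el] = await page.$x(xpath);
  if (!el) {
    throw new Error(`No element matched XPath: ${xpath}`);
  }
  const prop = await el.getProperty('textContent');
  return prop.jsonValue();
}
```

Called as `await readTextByXPath(page, '//*[@id="left"]/div/div[4]/div[3]/div[2]/a/h3/span[1]')`, this would report which XPath failed rather than the less helpful "Cannot read property 'getProperty' of undefined".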

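【Discussion】:

【Solution 2】:

The cause is that the site detects Puppeteer as an automated bot. Set headless to false and you can see that it never navigates to the site.

I suggest using puppeteer-extra-plugin-stealth. Also, always make sure you wait for the element to appear in the page.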

const puppeteer = require('puppeteer-extra');
const pluginStealth = require('puppeteer-extra-plugin-stealth');
puppeteer.use(pluginStealth());

async function scrapeAddress(url){
    const browser = await puppeteer.launch();

    const page = await browser.newPage();
    await page.goto(url,{waitUntil: 'networkidle0'});

    //wait for xpath
    await page.waitForXPath('//*[@id="left"]/div/div[4]/div[3]/div[2]/a/h3/span[1]');
    const [el]= await page.$x('//*[@id="left"]/div/div[4]/div[3]/div[2]/a/h3/span[1]');
    // console.log(el)
    const txt = await el.getProperty('textContent');
    const rawTxt = await txt.jsonValue(); 

    console.log({rawTxt}); 

    await browser.close();

}

scrapeAddress('https://www.whitepages.com/business/CA/San-Diego/Cvs-Health/b-1ahg5bs')
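
The wait step above depends on page.waitForXPath, which may not exist in every Puppeteer release. The sketch below is written under an assumption worth checking against the documentation for your installed version: that newer releases dropped page.waitForXPath in favour of page.waitForSelector with an "xpath/" prefix. The helper name waitForXPathCompat and the 30000 ms default are hypothetical.

```javascript
// Hypothetical helper: wait for an XPath match using whichever API the
// installed Puppeteer version exposes.
async function waitForXPathCompat(page, xpath, timeout = 30000) {
  if (typeof page.waitForXPath === 'function') {
    // Older releases: dedicated XPath method.
    return page.waitForXPath(xpath, { timeout });
  }
  // Assumed newer API: the XPath goes through waitForSelector with an "xpath/" prefix.
  return page.waitForSelector(`xpath/${xpath}`, { timeout });
}
```

Either call rejects if the element never appears within the timeout, which gives a clearer failure than a TypeError on an undefined element.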

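【Discussion】:

Why declare and initialize the const "puppeteer" twice? Which one should be used?
No problem. If you don't mind taking a look, I posted a question related to this issue on my page; perhaps you have something to add, since you seem to know this plugin.
It ran the first time. Awesome.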

【Solution 3】:

I ran into this error recently, and changing my XPath fixed it for me. I had been using a full XPath, and it was causing problems.
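
A full XPath encodes every ancestor of the element, so any layout change can break it. One way to act on this is to try a shorter expression first and fall back to longer ones. This is only a sketch: the helper name firstMatch is hypothetical, and it assumes page.$x is available in your Puppeteer version.

```javascript
// Hypothetical helper: return the first element handle matched by any of the
// candidate XPaths, trying them in order; resolves to null if none match.
async function firstMatch(page, xpaths) {
  for (const xpath of xpaths) {
    const [el] = await page.$x(xpath);
    if (el) return el;
  }
  return null;
}
```

For example, `await firstMatch(page, ['//*[@id="left"]//h3/span[1]', '//*[@id="left"]/div/div[4]/div[3]/div[2]/a/h3/span[1]'])`, where the first, shorter expression is an illustrative guess rather than something verified against the site.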

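【Discussion】:

【Solution 4】:

Most likely this is because the site is responsive, so when the scraper runs, the page is laid out differently and the element ends up at a different XPath.

I suggest debugging with headless mode turned off, so you can watch the browser window: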

const browser = await puppeteer.launch({headless: false});
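
If the layout really does depend on the window size, fixing the viewport should make headless and visible runs render the same breakpoint. A minimal sketch, in which buildLaunchOptions is a hypothetical helper and the 1366x768 size and 100 ms delay are arbitrary assumptions; headless, slowMo, and defaultViewport are options accepted by puppeteer.launch().

```javascript
// Hypothetical helper: build the options object passed to puppeteer.launch().
function buildLaunchOptions(debug) {
  return {
    headless: !debug,                              // show the browser window while debugging
    slowMo: debug ? 100 : 0,                       // delay each operation by 100 ms when debugging
    defaultViewport: { width: 1366, height: 768 }, // same page size in every run
  };
}
```

Usage: `const browser = await puppeteer.launch(buildLaunchOptions(true));`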

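【Discussion】:

【Solution 5】:

I took the code @mbit provided and adapted it to my needs, running the browser with headless mode turned off. I couldn't get it to work in headless mode. If anyone figures out how to do that, please explain. Here is my solution:

First you need to install a couple of packages from the terminal (bash), so run the following two commands: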

npm install puppeteer-extra
npm install puppeteer-extra-plugin-stealth

Installing these lets you run the first few lines of @mbit's code. Then, in this line of code:

 const browser = await puppeteer.launch();

pass the following as the argument to puppeteer.launch():

{headless: false}

It should look like this:

const browser = await puppeteer.launch({headless: false});

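I also believe the path @mbit used may no longer exist, so supply your own XPath and site. You can do that in the following 3 lines of code: replace {XPath} with your own XPath and {address} with your own URL. Note: pay attention to your use of quotes ('' or ""), because the XPath may contain the same quote character you normally use to delimit strings, which will break your path.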

await page.waitForXPath({XPath});
const [el]= await page.$x({XPath});

scrapeAddress({address})
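
The quoting issue can be illustrated with a short sketch. The XPath below is a shortened, illustrative one: because it contains double quotes, delimiting the JavaScript string with single quotes or backticks avoids escaping each inner quote.

```javascript
// The XPath contains double quotes, so delimit the JavaScript string with
// single quotes or backticks instead of escaping every inner quote.
const withSingle = '//*[@id="root"]/div[1]';
const withBackticks = `//*[@id="root"]/div[1]`;
const escaped = "//*[@id=\"root\"]/div[1]"; // works, but is harder to read

// All three spellings produce the same string.
console.log(withSingle === withBackticks && withBackticks === escaped); // true
```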

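Once you've done this, you should be able to run the code and retrieve the value. Here's what my code ended up looking like; feel free to copy and paste it into your own file to confirm that it works for you!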

const puppeteer = require('puppeteer-extra');
const pluginStealth = require('puppeteer-extra-plugin-stealth');
puppeteer.use(pluginStealth());

// Keep using the puppeteer-extra instance: re-requiring plain 'puppeteer' here would bypass the stealth plugin

async function scrapeAddress(url){
    const browser = await puppeteer.launch({headless: false});

    const page = await browser.newPage();
    await page.goto(url,{waitUntil: 'networkidle0'});

    //wait for xpath
    await page.waitForXPath('//*[@id="root"]/div[1]/div[2]/div[2]/div[9]/div/div/div/div[3]/div[2]/div[3]/div[3]');
    const [el]= await page.$x('//*[@id="root"]/div[1]/div[2]/div[2]/div[9]/div/div/div/div[3]/div[2]/div[3]/div[3]');
    
    const txt = await el.getProperty('textContent');
    const rawTxt = await txt.jsonValue(); 

    console.log({rawTxt}); 

    await browser.close();
}

scrapeAddress("https://stockx.com/air-jordan-1-retro-high-unc-leather")

【Discussion】: