How to get all links from the DOM?

【Question title】: How to get all links from the DOM?
【Posted】: 2018-09-04 15:08:15
【Question description】:

According to https://github.com/GoogleChrome/puppeteer/issues/628, I should be able to get all links with:

const hrefs = await page.$$eval('a', a => a.href);

But when I try a simple:

console.log(hrefs)

I only get:

http://example.de/index.html

... as output, which suggests that it finds only 1 link? But the page definitely has 12 links in its source code/DOM. Why doesn't it find all of them?

Minimal example:

'use strict';
const puppeteer = require('puppeteer');

crawlPage();

function crawlPage() {
    (async () => {

        const args = [
            "--disable-setuid-sandbox",
            "--no-sandbox",
            "--blink-settings=imagesEnabled=false",
        ];
        const options = {
            args,
            headless: true,
            ignoreHTTPSErrors: true,
        };

        const browser = await puppeteer.launch(options);
        const page = await browser.newPage();
        await page.goto("http://example.de", {
            waitUntil: 'networkidle2',
            timeout: 30000
        });

        const hrefs = await page.$eval('a', a => a.href);
        console.log(hrefs);

        await page.close();
        await browser.close();

    })().catch((error) => {
        console.error(error);
    });

}

【Question comments】:

Tags: javascript node.js web-crawler puppeteer headless-browser

【Solution 1】:

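In your example code you are using page.$eval, not page.$$eval. Since the former uses document.querySelector rather than document.querySelectorAll, the behavior you describe is expected: only the first matching element is evaluated.

You should also change the pageFunction you pass to $$eval, because it receives an array of elements rather than a single element: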

const hrefs = await page.$$eval('a', as => as.map(a => a.href));
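
To see why the original call returned a single URL, the difference between the two methods can be modeled without a browser. The sketch below is only an illustration: evalFirst and evalAll are hypothetical stand-ins for page.$eval and page.$$eval, and the anchor objects and their URLs are made up.

```javascript
// Plain objects stand in for the <a> elements of a page (made-up URLs).
const anchorsInPage = [
  { href: 'http://example.de/index.html' },
  { href: 'http://example.de/about.html' },
  { href: 'http://example.de/contact.html' },
];

// Hypothetical model of page.$eval: like document.querySelector, it hands
// only the first match to the page function.
function evalFirst(elements, pageFunction) {
  if (elements.length === 0) {
    throw new Error('no element matches the selector');
  }
  return pageFunction(elements[0]);
}

// Hypothetical model of page.$$eval: like document.querySelectorAll, it hands
// every match to the page function, as one array.
function evalAll(elements, pageFunction) {
  return pageFunction(Array.from(elements));
}

const firstHref = evalFirst(anchorsInPage, a => a.href);
const allHrefs = evalAll(anchorsInPage, as => as.map(a => a.href));

console.log(firstHref);       // http://example.de/index.html
console.log(allHrefs.length); // 3
```

However many anchors the page contains, the single-match variant can only ever yield one href, which matches the output shown in the question.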

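【Comments】:

If I use page.$$eval I get "undefined" as output.
Thanks a lot, that works. Does that mean the code example on the GitHub page is wrong?
It is "wrong" in the sense that it does not work that way now, but if you re-read the issue you will see that the sample code was more of a "proof of concept" of how $$eval would work once implemented (it has since been implemented, but works slightly differently).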

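【Solution 2】:

The page.$$eval() method runs Array.from(document.querySelectorAll(selector)) within the page and passes the result as the first argument to the page function.

Since a in your example represents an array, you either need to specify which element of the array to take the href from, or map all of the href attributes to an array.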
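
This also accounts for the "undefined" reported in the comments on the first solution: an array has no href property of its own. A minimal sketch in plain JavaScript, with made-up link objects in place of real anchor elements, puts the failing page function next to the two fixes:

```javascript
// Made-up stand-ins for what $$eval passes in: an array of anchor elements.
const links = [
  { href: 'http://example.de/a.html' },
  { href: 'http://example.de/b.html' },
];

// The page function from the question treats the array as a single element.
const fromQuestion = (a => a.href)(links);

// Fix 1: pick one element of the array.
const firstOnly = (as => as[0].href)(links);

// Fix 2: map every element to its href.
const everyHref = (as => as.map(a => a.href))(links);

console.log(fromQuestion); // undefined
console.log(firstOnly);    // http://example.de/a.html
console.log(everyHref);    // an array with both URLs
```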

page.$$eval()

const hrefs = await page.$$eval('a', links => links.map(a => a.href));

Alternatively, you can use page.evaluate(), or a combination of page.$$(), elementHandle.getProperty(), and jsHandle.jsonValue(), to obtain an array of all links on the page.

page.evaluate()

const hrefs = await page.evaluate(() => {
  return Array.from(document.getElementsByTagName('a'), a => a.href);
});

page.$$() / elementHandle.getProperty() / jsHandle.jsonValue()

const hrefs = await Promise.all((await page.$$('a')).map(async a => {
  return await (await a.getProperty('href')).jsonValue();
}));
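
The one-liner above packs three calls into a single expression. Unrolled, it might look like the sketch below. hrefsFromHandles is a hypothetical helper name, and the dispose() calls are an addition that is not part of the answer; they release each handle once its value has been read.

```javascript
// Hypothetical helper: takes the element handles returned by page.$$('a')
// and resolves each one to the string value of its href property.
async function hrefsFromHandles(handles) {
  return Promise.all(handles.map(async handle => {
    // getProperty() returns a handle to the property, not the value itself.
    const property = await handle.getProperty('href');
    // jsonValue() serializes the in-page value and transfers it to Node.js.
    const href = await property.jsonValue();
    // Addition to the answer: release handles that are no longer needed.
    await property.dispose();
    await handle.dispose();
    return href;
  }));
}
```

With a Puppeteer page it would be called as const hrefs = await hrefsFromHandles(await page.$$('a'));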

【Comments】: