NodeJS - Read HTML Head Tags (Answers)

【Question title】: NodeJS - Read HTML Head Tags
【Posted】: 2019-05-08 08:32:09
【Question】:

In my Node.js application I want to scrape an HTML page and build a list of the tags inside its head. For example:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Document</title>
    <link rel="stylesheet" href="style.css">
    <link rel="shortcut icon" href="favicon.ico" type="image/x-icon">
    <script src="script.src"></script>
</head>
<body>
    ...
</body>
</html>

Expected output:

['<meta charset="UTF-8">','<meta name="viewport" content="width=device-width, initial-scale=1.0">','<title>Document</title>', ...etc]

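But I'm a bit stuck: meta tags are never "closed", so this takes more than a simple regex and split. I would like to use DOMParser, but I'm in a Node environment. I tried the xmldom npm package, but it only returned a list of line breaks (\r\n).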

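【Comments】:

Do you need the output in exactly that format, or do you just want a logical collection of the tags that you can work with in some way?
@CertainPerformance Right, it doesn't need to be in that format. But I want to iterate over it and be able to read the tag names and their attributes. I only need to read it, not manipulate the DOM.

Tags: html node.js web-scraping domparser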

【Solution 1】:

One option is to parse the HTML with Cheerio and extract the information you need from each element:

const cheerio = require('cheerio');
const htmlStr = `<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Document</title>
    <link rel="stylesheet" href="style.css">
    <link rel="shortcut icon" href="favicon.ico" type="image/x-icon">
    <script src="script.src"></script>
</head>
<body>
    ...
</body>
</html>`;
const $ = cheerio.load(htmlStr);
const headTags = [];
$('head > *').each((_, elm) => {
  headTags.push({ name: elm.name, attribs: elm.attribs, text: $(elm).text() });
});
console.log(headTags);

Output:

[ { name: 'meta', attribs: { charset: 'UTF-8' }, text: '' },
  { name: 'meta',
    attribs:
     { name: 'viewport',
       content: 'width=device-width, initial-scale=1.0' },
    text: '' },
  { name: 'title', attribs: {}, text: 'Document' },
  { name: 'link',
    attribs: { rel: 'stylesheet', href: 'style.css' },
    text: '' },
  { name: 'link',
    attribs:
     { rel: 'shortcut icon',
       href: 'favicon.ico',
       type: 'image/x-icon' },
    text: '' },
  { name: 'script', attribs: { src: 'script.src' }, text: '' } ]


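【Solution 2】:

Use the request npm package to fetch your page. Once you have the response, use the cheerio npm package to parse it and pull whatever you need out of the raw data.

Note: cheerio's syntax is similar to jQuery's.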

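const express = require('express'); // assumption: `app` below is an Express app
const request = require('request');
const cheerio = require('cheerio');

const app = express();

app.get('/scrape', (req, res) => {
  // Fetch the page, then parse the returned HTML with cheerio.
  request('---your website url to scrape here ---', (error, response, body) => {
    if (error) {
      return res.status(500).send(error.message);
    }
    const $ = cheerio.load(body.toString());
    // Serialize every direct child of <head> back into one HTML string.
    const headContents = $('head').children().toString();
    console.log('headContents', headContents);
    res.send(headContents);
  });
});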
