Web Scraping with Node JS, Puppeteer and Cheerio: Answers

【Question title】: Web Scraping with Node JS, Puppeteer and Cheerio
【Posted】: 2022-02-13 13:21:13
【Question】:

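I am trying to do some web scraping. My goal is to find every heading tag (<h1> to <h6>) together with the text that directly follows it. For example:

<div>
 <h2>heading 1</h2>
 <p>text 1</p>
 <h3>heading 2</h3>
 <p>text 2</p>
</div>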

The result should be:

[
    {
        "title": "heading 1",
        "tag": "h2",
        "text": "text 1"
    },
    {
        "title": "heading 2",
        "tag": "h3",
        "text": "text 2"
    }
]

My solution:

// $ is assumed to be a Cheerio instance loaded with the page HTML
let headings = $('h1, h2, h3, h4, h5, h6');
let result = { content: [] };
// assumed to start at 0; the original snippet does not show its declaration
let numberOfImage = 0;
for (let i = 0; i < headings.length; i++) {
    let newObj = {};
    // prepare the title from the heading text
    // (#cleanText is a private helper of the enclosing class, not shown here)
    newObj.title = this.#cleanText($(headings[i]).text());

    // tag name of the heading
    newObj.tag = $(headings[i])[0].name;

    // count, then strip, images between this heading and the next one
    let images = $(headings[i]).nextUntil($(headings[i + 1])).find('img');
    if (images.length > 0) {
        numberOfImage++;
    }
    images.remove();

    // get the HTML markup as a string between the current heading and the next one
    if (i === headings.length - 1) {
        newObj.text = $(headings[i]).nextAll().toString();
    } else {
        newObj.text = $(headings[i]).nextUntil($(headings[i + 1])).toString();
    }
    result.content.push(newObj);
}

As you can see, this works when all the heading tags are siblings, but I cannot handle nested headings, for example:

<div>
 <div>
  <h3>heading 1</h3>
  <p>text 1</p>
  <h2>heading 2</h2>
 </div>
 <p>text 2</p>
 <h3>heading 3</h3>
 <p>text 3</p>
 <div>
  <h3>heading 2</h3>
  <p>text 2</p>
 </div>
</div>

I need the result to look like this:

[
    {
        "title": "heading 1",
        "tag": "h2",
        "text": "text 1"
    },
    {
        "title": "heading 2",
        "tag": "h3",
        "text": "text 2"
    },
    {
        "title": "heading 3",
        "tag": "h3",
        "text": "text 3"
    }
]

It would be very helpful if anyone could provide a solution for this.

【Comments】:

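Your markup and expected output contain no <img>, yet your code goes to some trouble to handle those tags. Are you sure the sample markup isn't oversimplified? What does it actually look like?
I'm not collecting any images or figures, only content inside tags such as p, li and span.
Just walk the DOM with $('*').each(element => addCodeHere). Check whether the element's tagName is in the array ['H1', 'H2', 'H3', 'H4', 'H5', 'H6']. When you hit a heading element, check previousElementSibling and/or nextElementSibling to see whether it is a p tag.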
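
A document-order walk along the lines of the last comment can be sketched as follows. This is only a minimal illustration under some assumptions, not a definitive implementation: the nodes are hand-built plain objects shaped like the ones Cheerio's parser produces (`type`, lowercase `name`, `children`, and `data` for text), and the helpers `el`, `txt`, `textOf` and `collectSections` are hypothetical names introduced here, not Cheerio APIs. Each heading starts a new section, and any text met afterwards is attached to the most recent heading regardless of nesting depth.

```javascript
// Sketch: pair every heading with the text that follows it in document
// order, even when headings sit at different nesting depths.
// Assumption: nodes look like { type: 'tag', name, children } for elements
// and { type: 'text', data } for text, similar to Cheerio's parsed nodes.

const HEADINGS = new Set(['h1', 'h2', 'h3', 'h4', 'h5', 'h6']);

// Hypothetical helpers for building a small tree by hand.
const el = (name, ...children) => ({ type: 'tag', name, children });
const txt = (data) => ({ type: 'text', data });

// Concatenate all text found beneath a node.
function textOf(node) {
  if (node.type === 'text') return node.data;
  return (node.children || []).map(textOf).join('');
}

function collectSections(root) {
  const sections = [];
  let current = null; // section opened by the most recently seen heading

  (function walk(node) {
    if (node.type === 'text') {
      // Text is attached to the latest heading; text before any heading is skipped.
      if (current) current.parts.push(node.data.trim());
      return;
    }
    if (node.type === 'tag' && HEADINGS.has(node.name)) {
      current = { title: textOf(node).trim(), tag: node.name, parts: [] };
      sections.push(current);
      return; // the heading's own text is the title, so do not descend
    }
    (node.children || []).forEach(walk);
  })(root);

  return sections.map(({ title, tag, parts }) => ({
    title,
    tag,
    text: parts.filter(Boolean).join(' '),
  }));
}

// Nested markup similar to the question's second example.
const tree = el('div',
  el('div',
    el('h3', txt('heading 1')),
    el('p', txt('text 1')),
    el('h2', txt('heading 2'))),
  el('p', txt('text 2')),
  el('h3', txt('heading 3')),
  el('p', txt('text 3')));

const result = collectSections(tree);
console.log(JSON.stringify(result, null, 2));
```

With a real page, the root node would presumably come from the loaded document (for instance `$.root()[0]`), provided the parsed nodes have the shape assumed above. Here "text 2" ends up under "heading 2" even though the two are not siblings, which is the case a sibling-only approach such as nextUntil misses.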

Tags: javascript node.js web-scraping puppeteer cheerio

【Answer 1】:

It should look something like:

$('h1, h2, h3, h4, h5, h6').get().map(h => {
  return {
    title: $(h).text(),
    text: $(h).find('~ p').first().text()
  }
})

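I'm using ~, the following-sibling selector (only supported in newer versions of cheerio).

【Comments】:

Thanks, but this still doesn't return any text under the first h tag. I have edited the question, so you can now see what I'm expecting.
Right, if there is no following p sibling, it will be empty.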