【Question title】: Scrape Table With Merged Header
【Posted】: 2023-01-08 04:55:30
【Question】:

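I have a question about scraping a table with a merged header using cheerio in Node.js. The merged header row holds the value I want to group the rows by. I can already scrape the data rows without the header.

Here is the HTML for the table: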

 <div class="wrap">
   <table class="tbl">
     <tr class="head">
       <td colspan="6" style="background-color:#656968">Monday</td>
     </tr>
     <tr class="head">
       <td class="center" width="20%">Code</td>
       <td class="center" width="40%">Title</td>
       <td class="center" width="20%">Price</td>
       <td class="center last" width="20%">Status</td>
     </tr>
     <tr class="td1">
       <td class="center">Code 1</td>
       <td class="center">Name 1</td>
       <td class="center">1.234</td>
       <td class="center last">
         <span class="green">Closed</span>
       </td>
     </tr>
   </table>
   <table class="tbl">
     <tr class="head">
       <td colspan="6" style="background-color:#656968">Tuesday</td>
     </tr>
     <tr class="head">
       <td class="center" width="20%">Code</td>
       <td class="center" width="40%">Title</td>
       <td class="center" width="20%">Price</td>
       <td class="center last" width="20%">Status</td>
     </tr>
     <tr class="td1">
       <td class="center">Code 1</td>
       <td class="center">Name 1</td>
       <td class="center">1.234</td>
       <td class="center last">
         <span class="green">Closed</span>
       </td>
     </tr>
   </table>
   <table class="tbl">
     <tr class="head">
       <td colspan="6" style="background-color:#656968">Wednesday</td>
     </tr>
     <tr class="head">
       <td class="center" width="20%">Code</td>
       <td class="center" width="40%">Title</td>
       <td class="center" width="20%">Price</td>
       <td class="center last" width="20%">Status</td>
     </tr>
     <tr class="td1">
       <td class="center">Code 1</td>
       <td class="center">Name 1</td>
       <td class="center">1.234</td>
       <td class="center last">
         <span class="green">Closed</span>
       </td>
     </tr>
     <tr class="td2">
       <td class="center">Code 1</td>
       <td class="center">Name 1</td>
       <td class="center">1.234</td>
       <td class="center last">
         <span class="green">Closed</span>
       </td>
     </tr>
     <tr class="td1">
       <td class="center">Code 1</td>
       <td class="center">Name 1</td>
       <td class="center">1.234</td>
       <td class="center last">
         <span class="green">Closed</span>
       </td>
     </tr>
   </table>
   <table class="tbl">
     <tr class="head">
       <td colspan="6" style="background-color:#656968">Thursday</td>
     </tr>
     <tr class="head">
       <td class="center" width="20%">Code</td>
       <td class="center" width="40%">Title</td>
       <td class="center" width="20%">Price</td>
       <td class="center last" width="20%">Status</td>
     </tr>
     <tr class="td1">
       <td class="center">Code 1</td>
       <td class="center">Name 1</td>
       <td class="center">1.234</td>
       <td class="center last">
         <span class="green">Closed</span>
       </td>
     </tr>
   </table>
 </div>

Here is my cheerio code:

// Per-column arrays, declared here so the snippet is self-contained
const code = [], title = [], price = [], status = [];

const sel = "tr.td1, tr.td2";
$(sel).each(function (i, e) {
  // Push each cell of the current row into its column array
  $(this).find("td:first").each(function (i, e) {
    code.push({
      code: $(this).text().trim()
    })
  });
  $(this).find("td:eq(1)").each(function (i, e) {
    title.push({
      title: $(this).text().trim()
    })
  });
  $(this).find("td:eq(2)").each(function (i, e) {
    price.push({
      price: $(this).text().trim()
    })
  });
  $(this).find("td:eq(3)").each(function (i, e) {
    status.push({
      status: $(this).text().trim()
    })
  });
}); // end of the row loop

// Merge the column arrays back into one object per row
const merged = [];
for (let i = 0; i < code.length; i++) {
  merged.push({
    ...code[i],
    ...title[i],
    ...price[i],
    ...status[i]
  })
}

This works, and I get the array I expect, which looks like this:

[
  {
    "code": "Code 1",
    "title": "Name 1",
    "price": "1.234",
    "status": "Closed"
  },
  {
    "code": "Code 1",
    "title": "Name 1",
    "price": "1.234",
    "status": "Closed"
  },
  {
    "code": "Code 1",
    "title": "Name 1",
    "price": "1.234",
    "status": "Closed"
  }
]

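What I need is for each object in the JSON to also carry the day value, which sits in the merged header row. The final result I need looks like this: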

[
  {
    "code": "Code 1",
    "title": "Name 1",
    "price": "1.234",
    "status": "Closed",
    "group": "Monday"
  },
 {
    "code": "Code 1",
    "title": "Name 1",
    "price": "1.234",
    "status": "Closed",
    "group": "Monday"
  },
 {
    "code": "Code 1",
    "title": "Name 1",
    "price": "1.234",
    "status": "Closed",
    "group": "Monday"
  },
 {
    "code": "Code 1",
    "title": "Name 1",
    "price": "1.234",
    "status": "Closed",
    "group": "Tuesday"
  },
 {
    "code": "Code 1",
    "title": "Name 1",
    "price": "1.234",
    "status": "Closed",
    "group": "Tuesday"
  },
 {
    "code": "Code 1",
    "title": "Name 1",
    "price": "1.234",
    "status": "Closed",
    "group": "Tuesday"
  }
]

【Question comments】:

    Tags: javascript node.js web-scraping cheerio


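    【Solution 1】:

    Rather than starting at the bottom and trying to work back up to the parent group, loop over the parents, then grab their children with the required grouping information already in hand. You can then create a nested structure organized by group, or flatten it to your expected result: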

    const cheerio = require("cheerio"); // 1.0.0-rc.12
    
    const html = `<Your HTML from the question>`;
    const headers = ["code", "title", "price", "status"];
    const $ = cheerio.load(html);
    // Walk each table (the parent), then build one object per data row
    const data = [...$(".tbl")].flatMap(table =>
      [...$(table).find(".td1, .td2")].map(row => ({
        // Pair each cell's text with the matching key from `headers`
        ...Object.fromEntries([...$(row).find("td")].map((e, i) =>
          [headers[i], $(e).text().trim()]
        )),
        // The first .head row of the table holds the merged day header
        group: $(table).find(".head").first().text().trim(),
      }))
    );
    console.log(data);
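
If the nested, per-group shape is preferred over the flat list, one possible way to get it is to regroup the flat rows afterwards. The following is a minimal sketch in plain JavaScript, not part of the original answer: `groupRows` is a hypothetical helper name, and the sample `flat` array uses illustrative values (including a made-up "Code 2" row) shaped like the flattened result.

```javascript
// Hypothetical helper: groups flat row objects by their "group" field.
function groupRows(rows) {
  const byGroup = new Map(); // a Map keeps groups in first-seen order
  for (const { group, ...fields } of rows) {
    if (!byGroup.has(group)) {
      byGroup.set(group, []);
    }
    // Store the row without its "group" key, since the parent carries it
    byGroup.get(group).push(fields);
  }
  // Convert the Map into an array of { group, rows } objects
  return [...byGroup].map(([group, rows]) => ({ group, rows }));
}

// Illustrative input shaped like the flattened scrape result (assumed values)
const flat = [
  { code: "Code 1", title: "Name 1", price: "1.234", status: "Closed", group: "Monday" },
  { code: "Code 1", title: "Name 1", price: "1.234", status: "Closed", group: "Tuesday" },
  { code: "Code 2", title: "Name 2", price: "5.678", status: "Closed", group: "Tuesday" },
];

const nested = groupRows(flat);
console.log(JSON.stringify(nested, null, 2));
```

Grouping after the fact keeps the scraping step unchanged. The same shape could presumably also be built directly in the table loop by mapping each table to an object with its header text and its rows.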
    

    【Comments】:
