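【Question title】:Dynamic array to JSON or Table to JSON
【Posted】:2019-09-21 20:04:07
【Question description】:

I have been searching and trying to get this working for several weeks, but I am still failing. I have tried several approaches, including regular expressions. I have a dynamic table that I scrape with puppeteer, and I am trying to output that data as JSON. The problem is that headings such as "2nd flr (Rm 226)" and "Location - Room 115" may or may not appear, and each of these rooms may have one or more events. How do I convert dynamic data like this and make sure everything gets listed?

I am trying to get JSON something like this.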

data: [
  {
    "location": "2nd flr (Rm 226)",
    "time": "10:00 AM",
    "description": "Social Security Administration Commissioner",
    "document": "18",
    "type": "Social Security Hearing",
    "blank": " ",
    "order": "Hearing"
  },
  {
    "location": "2nd flr (Rm 226)",
    "time": "01:00 PM",
    "description": "Social Security Administration Commissioner",
    "document": "18",
    "type": "Order Setting Social Security Hearing",
    "blank": " ",
    "order": "Hearing"
  },
  {
    "location": "3rd flr (100)",
    "time": "01:00 PM",
    "description": "Social Security Administration Commissioner",
    "document": "18",
    "type": "Order Setting Social Security Hearing",
    "blank": " ",
    "order": "Hearing"
  }
]


const data = Array.from(
  document.querySelectorAll('#content > table > tbody > tr'),
  row => Array.from(row.querySelectorAll('td'), cell => cell.innerText)
)

This is the output I get.

{
  "data": [
    [
      "2nd flr (Rm 226)"
    ],
    [
      "10:00 AM",
      "Social Security Administration Commissioner",
      "18",
      "Social Security Hearing",
      " ",
      "Hearing"
    ],
    [
      "01:00 PM",
      "Social Security Administration Commissioner",
      "18",
      "Order Setting Social Security Hearing",
      " ",
      "Hearing"
    ],
    [
      "3rd flr (100)"
    ],
    [
      "09:30 AM",
      "TERMINATED on 03/23/2015",
      "34",
      "Resetting Hearings",
      " ",
      "Hearing"
    ],
    [
      " ",
      "Reserved for case",
      "23",
      "Motion Hearing",
      " ",
      "Hearing"
    ],
    [
      "01:00 PM",
      "Case Information",
      "19",
      "Order Setting",
      " ",
      "Hearing"
    ],
    [
      "01:30 PM",
      "Case information",
      "31",
      "Order Setting",
      " ",
      "Hearing"
    ],
    [
      " ",
      "TERMINATED on 06/14/2019",
      "16",
      "Order Setting/Resetting Hearings",
      " ",
      "Hearing"
    ],
    [
      "3rd flr (Rm 310)"
    ],
    [
      "01:30 PM",
      "Insurance Company",
      "122",
      "Order Setting/Resetting Hearings",
      " ",
      "Hearing"
    ]
  ]
}
<center><Table border=1 width=98%>
<TR><TD id='report' class='report' align=center><B><FONT SIZE=+2>Daily Calendar Report of 09/23/2019</font></B><BR><CENTER></table></center>
<Table border=1 width=98%   >

<TR><TD class='room' id='room' ALIGN=CENTER COLSPAN=6><STRONG>2nd flr (Rm 226)</STRONG></TD></TR>
<TR id='casedata' class='casedata'>
<TD class=case-0 id=case-0 VALIGN=top NOWRAP>10:00 AM</TD>
<TD class=case-1 id=case-1 VALIGN=top><A HREF=/Reportpt.pl?55244>Social Security Administration</A><B></B></TD>
<TD class=case-2 id=case-2 VALIGN=top>18</TD>
<TD class=case-3 id=case-3 VALIGN=top>Security Hearing</TD>
<TD class=case-4 id=case-4 VALIGN=top>&nbsp</TD>
<TD class=case-5 id=case-5 VALIGN=top NOWRAP><I>Hearing</I></TD>
</TR>

<TR><TD class='room' id='room' ALIGN=CENTER COLSPAN=6><STRONG>2nd flr (Rm 406)</STRONG></TD></TR>
<TR id='casedata' class='casedata'>
<TD class=case-0 id=case-0 VALIGN=top NOWRAP>1:30 PM</TD>
<TD class=case-1 id=case-1 VALIGN=top><A HREF=/Reportpt.pl?55244>Social Security Administration</A><B></B></TD>
<TD class=case-2 id=case-2 VALIGN=top>18</TD>
<TD class=case-3 id=case-3 VALIGN=top>Security Hearing</TD>
<TD class=case-4 id=case-4 VALIGN=top>&nbsp</TD>
<TD class=case-5 id=case-5 VALIGN=top NOWRAP><I>Hearing</I></TD>
</TR>
</table>
const tds = Array.from(document.querySelectorAll('#Content > table > tbody > tr > td'));
const trs = Array.from(document.querySelectorAll('#Content > table > tbody > tr'))

const data = Array.from(
      document.querySelectorAll('#Content > table > tbody > tr'),
      row => Array.from(row.querySelectorAll('td'), cell => cell.innerText),
      data =>{ return ( [data] ) }
    )

【Question comments】:

  • It looks like innerText does not always get you what you want. It would help if you could provide the HTML you are trying to scrape.

Tags: javascript web-scraping puppeteer


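【Solution 1】:

You need to look for more clues in the HTML to get the structure you want. In this case, I look at the class of the first td in each tr.

[[ I know you are only reading the HTML, but it is invalid, because the same id (room) is assigned more than once ...]]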

// define shortcut function qsa: querySelectorAll, returning a proper array
// An HTML context `el` can be given as an optional second parameter
function qsa(s,el){
 return Array.prototype.map.call((el?Element:Document).prototype
             .querySelectorAll.call((el||document),s),function(e){return e})
}
var data = [];
var currentRoom = ''; // text of the most recent room header row
qsa('tr').forEach(function (tr) {
  var tds = qsa('td', tr);
  if (!tds.length) return; // skip rows without <td> cells
  if (tds[0].className == 'room')
    currentRoom = tds[0].innerText; // "remember" the current room
  else if (tds[0].className == 'case-0')
    data.push([currentRoom].concat(tds.map(function (e) { return e.innerText; }))); // output room and row data
});

console.log(data)

// and, of course, the JSON is created by
var JSONdata=JSON.stringify(data);

This solution may look a bit dated (no Array.from, no arrow functions). I wrote it that way so that it still works in IE.
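If IE support is not needed, the same "remember the current room" idea can be applied to the array of arrays the question already produces, with no DOM access at all. The following is only a sketch under stated assumptions: `rowsToRecords` and `FIELDS` are hypothetical names, the field names are copied from the JSON shape the question asks for, and a row with exactly one cell is assumed to be a room header.

```javascript
// Field names taken from the JSON shape requested in the question.
const FIELDS = ['time', 'description', 'document', 'type', 'blank', 'order'];

// rows: array of arrays of cell text, as produced by the Array.from scrape in the question.
// Assumption: a row with exactly one cell is a room header; longer rows are events.
function rowsToRecords(rows) {
  const records = [];
  let location = null; // most recent room header seen so far
  for (const cells of rows) {
    if (cells.length === 0) continue; // nothing to read in this row
    if (cells.length === 1) {
      location = cells[0].trim();
      continue;
    }
    const record = { location };
    FIELDS.forEach((name, i) => {
      record[name] = cells[i];
    });
    records.push(record);
  }
  return records;
}

// Sample rows copied from the output shown in the question.
const sample = [
  ['2nd flr (Rm 226)'],
  ['10:00 AM', 'Social Security Administration Commissioner', '18', 'Social Security Hearing', ' ', 'Hearing'],
  ['01:00 PM', 'Social Security Administration Commissioner', '18', 'Order Setting Social Security Hearing', ' ', 'Hearing'],
  ['3rd flr (100)'],
  ['09:30 AM', 'TERMINATED on 03/23/2015', '34', 'Resetting Hearings', ' ', 'Hearing'],
];

const records = rowsToRecords(sample);
console.log(JSON.stringify({ data: records }, null, 2));
```

Because the function only sees plain arrays, it can run in Node after the puppeteer scrape returns, and it does not care how many rooms or events appear on a given day.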

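【Comments】:

    【Solution 2】:

    I think your title is not quite right. Getting JSON is easy: just use JSON.stringify. Your problem is getting the object you want to convert into a consistent shape, or at least the shape you want. It looks like the code is producing a lot of arrays rather than an array of objects.

    So I think you will have to do more work parsing the HTML. Before converting to JSON, I would console.log the objects built from the HTML to check them.

    So you can read each <td> explicitly, rather than in a loop, and assign each value, or a default, to an object.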
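One way to read each cell explicitly and fall back to a default is sketched below. The helper names `cellText` and `toEvent` are hypothetical, the column positions are assumed from the sample HTML (time, description, document, type, an empty column, order), and the choice of `null` for a missing time and an empty string for other missing fields is just one possible convention.

```javascript
// Hypothetical helper: collapse whitespace (including non-breaking spaces) and
// return the fallback when the cell is missing or empty.
function cellText(cells, index, fallback) {
  const raw = cells[index];
  if (raw === undefined || raw === null) return fallback;
  const text = String(raw).replace(/\s+/g, ' ').trim();
  return text === '' ? fallback : text;
}

// Build one event object by reading each column explicitly.
// Assumption: index 4 is the always-empty column, so it is skipped here.
function toEvent(cells, location) {
  return {
    location: location,
    time: cellText(cells, 0, null),
    description: cellText(cells, 1, ''),
    document: cellText(cells, 2, ''),
    type: cellText(cells, 3, ''),
    order: cellText(cells, 5, ''),
  };
}

// A complete row and a short row with a blank time cell.
const full = toEvent(
  ['01:30 PM', 'Insurance Company', '122', 'Order Setting/Resetting Hearings', '\u00a0', 'Hearing'],
  '3rd flr (Rm 310)'
);
const short = toEvent([' ', 'Reserved for case'], '3rd flr (100)');
console.log(JSON.stringify([full, short], null, 2));
```

Every object ends up with the same keys regardless of how many cells a row actually had, which keeps the resulting JSON consistent.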

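    【Comments】:

    • If I don't know how many tds there will be on any given day, how can I read each TD explicitly?
    • Did you console.log all the tds and trs? That is your best debugging clue. Your example shows consistent tds in the HTML.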