【Posted】: 2019-09-21 20:04:07
【Question】:
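I have been searching and trying to get this working for a few weeks, but I still can't. I have tried several approaches, including regular expressions. I have a dynamic table that I scrape with puppeteer, and I am trying to output that data as JSON. The problem is that headings such as "2nd flr (Rm 226)" and "Location - Room 115" may or may not appear, and each of those rooms may have one or more events. How can I convert dynamic data like this and make sure everything is listed?
I am trying to get JSON like this: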
{
"data": [
{
"location": "2nd flr (Rm 226)",
"time": "10:00 AM",
"description": "Social Security Administration Commissioner",
"document": "18",
"type": "Social Security Hearing",
"blank": " ",
"order": "Hearing"
},
{
"location": "2nd flr (Rm 226)",
"time": "01:00 PM",
"description": "Social Security Administration Commissioner",
"document": "18",
"type": "Order Setting Social Security Hearing",
"blank": " ",
"order": "Hearing"
},
{
"location": "3rd flr (100)",
"time": "01:00 PM",
"description": "Social Security Administration Commissioner",
"document": "18",
"type": "Order Setting Social Security Hearing",
"blank": " ",
"order": "Hearing"
}
]
}
const data = Array.from(
document.querySelectorAll('#content > table > tbody > tr'),
row => Array.from(row.querySelectorAll('td'), cell => cell.innerText)
)
This is the output I get.
{
"data": [
[
"2nd flr (Rm 226)"
],
[
"10:00 AM",
"Social Security Administration Commissioner",
"18",
"Social Security Hearing",
" ",
"Hearing"
],
[
"01:00 PM",
"Social Security Administration Commissioner",
"18",
"Order Setting Social Security Hearing",
" ",
"Hearing"
],
[
"3rd flr (100)"
],
[
"09:30 AM",
"TERMINATED on 03/23/2015",
"34",
"Resetting Hearings",
" ",
"Hearing"
],
[
" ",
"Reserved for case",
"23",
"Motion Hearing",
" ",
"Hearing"
],
[
"01:00 PM",
"Case Information",
"19",
"Order Setting",
" ",
"Hearing"
],
[
"01:30 PM",
"Case information",
"31",
"Order Setting",
" ",
"Hearing"
],
[
" ",
"TERMINATED on 06/14/2019",
"16",
"Order Setting/Resetting Hearings",
" ",
"Hearing"
],
[
"3rd flr (Rm 310)"
],
[
"01:30 PM",
"Insurance Company",
"122",
"Order Setting/Resetting Hearings",
" ",
"Hearing"
]
]
}
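One way to get from this array-of-arrays output to the desired objects is to walk the rows in order and remember the most recent heading. The sketch below assumes that a row with exactly one cell is a room heading and a row with six cells is an event. The function name `rowsToEvents` and the `FIELDS` list are illustrative and not part of the original code; the field names come from the desired JSON above.

```javascript
// Field names taken from the desired JSON in the question, in column order.
const FIELDS = ['time', 'description', 'document', 'type', 'blank', 'order'];

// rows: an array of string arrays, shaped like the "data" output above.
function rowsToEvents(rows) {
  const events = [];
  let location = null; // most recent room heading seen so far (null before the first one)

  for (const cells of rows) {
    if (cells.length === 1) {
      // Assumption: a single-cell row is a room heading.
      location = cells[0].trim();
      continue;
    }
    if (cells.length !== FIELDS.length) continue; // skip rows of unexpected shape

    const event = { location };
    FIELDS.forEach((name, i) => { event[name] = cells[i]; });
    events.push(event);
  }
  return events;
}

// Sample rows copied from the output above.
const sample = [
  ['2nd flr (Rm 226)'],
  ['10:00 AM', 'Social Security Administration Commissioner', '18', 'Social Security Hearing', ' ', 'Hearing'],
  ['3rd flr (100)'],
  ['09:30 AM', 'TERMINATED on 03/23/2015', '34', 'Resetting Hearings', ' ', 'Hearing'],
];

console.log(JSON.stringify({ data: rowsToEvents(sample) }, null, 2));
```

Because the heading is carried forward rather than looked up, a room with several events or only one event is handled the same way. Rows with a blank time are kept as they are, since the question does not say how they should be treated.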
<center><Table border=1 width=98%>
<TR><TD id='report' class='report' align=center><B><FONT SIZE=+2>Daily Calendar Report of 09/23/2019</font></B><BR><CENTER></table></center>
<Table border=1 width=98% >
<TR><TD class='room' id='room' ALIGN=CENTER COLSPAN=6><STRONG>2nd flr (Rm 226)</STRONG></TD></TR>
<TR id='casedata' class='casedata'>
<TD class=case-0 id=case-0 VALIGN=top NOWRAP>10:00 AM</TD>
<TD class=case-1 id=case-1 VALIGN=top><A HREF=/Reportpt.pl?55244>Social Security Administration</A><B></B></TD>
<TD class=case-2 id=case-2 VALIGN=top>18</TD>
<TD class=case-3 id=case-3 VALIGN=top>Security Hearing</TD>
<TD class=case-4 id=case-4 VALIGN=top> </TD>
<TD class=case-5 id=case-5 VALIGN=top NOWRAP><I>Hearing</I></TD>
</TR>
<TR><TD class='room' id='room' ALIGN=CENTER COLSPAN=6><STRONG>2nd flr (Rm 406)</STRONG></TD></TR>
<TR id='casedata' class='casedata'>
<TD class=case-0 id=case-0 VALIGN=top NOWRAP>1:30 PM</TD>
<TD class=case-1 id=case-1 VALIGN=top><A HREF=/Reportpt.pl?55244>Social Security Administration</A><B></B></TD>
<TD class=case-2 id=case-2 VALIGN=top>18</TD>
<TD class=case-3 id=case-3 VALIGN=top>Security Hearing</TD>
<TD class=case-4 id=case-4 VALIGN=top> </TD>
<TD class=case-5 id=case-5 VALIGN=top NOWRAP><I>Hearing</I></TD>
</TR>
</table>
const tds = Array.from(document.querySelectorAll('#Content > table > tbody > tr > td'));
const trs = Array.from(document.querySelectorAll('#Content > table > tbody > tr'))
const data = Array.from(
document.querySelectorAll('#Content > table > tbody > tr'),
row => Array.from(row.querySelectorAll('td'), cell => cell.innerText),
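// Note: the third argument of Array.from is thisArg, not a second mapper, so this callback is never called.
data => { return [data]; }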
)
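The HTML above marks heading cells with `class='room'`, so the grouping could also be done inside the page instead of on the flattened arrays. The following is a minimal sketch under that assumption. `extractEvents` is a hypothetical helper, the `'tr'` selector is deliberately broad (a narrower one such as the selector used in the question could be substituted), and `textContent` is used here in place of `innerText`.

```javascript
// Intended to run in the page context, e.g. via page.evaluate(extractEvents).
// `doc` defaults to the page's global document; a stand-in object can be passed for testing.
function extractEvents(doc = document) {
  // Field names follow the desired JSON in the question, in column order.
  // Declared inside the function so they survive serialization by page.evaluate.
  const fields = ['time', 'description', 'document', 'type', 'blank', 'order'];
  const events = [];
  let location = null; // most recent room heading seen so far

  for (const row of doc.querySelectorAll('tr')) {
    // Assumption: a heading row contains a cell with class "room", as in the posted HTML.
    const room = row.querySelector('td.room');
    if (room) {
      location = room.textContent.trim();
      continue;
    }

    // trim() drops padding whitespace, so the blank column becomes an empty string.
    const cells = Array.from(row.querySelectorAll('td'), td => td.textContent.trim());
    if (cells.length !== fields.length) continue; // e.g. the report title row

    const event = { location };
    fields.forEach((name, i) => { event[name] = cells[i]; });
    events.push(event);
  }
  return events;
}
```

With Puppeteer, the function would be passed to the page as `const data = await page.evaluate(extractEvents);`, which returns the array of plain objects to Node, ready for `JSON.stringify`.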
【Comments】:
- It looks like innerText does not always give you what you want. It would help if you could provide the HTML you are trying to scrape.
Tags: javascript web-scraping puppeteer