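[Posted]: 2015-04-18 16:22:52
[Question]:
I'm looking for help with using selector patterns in jsoup. Basically, I'm modifying someone else's code to fit my needs.
For example, for href it is done like this: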
Elements links = doc.select("a[href]");
for (Element link : links) {
    // get the value from the href attribute
    System.out.println("\nlink : " + link.attr("href"));
    System.out.println("text : " + link.text());
}
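I'm referring to the documentation here, but I'm not sure which selector to use: http://jsoup.org/apidocs/org/jsoup/select/Selector.html
I want to find header/value pairs such as "Running Map Tasks, 1".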
<hr>
<h2>Cluster Summary (Heap Size is 555 MB/26.6 GB)</h2>
<table border="1" cellpadding="5" cellspacing="0">
<tr><th>Running Map Tasks</th><th>Running Reduce Tasks</th><th>Total Submissions</th><th>Nodes</th><th>Occupied Map Slots</th><th>Occupied Reduce Slots</th><th>Reserved Map Slots</th><th>Reserved Reduce Slots</th><th>Map Task Capacity</th><th>Reduce Task Capacity</th><th>Avg. Tasks/Node</th><th>Blacklisted Nodes</th><th>Excluded Nodes</th><th>MapTask Prefetch Capacity</th></tr>
<tr><td>1</td><td>0</td><td>5576</td><td><a href="machines.jsp?type=active">8</a></td><td>1</td><td>0</td><td>0</td><td>0</td><td>352</td><td>128</td><td>60.00</td><td><a href="machines.jsp?type=blacklisted">0</a></td><td><a href="machines.jsp?type=excluded">0</a></td><td>0</td></tr></table>
<br>
<hr>
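How do I get the text in all of these tags?
I should also be able to look for a heading such as "Cluster Summary", so that I can use it for the rest of my URLs or handle each one accordingly.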
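To make the goal concrete, here is a minimal sketch of what I am after, not a confirmed solution. It assumes the summary table directly follows an h2 whose text contains "Cluster Summary", and that the table has one header row and one value row. The class name ClusterSummary and the method parse are made up for illustration; the HTML in main is a shortened copy of the sample above.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper class, not part of jsoup.
class ClusterSummary {

    // Pairs each <th> text with the <td> text in the same column position.
    static Map<String, String> parse(String html) {
        Document doc = Jsoup.parse(html);
        Map<String, String> summary = new LinkedHashMap<>();

        // "h2:contains(...) + table" selects the table that immediately
        // follows the matching heading (assumption about the page layout).
        Element table = doc.select("h2:contains(Cluster Summary) + table").first();
        if (table == null) {
            return summary; // heading or table not found
        }

        Elements headers = table.select("th");
        Elements values = table.select("td");
        int n = Math.min(headers.size(), values.size());
        for (int i = 0; i < n; i++) {
            // text() also returns the text of nested elements such as <a>
            summary.put(headers.get(i).text(), values.get(i).text());
        }
        return summary;
    }

    public static void main(String[] args) {
        String html = "<hr><h2>Cluster Summary (Heap Size is 555 MB/26.6 GB)</h2>"
                + "<table border=\"1\" cellpadding=\"5\" cellspacing=\"0\">"
                + "<tr><th>Running Map Tasks</th><th>Running Reduce Tasks</th><th>Nodes</th></tr>"
                + "<tr><td>1</td><td>0</td><td><a href=\"machines.jsp?type=active\">8</a></td></tr>"
                + "</table>";
        parse(html).forEach((name, value) -> System.out.println(name + ", " + value));
    }
}
```

Pairing by column index only works while the table keeps a single value row; a table with several data rows would need a loop over the rows instead.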
<h2 id="running_jobs">Running Jobs</h2>
<table border="1" cellpadding="5" cellspacing="0">
<thead><tr><th><b>Jobid</b></th><th><b>Priority</b></th><th><b>User</b></th><th><b>Name</b></th><th><b>Start Time</b></th><th><b>Map % Complete</b></th><th><b>Current Map Slots</b></th><th><b>Failed MapAttempts</b></th><th><b>MapAttempt Time Avg/Max</b></th><th><b>Cumulative Map CPU</b></th><th><b>Current Map PMem</b></th><th><b>Reduce % Complete</b></th><th><b>Current Reduce Slots</b></th><th><b>FailedReduce Attempts</b></th><th><b>ReduceAttempt Time Avg/Max</b></th><th><b>Cumulative Reduce CPU</b></th><th><b>Current Reduce PMem</b></th></tr>
</thead><tbody><tr><td id="job_0"><a href="jobdetails.jsp?jobid=job_201502130313_1511&refresh=30">job_201502130313_1511</a></td><td id="priority_0">NORMAL</td><td id="user_0">vdeadmin</td><td id="name_0">streamjob1942665573586845283.jar</td><td>Fri Feb 13 17:00:17 PST 2015</td><td>0.00%<table border="1px" width="80px"><tr><td cellspacing="0" class="perc_nonfilled" width="100%"></td></tr></table></td><td><a href="jobtasks.jsp?jobid=job_201502130313_1511&type=map&pagenum=1&state=running">1</a></td><td>0</td><td>0sec/0sec</td><td>1hrs, 30mins, 4sec</td><td>703.48 MB</td><td>0.00%<table border="1px" width="80px"><tr><td cellspacing="0" class="perc_nonfilled" width="100%"></td></tr></table></td><td>0</td><td>0</td><td>0sec/0sec</td><td>0sec</td><td> 0 KB</td></tr>
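Update/addition to the question: my URLs will return long HTML, and I should be able to search within a specific group. I mean the search should go block by block... I don't want to find every tr and th in the HTML, only those that belong to one particular table, and so on. For example, below I am trying to show only the results for id="running_jobs", and then some other set afterwards. While doing so, I should not get results from other parts of the HTML.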
<h2 id="running_jobs">Running Jobs</h2>
<table border="1" cellpadding="5" cellspacing="0">
<thead><tr><th><b>Jobid</b></th><th><b>Priority</b></th><th><b>User</b></th><th><b>Name</b></th><th><b>Start Time</b></th><th><b>Map % Complete</b></th><th><b>Current Map Slots</b></th><th><b>Failed MapAttempts</b></th><th><b>MapAttempt Time Avg/Max</b></th><th><b>Cumulative Map CPU</b></th><th><b>Current Map PMem</b></th><th><b>Reduce % Complete</b></th><th><b>Current Reduce Slots</b></th><th><b>FailedReduce Attempts</b></th><th><b>ReduceAttempt Time Avg/Max</b></th><th><b>Cumulative Reduce CPU</b></th><th><b>Current Reduce PMem</b></th></tr>
</thead><tbody><tr><td id="job_0"><a href="jobdetails.jsp?jobid=job_201502130313_1511&refresh=30">job_201502130313_1511</a></td><td id="priority_0">NORMAL</td><td id="user_0">vdeadmin</td><td id="name_0">streamjob1942665573586845283.jar</td><td>Fri Feb 13 17:00:17 PST 2015</td><td>0.00%<table border="1px" width="80px"><tr><td cellspacing="0" class="perc_nonfilled" width="100%"></td></tr></table></td><td><a href="jobtasks.jsp?jobid=job_201502130313_1511&type=map&pagenum=1&state=running">1</a></td><td>0</td><td>0sec/0sec</td><td>1hrs, 30mins, 4sec</td><td>703.48 MB</td><td>0.00%<table border="1px" width="80px"><tr><td cellspacing="0" class="perc_nonfilled" width="100%"></td></tr></table></td><td>0</td><td>0</td><td>0sec/0sec</td><td>0sec</td><td> 0 KB</td></tr>
</tbody></table>
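A rough sketch of the block-by-block lookup I have in mind, under the assumption that each table directly follows its h2 and that the heading id (here running_jobs) is stable across pages. RunningJobs, headers, and rows are hypothetical names. The child combinators (>) are there so that the small progress-bar tables nested inside the cells do not show up as extra rows.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of jsoup.
class RunningJobs {

    // Finds the table that immediately follows <h2 id="running_jobs">, or null.
    static Element table(String html) {
        return Jsoup.parse(html).select("h2#running_jobs + table").first();
    }

    // Column names taken from the header row of that table only.
    static List<String> headers(String html) {
        List<String> names = new ArrayList<>();
        Element table = table(html);
        if (table == null) return names;
        for (Element th : table.select("> thead > tr > th")) {
            names.add(th.text());
        }
        return names;
    }

    // One list of cell texts per job row; nested tables are not treated as rows.
    static List<List<String>> rows(String html) {
        List<List<String>> result = new ArrayList<>();
        Element table = table(html);
        if (table == null) return result;
        for (Element tr : table.select("> tbody > tr")) {
            List<String> cells = new ArrayList<>();
            for (Element td : tr.select("> td")) {
                cells.add(td.text());
            }
            result.add(cells);
        }
        return result;
    }
}
```

Other blocks could be read the same way by swapping the heading selector, for example an h2 with a different id or an h2:contains(...) match when the heading has no id.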
[Discussion]:
Tags: css-selectors html-parsing jsoup