Extract html table immediately following specified text

【Question title】: Extract html table immediately following specified text
【Posted】: 2015-11-06 23:48:08
【Question】:

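I am trying to scrape an html table from a web page. However, the page contains many html tables that I do not want to scrape. To identify the table I want, I would like to take the first table that follows a specific word combination (the word combination is not inside the table but part of the surrounding text). Here is an example: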

This is the table I am interested in:

library(XML)
url <- "http://www.sec.gov/Archives/edgar/data/1301063/000119312514133663/0001193125-14-133663.txt"
# The table of interest happens to be the 29th table on the page
readHTMLTable(url, trim = TRUE, header = FALSE, stringsAsFactors = FALSE)[29]

The criterion I want to use to detect the table is that it is the first table following this word combination:

"safety, health, environmental and sustainability challenges"

library(RCurl)  # provides getURL()
html <- getURL(url, followlocation = TRUE)
doc <- htmlParse(html, asText = TRUE)
# Collect all visible text nodes, skipping script, style, noscript and form content
text <- xpathSApply(doc, "//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)][not(ancestor::form)]", xmlValue)
grep("safety, health, environmental and sustainability challenges", text, value = TRUE)

【Comments】:

Tags: html r xpath web-scraping html-table

【Solution 1】:

I think this is what you are looking for:

xpathSApply(doc,'//text()[contains(.,"safety, health, environmental and sustainability challenges")]/following::table[1]');
## <table cellspacing="0" cellpadding="0" width="100%" border="0" style="BORDER-COLLAPSE:COLLAPSE" align="center">
##   <tr><td width="48%"/>
## <td valign="bottom" width="12%"/>
## <td/>
## <td/>
## <td/>
## <td valign="bottom" width="12%"/>
## <td/>
## <td/>
## <td/>
## <td valign="bottom" width="12%"/>
## <td/>
## <td/>
## <td/>
## <td valign="bottom" width="12%"/>
## <td/>
## <td/>
## <td/></tr>
##   <tr><td valign="bottom" nowrap="nowrap" align="center" style="border-bottom:1px solid #000000"> <p style="margin-top:0px;margin-bottom:1px" align="center"><font style="font-family:Times New Roman" size="1"><b>Name</b></font></p></td>
## <td valign="bottom"><font size="1">  </font></td>
## <td valign="bottom" colspan="2" nowrap="nowrap" align="center" style="border-bottom:1px solid #000000"><font style="font-family:Times New Roman" size="1"><b>Audit<br/>Committee</b></font></td>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom" colspan="2" nowrap="nowrap" align="center" style="border-bottom:1px solid #000000"><font style="font-family:Times New Roman" size="1"><b>Compensation<br/>Committee</b></font></td>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom" colspan="2" nowrap="nowrap" align="center" style="border-bottom:1px solid #000000"><font style="font-family:Times New Roman" size="1"><b>Nominating and<br/>Corporate<br/>Governance<br/>Committee</b></font></td>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom" colspan="2" nowrap="nowrap" align="center" style="border-bottom:1px solid #000000"><font style="font-family:Times New Roman" size="1"><b>Safety, Health,<br/>Environmental and<br/>Sustainability<br/>Committee</b></font></td>
## <td valign="bottom"><font size="1"> </font></td></tr>
##   <tr bgcolor="#cceeff"><td valign="top"> <p style="margin-left:1.00em; text-indent:-1.00em"><font style="font-family:Times New Roman" size="2">Kevin S. Crutchfield</font><font style="font-family:Times New Roman" size="1"><sup style="vertical-align:baseline; position:relative; bottom:.8ex">(1)</sup></font><font style="font-family:Times New Roman" size="2"/></p></td>
## <td valign="bottom"><font size="1">  </font></td>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"><font style="font-family:Times New Roman" size="2"> </font></td>
## <td valign="bottom" align="right"><font style="font-family:Times New Roman" size="2">X</font></td>
## <td nowrap="nowrap" valign="bottom"><font style="font-family:Times New Roman" size="2">  </font></td></tr>
##   <tr><td valign="top"> <p style="margin-left:1.00em; text-indent:-1.00em"><font style="font-family:Times New Roman" size="2">Angelo C. Brisimitzakis</font></p></td>
## <td valign="bottom"><font size="1">  </font></td>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"><font style="font-family:Times New Roman" size="2"> </font></td>
## <td valign="bottom" align="right"><font style="font-family:Times New Roman" size="2">X</font></td>
## <td nowrap="nowrap" valign="bottom"><font style="font-family:Times New Roman" size="2">  </font></td>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"><font style="font-family:Times New Roman" size="2"> </font></td>
## <td valign="bottom" align="right"><font style="font-family:Times New Roman" size="2">X</font></td>
## <td nowrap="nowrap" valign="bottom"><font style="font-family:Times New Roman" size="2">  </font></td>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"><font style="font-family:Times New Roman" size="2"> </font></td>
## <td valign="bottom" align="right"><font style="font-family:Times New Roman" size="2">X</font></td>
## <td nowrap="nowrap" valign="bottom"><font style="font-family:Times New Roman" size="2">  </font></td></tr>
##   <tr bgcolor="#cceeff"><td valign="top"> <p style="margin-left:1.00em; text-indent:-1.00em"><font style="font-family:Times New Roman" size="2">William J. Crowley, Jr.</font></p></td>
## <td valign="bottom"><font size="1">  </font></td>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"><font style="font-family:Times New Roman" size="2"> </font></td>
## <td valign="bottom" align="right"><font style="font-family:Times New Roman" size="2">X</font></td>
## <td nowrap="nowrap" valign="bottom"><font style="font-family:Times New Roman" size="2">  </font></td></tr>
##   <tr><td valign="top"> <p style="margin-left:1.00em; text-indent:-1.00em"><font style="font-family:Times New Roman" size="2">E. Linn Draper, Jr.</font></p></td>
## <td valign="bottom"><font size="1">  </font></td>
## <td valign="bottom"><font style="font-family:Times New Roman" size="2"> </font></td>
## <td valign="bottom" align="right"><font style="font-family:Times New Roman" size="2">X</font></td>
## <td nowrap="nowrap" valign="bottom"><font style="font-family:Times New Roman" size="2">  </font></td>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"><font style="font-family:Times New Roman" size="2"> </font></td>
## <td valign="bottom" align="right"><font style="font-family:Times New Roman" size="2">X</font></td>
## <td nowrap="nowrap" valign="bottom"><font style="font-family:Times New Roman" size="2">  </font></td>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"/></tr>
##   <tr bgcolor="#cceeff"><td valign="top"> <p style="margin-left:1.00em; text-indent:-1.00em"><font style="font-family:Times New Roman" size="2">Glenn A. Eisenberg</font><font style="font-family:Times New Roman" size="1"><sup style="vertical-align:baseline; position:relative; bottom:.8ex">(2)</sup></font><font style="font-family:Times New Roman" size="2"/></p></td>
## <td valign="bottom"><font size="1">  </font></td>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"><font style="font-family:Times New Roman" size="2"/><font style="font-family:Times New Roman" size="1"><sup style="vertical-align:baseline; position:relative; bottom:.8ex"/></font><font style="font-family:Times New Roman" size="2"> </font></td>
## <td valign="bottom" align="right"><font style="font-family:Times New Roman" size="2">X</font><font style="font-family:Times New Roman" size="1"><sup style="vertical-align:baseline; position:relative; bottom:.8ex"/></font><font style="font-family:Times New Roman" size="2"/></td>
## <td nowrap="nowrap" valign="bottom"><font style="font-family:Times New Roman" size="2"/><font style="font-family:Times New Roman" size="1"><sup style="vertical-align:baseline; position:relative; bottom:.8ex">(3)</sup></font><font style="font-family:Times New Roman" size="2"> </font></td>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"><font style="font-family:Times New Roman" size="2"> </font></td>
## <td valign="bottom" align="right"><font style="font-family:Times New Roman" size="2">X</font></td>
## <td nowrap="nowrap" valign="bottom"><font style="font-family:Times New Roman" size="2">  </font></td></tr>
##   <tr><td valign="top"> <p style="margin-left:1.00em; text-indent:-1.00em"><font style="font-family:Times New Roman" size="2">Deborah M. Fretz</font></p></td>
## <td valign="bottom"><font size="1">  </font></td>
## <td valign="bottom"><font style="font-family:Times New Roman" size="2"/><font style="font-family:Times New Roman" size="1"><sup style="vertical-align:baseline; position:relative; bottom:.8ex"/></font><font style="font-family:Times New Roman" size="2"> </font></td>
## <td valign="bottom" align="right"><font style="font-family:Times New Roman" size="2">X</font><font style="font-family:Times New Roman" size="1"><sup style="vertical-align:baseline; position:relative; bottom:.8ex"/></font><font style="font-family:Times New Roman" size="2"/></td>
## <td nowrap="nowrap" valign="bottom"><font style="font-family:Times New Roman" size="2"/><font style="font-family:Times New Roman" size="1"><sup style="vertical-align:baseline; position:relative; bottom:.8ex">(3)</sup></font><font style="font-family:Times New Roman" size="2"> </font></td>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"><font style="font-family:Times New Roman" size="2"> </font></td>
## <td valign="bottom" align="right"><font style="font-family:Times New Roman" size="2">X</font></td>
## <td nowrap="nowrap" valign="bottom"><font style="font-family:Times New Roman" size="2">  </font></td>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"/></tr>
##   <tr bgcolor="#cceeff"><td valign="top"> <p style="margin-left:1.00em; text-indent:-1.00em"><font style="font-family:Times New Roman" size="2">P. Michael Giftos</font></p></td>
## <td valign="bottom"><font size="1">  </font></td>
## <td valign="bottom"><font style="font-family:Times New Roman" size="2"> </font></td>
## <td valign="bottom" align="right"><font style="font-family:Times New Roman" size="2">X</font></td>
## <td nowrap="nowrap" valign="bottom"><font style="font-family:Times New Roman" size="2">  </font></td>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"><font style="font-family:Times New Roman" size="2"> </font></td>
## <td valign="bottom" align="right"><font style="font-family:Times New Roman" size="2">X</font></td>
## <td nowrap="nowrap" valign="bottom"><font style="font-family:Times New Roman" size="2">  </font></td>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"><font style="font-family:Times New Roman" size="2"/><font style="font-family:Times New Roman" size="1"><sup style="vertical-align:baseline; position:relative; bottom:.8ex"/></font><font style="font-family:Times New Roman" size="2"> </font></td>
## <td valign="bottom" align="right"><font style="font-family:Times New Roman" size="2">X</font><font style="font-family:Times New Roman" size="1"><sup style="vertical-align:baseline; position:relative; bottom:.8ex"/></font><font style="font-family:Times New Roman" size="2"/></td>
## <td nowrap="nowrap" valign="bottom"><font style="font-family:Times New Roman" size="2"/><font style="font-family:Times New Roman" size="1"><sup style="vertical-align:baseline; position:relative; bottom:.8ex">(3)</sup></font><font style="font-family:Times New Roman" size="2"> </font></td></tr>
##   <tr><td valign="top"> <p style="margin-left:1.00em; text-indent:-1.00em"><font style="font-family:Times New Roman" size="2">L. Patrick Hassey</font></p></td>
## <td valign="bottom"><font size="1">  </font></td>
## <td valign="bottom"><font style="font-family:Times New Roman" size="2"> </font></td>
## <td valign="bottom" align="right"><font style="font-family:Times New Roman" size="2">X</font></td>
## <td nowrap="nowrap" valign="bottom"><font style="font-family:Times New Roman" size="2">  </font></td>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"><font style="font-family:Times New Roman" size="2"> </font></td>
## <td valign="bottom" align="right"><font style="font-family:Times New Roman" size="2">X</font></td>
## <td nowrap="nowrap" valign="bottom"><font style="font-family:Times New Roman" size="2">  </font></td>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"/></tr>
##   <tr bgcolor="#cceeff"><td valign="top"> <p style="margin-left:1.00em; text-indent:-1.00em"><font style="font-family:Times New Roman" size="2">Joel Richards, III</font></p></td>
## <td valign="bottom"><font size="1">  </font></td>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"><font style="font-family:Times New Roman" size="2"/><font style="font-family:Times New Roman" size="1"><sup style="vertical-align:baseline; position:relative; bottom:.8ex"/></font><font style="font-family:Times New Roman" size="2"> </font></td>
## <td valign="bottom" align="right"><font style="font-family:Times New Roman" size="2">X</font><font style="font-family:Times New Roman" size="1"><sup style="vertical-align:baseline; position:relative; bottom:.8ex"/></font><font style="font-family:Times New Roman" size="2"/></td>
## <td nowrap="nowrap" valign="bottom"><font style="font-family:Times New Roman" size="2"/><font style="font-family:Times New Roman" size="1"><sup style="vertical-align:baseline; position:relative; bottom:.8ex">(3)</sup></font><font style="font-family:Times New Roman" size="2"> </font></td>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"><font style="font-family:Times New Roman" size="2"> </font></td>
## <td valign="bottom" align="right"><font style="font-family:Times New Roman" size="2">X</font></td>
## <td nowrap="nowrap" valign="bottom"><font style="font-family:Times New Roman" size="2">  </font></td></tr>
## </table>
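
The XPath above selects every text node containing the phrase and then, for each one, walks the `following` axis to the first `table` element that comes after it in document order. As a minimal sketch of how the matched node could be turned into a data frame, the example below runs the same expression on a small made-up page (the inline HTML, the variable names and the "A. Example" row are assumptions for illustration, not part of the SEC filing) and passes the resulting node to `readHTMLTable`.

```r
library(XML)

# Hypothetical miniature page: the marker phrase sits between two tables,
# and the wanted table is nested inside a <div>.
html <- '<html><body>
<table><tr><td>skip</td><td>me</td></tr></table>
<p>The committee oversees safety, health, environmental and sustainability challenges.</p>
<div><table><tr><td>Name</td><td>Member</td></tr><tr><td>A. Example</td><td>X</td></tr></table></div>
<table><tr><td>later</td><td>table</td></tr></table>
</body></html>'

doc <- htmlParse(html, asText = TRUE)

phrase <- "safety, health, environmental and sustainability challenges"
xp <- sprintf('//text()[contains(., "%s")]/following::table[1]', phrase)

# getNodeSet returns a list with one table node per matching text node
nodes <- getNodeSet(doc, xp)

# readHTMLTable also accepts a single table node and returns a data frame
tbl <- readHTMLTable(nodes[[1]], header = FALSE, stringsAsFactors = FALSE)
print(tbl)
```

Two caveats apply to this sketch. `contains()` is case-sensitive, and XPath 1.0 has no lower-casing function, so a case-insensitive match would need `translate()`. The phrase also has to sit inside a single text node: if the source splits it across tags, the expression finds nothing.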

【Comments】:

Works like a charm. Thanks!