【Question title】:Extract HTML table immediately following specified text
【Posted】:2015-11-06 23:48:08
【Question】:

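I am trying to scrape an HTML table from a web page. However, the page contains many HTML tables that I do not want to scrape. To identify the table I want, I would like to take the first table that follows a specific combination of words (the words are not inside the table; they are part of the surrounding text). Here is an example: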

This is the table I am interested in:

library(XML)
url <- "http://www.sec.gov/Archives/edgar/data/1301063/000119312514133663/0001193125-14-133663.txt"
readHTMLTable(url, trim = TRUE, header = FALSE, stringsAsFactors = FALSE)[29]

The criterion I would like to use to detect the table is that it is the first table following this combination of words:

"safety, health, environmental and sustainability challenges"

library(RCurl)  # provides getURL()
html <- getURL(url, followlocation = TRUE)
doc <- htmlParse(html, asText = TRUE)
text <- xpathSApply(doc, "//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)][not(ancestor::form)]", xmlValue)
grep("safety, health, environmental and sustainability challenges", text, value = TRUE)

【Comments】:

    Tags: html r xpath web-scraping html-table


    【Solution 1】:

    I think this is what you're looking for:

    xpathSApply(doc,'//text()[contains(.,"safety, health, environmental and sustainability challenges")]/following::table[1]');
    ## <table cellspacing="0" cellpadding="0" width="100%" border="0" style="BORDER-COLLAPSE:COLLAPSE" align="center">
    ##   <tr><td width="48%"/>
    ## <td valign="bottom" width="12%"/>
    ## <td/>
    ## <td/>
    ## <td/>
    ## <td valign="bottom" width="12%"/>
    ## <td/>
    ## <td/>
    ## <td/>
    ## <td valign="bottom" width="12%"/>
    ## <td/>
    ## <td/>
    ## <td/>
    ## <td valign="bottom" width="12%"/>
    ## <td/>
    ## <td/>
    ## <td/></tr>
    ##   <tr><td valign="bottom" nowrap="nowrap" align="center" style="border-bottom:1px solid #000000"> <p style="margin-top:0px;margin-bottom:1px" align="center"><font style="font-family:Times New Roman" size="1"><b>Name</b></font></p></td>
    ## <td valign="bottom"><font size="1">  </font></td>
    ## <td valign="bottom" colspan="2" nowrap="nowrap" align="center" style="border-bottom:1px solid #000000"><font style="font-family:Times New Roman" size="1"><b>Audit<br/>Committee</b></font></td>
    ## <td valign="bottom"><font size="1"> </font></td>
    ## <td valign="bottom"><font size="1"> </font></td>
    ## <td valign="bottom" colspan="2" nowrap="nowrap" align="center" style="border-bottom:1px solid #000000"><font style="font-family:Times New Roman" size="1"><b>Compensation<br/>Committee</b></font></td>
    ## <td valign="bottom"><font size="1"> </font></td>
    ## <td valign="bottom"><font size="1"> </font></td>
    ## <td valign="bottom" colspan="2" nowrap="nowrap" align="center" style="border-bottom:1px solid #000000"><font style="font-family:Times New Roman" size="1"><b>Nominating and<br/>Corporate<br/>Governance<br/>Committee</b></font></td>
    ## <td valign="bottom"><font size="1"> </font></td>
    ## <td valign="bottom"><font size="1"> </font></td>
    ## <td valign="bottom" colspan="2" nowrap="nowrap" align="center" style="border-bottom:1px solid #000000"><font style="font-family:Times New Roman" size="1"><b>Safety, Health,<br/>Environmental and<br/>Sustainability<br/>Committee</b></font></td>
    ## <td valign="bottom"><font size="1"> </font></td></tr>
    ##   <tr bgcolor="#cceeff"><td valign="top"> <p style="margin-left:1.00em; text-indent:-1.00em"><font style="font-family:Times New Roman" size="2">Kevin S. Crutchfield</font><font style="font-family:Times New Roman" size="1"><sup style="vertical-align:baseline; position:relative; bottom:.8ex">(1)</sup></font><font style="font-family:Times New Roman" size="2"/></p></td>
    ## <td valign="bottom"><font size="1">  </font></td>
    ## <td valign="bottom"/>
    ## <td valign="bottom"/>
    ## <td valign="bottom"/>
    ## <td valign="bottom"><font size="1"> </font></td>
    ## <td valign="bottom"/>
    ## <td valign="bottom"/>
    ## <td valign="bottom"/>
    ## <td valign="bottom"><font size="1"> </font></td>
    ## <td valign="bottom"/>
    ## <td valign="bottom"/>
    ## <td valign="bottom"/>
    ## <td valign="bottom"><font size="1"> </font></td>
    ## <td valign="bottom"><font style="font-family:Times New Roman" size="2"> </font></td>
    ## <td valign="bottom" align="right"><font style="font-family:Times New Roman" size="2">X</font></td>
    ## <td nowrap="nowrap" valign="bottom"><font style="font-family:Times New Roman" size="2">  </font></td></tr>
    ##   <tr><td valign="top"> <p style="margin-left:1.00em; text-indent:-1.00em"><font style="font-family:Times New Roman" size="2">Angelo C. Brisimitzakis</font></p></td>
    ## <td valign="bottom"><font size="1">  </font></td>
    ## <td valign="bottom"/>
    ## <td valign="bottom"/>
    ## <td valign="bottom"/>
    ## <td valign="bottom"><font size="1"> </font></td>
    ## <td valign="bottom"><font style="font-family:Times New Roman" size="2"> </font></td>
    ## <td valign="bottom" align="right"><font style="font-family:Times New Roman" size="2">X</font></td>
    ## <td nowrap="nowrap" valign="bottom"><font style="font-family:Times New Roman" size="2">  </font></td>
    ## <td valign="bottom"><font size="1"> </font></td>
    ## <td valign="bottom"><font style="font-family:Times New Roman" size="2"> </font></td>
    ## <td valign="bottom" align="right"><font style="font-family:Times New Roman" size="2">X</font></td>
    ## <td nowrap="nowrap" valign="bottom"><font style="font-family:Times New Roman" size="2">  </font></td>
    ## <td valign="bottom"><font size="1"> </font></td>
    ## <td valign="bottom"><font style="font-family:Times New Roman" size="2"> </font></td>
    ## <td valign="bottom" align="right"><font style="font-family:Times New Roman" size="2">X</font></td>
    ## <td nowrap="nowrap" valign="bottom"><font style="font-family:Times New Roman" size="2">  </font></td></tr>
    ##   <tr bgcolor="#cceeff"><td valign="top"> <p style="margin-left:1.00em; text-indent:-1.00em"><font style="font-family:Times New Roman" size="2">William J. Crowley, Jr.</font></p></td>
    ## <td valign="bottom"><font size="1">  </font></td>
    ## <td valign="bottom"/>
    ## <td valign="bottom"/>
    ## <td valign="bottom"/>
    ## <td valign="bottom"><font size="1"> </font></td>
    ## <td valign="bottom"/>
    ## <td valign="bottom"/>
    ## <td valign="bottom"/>
    ## <td valign="bottom"><font size="1"> </font></td>
    ## <td valign="bottom"/>
    ## <td valign="bottom"/>
    ## <td valign="bottom"/>
    ## <td valign="bottom"><font size="1"> </font></td>
    ## <td valign="bottom"><font style="font-family:Times New Roman" size="2"> </font></td>
    ## <td valign="bottom" align="right"><font style="font-family:Times New Roman" size="2">X</font></td>
    ## <td nowrap="nowrap" valign="bottom"><font style="font-family:Times New Roman" size="2">  </font></td></tr>
    ##   <tr><td valign="top"> <p style="margin-left:1.00em; text-indent:-1.00em"><font style="font-family:Times New Roman" size="2">E. Linn Draper, Jr.</font></p></td>
    ## <td valign="bottom"><font size="1">  </font></td>
    ## <td valign="bottom"><font style="font-family:Times New Roman" size="2"> </font></td>
    ## <td valign="bottom" align="right"><font style="font-family:Times New Roman" size="2">X</font></td>
    ## <td nowrap="nowrap" valign="bottom"><font style="font-family:Times New Roman" size="2">  </font></td>
    ## <td valign="bottom"><font size="1"> </font></td>
    ## <td valign="bottom"/>
    ## <td valign="bottom"/>
    ## <td valign="bottom"/>
    ## <td valign="bottom"><font size="1"> </font></td>
    ## <td valign="bottom"><font style="font-family:Times New Roman" size="2"> </font></td>
    ## <td valign="bottom" align="right"><font style="font-family:Times New Roman" size="2">X</font></td>
    ## <td nowrap="nowrap" valign="bottom"><font style="font-family:Times New Roman" size="2">  </font></td>
    ## <td valign="bottom"><font size="1"> </font></td>
    ## <td valign="bottom"/>
    ## <td valign="bottom"/>
    ## <td valign="bottom"/></tr>
    ##   <tr bgcolor="#cceeff"><td valign="top"> <p style="margin-left:1.00em; text-indent:-1.00em"><font style="font-family:Times New Roman" size="2">Glenn A. Eisenberg</font><font style="font-family:Times New Roman" size="1"><sup style="vertical-align:baseline; position:relative; bottom:.8ex">(2)</sup></font><font style="font-family:Times New Roman" size="2"/></p></td>
    ## <td valign="bottom"><font size="1">  </font></td>
    ## <td valign="bottom"/>
    ## <td valign="bottom"/>
    ## <td valign="bottom"/>
    ## <td valign="bottom"><font size="1"> </font></td>
    ## <td valign="bottom"/>
    ## <td valign="bottom"/>
    ## <td valign="bottom"/>
    ## <td valign="bottom"><font size="1"> </font></td>
    ## <td valign="bottom"><font style="font-family:Times New Roman" size="2"/><font style="font-family:Times New Roman" size="1"><sup style="vertical-align:baseline; position:relative; bottom:.8ex"/></font><font style="font-family:Times New Roman" size="2"> </font></td>
    ## <td valign="bottom" align="right"><font style="font-family:Times New Roman" size="2">X</font><font style="font-family:Times New Roman" size="1"><sup style="vertical-align:baseline; position:relative; bottom:.8ex"/></font><font style="font-family:Times New Roman" size="2"/></td>
    ## <td nowrap="nowrap" valign="bottom"><font style="font-family:Times New Roman" size="2"/><font style="font-family:Times New Roman" size="1"><sup style="vertical-align:baseline; position:relative; bottom:.8ex">(3)</sup></font><font style="font-family:Times New Roman" size="2"> </font></td>
    ## <td valign="bottom"><font size="1"> </font></td>
    ## <td valign="bottom"><font style="font-family:Times New Roman" size="2"> </font></td>
    ## <td valign="bottom" align="right"><font style="font-family:Times New Roman" size="2">X</font></td>
    ## <td nowrap="nowrap" valign="bottom"><font style="font-family:Times New Roman" size="2">  </font></td></tr>
    ##   <tr><td valign="top"> <p style="margin-left:1.00em; text-indent:-1.00em"><font style="font-family:Times New Roman" size="2">Deborah M. Fretz</font></p></td>
    ## <td valign="bottom"><font size="1">  </font></td>
    ## <td valign="bottom"><font style="font-family:Times New Roman" size="2"/><font style="font-family:Times New Roman" size="1"><sup style="vertical-align:baseline; position:relative; bottom:.8ex"/></font><font style="font-family:Times New Roman" size="2"> </font></td>
    ## <td valign="bottom" align="right"><font style="font-family:Times New Roman" size="2">X</font><font style="font-family:Times New Roman" size="1"><sup style="vertical-align:baseline; position:relative; bottom:.8ex"/></font><font style="font-family:Times New Roman" size="2"/></td>
    ## <td nowrap="nowrap" valign="bottom"><font style="font-family:Times New Roman" size="2"/><font style="font-family:Times New Roman" size="1"><sup style="vertical-align:baseline; position:relative; bottom:.8ex">(3)</sup></font><font style="font-family:Times New Roman" size="2"> </font></td>
    ## <td valign="bottom"><font size="1"> </font></td>
    ## <td valign="bottom"><font style="font-family:Times New Roman" size="2"> </font></td>
    ## <td valign="bottom" align="right"><font style="font-family:Times New Roman" size="2">X</font></td>
    ## <td nowrap="nowrap" valign="bottom"><font style="font-family:Times New Roman" size="2">  </font></td>
    ## <td valign="bottom"><font size="1"> </font></td>
    ## <td valign="bottom"/>
    ## <td valign="bottom"/>
    ## <td valign="bottom"/>
    ## <td valign="bottom"><font size="1"> </font></td>
    ## <td valign="bottom"/>
    ## <td valign="bottom"/>
    ## <td valign="bottom"/></tr>
    ##   <tr bgcolor="#cceeff"><td valign="top"> <p style="margin-left:1.00em; text-indent:-1.00em"><font style="font-family:Times New Roman" size="2">P. Michael Giftos</font></p></td>
    ## <td valign="bottom"><font size="1">  </font></td>
    ## <td valign="bottom"><font style="font-family:Times New Roman" size="2"> </font></td>
    ## <td valign="bottom" align="right"><font style="font-family:Times New Roman" size="2">X</font></td>
    ## <td nowrap="nowrap" valign="bottom"><font style="font-family:Times New Roman" size="2">  </font></td>
    ## <td valign="bottom"><font size="1"> </font></td>
    ## <td valign="bottom"/>
    ## <td valign="bottom"/>
    ## <td valign="bottom"/>
    ## <td valign="bottom"><font size="1"> </font></td>
    ## <td valign="bottom"><font style="font-family:Times New Roman" size="2"> </font></td>
    ## <td valign="bottom" align="right"><font style="font-family:Times New Roman" size="2">X</font></td>
    ## <td nowrap="nowrap" valign="bottom"><font style="font-family:Times New Roman" size="2">  </font></td>
    ## <td valign="bottom"><font size="1"> </font></td>
    ## <td valign="bottom"><font style="font-family:Times New Roman" size="2"/><font style="font-family:Times New Roman" size="1"><sup style="vertical-align:baseline; position:relative; bottom:.8ex"/></font><font style="font-family:Times New Roman" size="2"> </font></td>
    ## <td valign="bottom" align="right"><font style="font-family:Times New Roman" size="2">X</font><font style="font-family:Times New Roman" size="1"><sup style="vertical-align:baseline; position:relative; bottom:.8ex"/></font><font style="font-family:Times New Roman" size="2"/></td>
    ## <td nowrap="nowrap" valign="bottom"><font style="font-family:Times New Roman" size="2"/><font style="font-family:Times New Roman" size="1"><sup style="vertical-align:baseline; position:relative; bottom:.8ex">(3)</sup></font><font style="font-family:Times New Roman" size="2"> </font></td></tr>
    ##   <tr><td valign="top"> <p style="margin-left:1.00em; text-indent:-1.00em"><font style="font-family:Times New Roman" size="2">L. Patrick Hassey</font></p></td>
    ## <td valign="bottom"><font size="1">  </font></td>
    ## <td valign="bottom"><font style="font-family:Times New Roman" size="2"> </font></td>
    ## <td valign="bottom" align="right"><font style="font-family:Times New Roman" size="2">X</font></td>
    ## <td nowrap="nowrap" valign="bottom"><font style="font-family:Times New Roman" size="2">  </font></td>
    ## <td valign="bottom"><font size="1"> </font></td>
    ## <td valign="bottom"><font style="font-family:Times New Roman" size="2"> </font></td>
    ## <td valign="bottom" align="right"><font style="font-family:Times New Roman" size="2">X</font></td>
    ## <td nowrap="nowrap" valign="bottom"><font style="font-family:Times New Roman" size="2">  </font></td>
    ## <td valign="bottom"><font size="1"> </font></td>
    ## <td valign="bottom"/>
    ## <td valign="bottom"/>
    ## <td valign="bottom"/>
    ## <td valign="bottom"><font size="1"> </font></td>
    ## <td valign="bottom"/>
    ## <td valign="bottom"/>
    ## <td valign="bottom"/></tr>
    ##   <tr bgcolor="#cceeff"><td valign="top"> <p style="margin-left:1.00em; text-indent:-1.00em"><font style="font-family:Times New Roman" size="2">Joel Richards, III</font></p></td>
    ## <td valign="bottom"><font size="1">  </font></td>
    ## <td valign="bottom"/>
    ## <td valign="bottom"/>
    ## <td valign="bottom"/>
    ## <td valign="bottom"><font size="1"> </font></td>
    ## <td valign="bottom"><font style="font-family:Times New Roman" size="2"/><font style="font-family:Times New Roman" size="1"><sup style="vertical-align:baseline; position:relative; bottom:.8ex"/></font><font style="font-family:Times New Roman" size="2"> </font></td>
    ## <td valign="bottom" align="right"><font style="font-family:Times New Roman" size="2">X</font><font style="font-family:Times New Roman" size="1"><sup style="vertical-align:baseline; position:relative; bottom:.8ex"/></font><font style="font-family:Times New Roman" size="2"/></td>
    ## <td nowrap="nowrap" valign="bottom"><font style="font-family:Times New Roman" size="2"/><font style="font-family:Times New Roman" size="1"><sup style="vertical-align:baseline; position:relative; bottom:.8ex">(3)</sup></font><font style="font-family:Times New Roman" size="2"> </font></td>
    ## <td valign="bottom"><font size="1"> </font></td>
    ## <td valign="bottom"/>
    ## <td valign="bottom"/>
    ## <td valign="bottom"/>
    ## <td valign="bottom"><font size="1"> </font></td>
    ## <td valign="bottom"><font style="font-family:Times New Roman" size="2"> </font></td>
    ## <td valign="bottom" align="right"><font style="font-family:Times New Roman" size="2">X</font></td>
    ## <td nowrap="nowrap" valign="bottom"><font style="font-family:Times New Roman" size="2">  </font></td></tr>
    ## </table>
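
The XPath above returns the raw `<table>` node. As a minimal sketch of turning such a node into a data frame, the example below runs the same expression against a small inline page and passes the matched node to `readHTMLTable`. The inline HTML and the names `marker`, `xpath`, `nodes`, and `target` are made up for illustration; they are not taken from the SEC filing.

```r
library(XML)

# Hypothetical miniature page: two tables, with the marker text between them.
html <- '
<html><body>
  <table><tr><td>skip</td><td>me</td></tr></table>
  <p>Our safety, health, environmental and sustainability challenges.</p>
  <table>
    <tr><td>Name</td><td>Audit</td></tr>
    <tr><td>Jane Doe</td><td>X</td></tr>
  </table>
</body></html>'

doc <- htmlParse(html, asText = TRUE)

# Same pattern as the answer: find the text node containing the marker,
# then take the first <table> that follows it in document order.
marker <- "safety, health, environmental and sustainability challenges"
xpath <- sprintf('//text()[contains(., "%s")]/following::table[1]', marker)

# getNodeSet() returns a list of matching nodes; use the first match.
nodes <- getNodeSet(doc, xpath)
target <- readHTMLTable(nodes[[1]], header = FALSE, stringsAsFactors = FALSE)
print(target)
```

Note that `contains()` is case-sensitive, so the marker has to match the page text exactly, and if the phrase occurs more than once the expression yields one table per matching text node.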
    

    【Comments】:

    • Works like a charm. Thanks!