【Question title】: XPath 1.0 expression returns NULL
【Posted】: 2014-11-16 10:25:47
【Question】:

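On this site, http://www.lewisthomason.com/locations/, the following section of HTML contains what I want to extract: the four cities where the firm has offices (Knoxville, Memphis, Nashville, and Sevierville).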

<div id="the_content">
<div class="one_fourth">
<h3>
<cufon class="cufon cufon-canvas" alt="KNOXVILLE" style="width: 87px; height: 26px;">
<canvas width="104" height="25" style="width: 104px; height: 25px; top: -1px; left: 0px;"></canvas>
<cufontext>KNOXVILLE</cufontext>
</cufon>
</h3>
<p>
<h6>
</div>
<div class="one_fourth">
<div class="one_fourth">
<div class="one_fourth last">
<div class="clearboth"></div>
<p></p>
</div>
</div>
<div id="secondary"> </div>
<div class="clearboth"></div>
</div>

I have tried several variations of these XPath queries:

require(XML)
require(httr)
doc <- content(GET('http://www.lewisthomason.com/locations/'))

xpathSApply(doc, "//div[@id = 'the_content']/div//p", xmlValue, trim = TRUE)
xpathSApply(doc, "//div[@class = 'one_fourth']//p", xmlValue, trim = TRUE)

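All I get is NULL. What expression will return the city names or the full addresses? I know the div for the fourth city has a different class ("one_fourth last"), so I will adjust the last expression accordingly.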

Thank you for any guidance.

【Comments】:

    Tags: html r xpath html-parsing rvest


    【Solution 1】:

    rvest to the rescue, using CSS selectors (XPath works too):

    library(rvest) # for scraping
    library(httr)  # only for user_agent()
    
    pg <- html_session("http://www.lewisthomason.com/locations/", 
                       user_agent("Mozilla/5.0 (Windows NT 5.1; rv:31.0) Gecko/20100101 Firefox/31.0"))
    
    # get names
    pg %>% html_nodes("h3") %>% html_text()
    
    ## [1] "KNOXVILLE"   "MEMPHIS"     "NASHVILLE"   "SEVIERVILLE"
    
    # get locations
    pg %>% html_nodes("h3~p") %>% html_text() %>% .[1:4]
    
    ## [1] "One Centre Square, Fifth Floor\n620 Market Street\nPO Box 2425\nKnoxville, TN 37901\nPhone (865) 546-4646\nFax (865) 523-6529"
    ## [2] "40 S Main St #2900\nMemphis, TN 38103\nPhone (901) 525-8721\nFax (901) 525-6722"                                              
    ## [3] "424 Church Street, Suite 2500\nPO Box 198615\nNashville, TN 37219\nPhone (615) 259-1366\nFax (615) 259-1389"                  
    ## [4] "248 Bruce St, Suite 2\nSevierville, TN 37862\nPhone (865) 429-1999\nFax (865) 428-1612"  
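
For readers who prefer to stay with XPath, the CSS selectors above have direct XPath counterparts. The following is a minimal sketch, not the answerer's code: it assumes the XML package and runs against a small hypothetical fragment (the `html` string is a simplified stand-in for the real page, with shortened addresses). It also illustrates why `@class = 'one_fourth'` skips the last office: that test is an exact string comparison, so a div with class "one_fourth last" needs a token-style match instead.

```r
library(XML)

# Hypothetical, simplified stand-in for the real page (not the actual markup)
html <- '<div id="the_content">
  <div class="one_fourth"><h3>KNOXVILLE</h3><p>Knoxville, TN 37901</p></div>
  <div class="one_fourth"><h3>MEMPHIS</h3><p>Memphis, TN 38103</p></div>
  <div class="one_fourth last"><h3>SEVIERVILLE</h3><p>Sevierville, TN 37862</p></div>
</div>'
doc <- htmlParse(html, asText = TRUE)

# CSS "h3" corresponds to XPath "//h3"
cities <- xpathSApply(doc, "//h3", xmlValue)

# CSS "h3~p" corresponds to XPath "//h3/following-sibling::p"
addresses <- xpathSApply(doc, "//h3/following-sibling::p", xmlValue, trim = TRUE)

# Exact comparison: does not match class="one_fourth last"
exact <- xpathSApply(doc, "//div[@class = 'one_fourth']//p", xmlValue, trim = TRUE)

# Token match: treats @class as a space-separated list, so it matches every office
token <- xpathSApply(
  doc,
  "//div[contains(concat(' ', normalize-space(@class), ' '), ' one_fourth ')]//p",
  xmlValue, trim = TRUE
)

print(cities)
print(length(exact))
print(length(token))
```

On this fragment the exact comparison finds two paragraphs and the token match finds all three.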
    

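    【Comments】:

    • A wrapper around a wrapper around a wrapper ;)
    • Indeed :-) Still, this should make it easier for people to get at data, especially with the SelectorGadget bookmarklet that Hadley covers in the vignette. It also fits nicely with the whole new "piping" idiom.
    • BTW rvest imports %>% from magrittr, so you don't need dplyr
    • @hadley, thanks. I use those three library calls so often that I now just type them by rote :-)
    【Solution 2】: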

    The site checks the user agent. If you give it a suitable user agent, it sends you the correct content:

    require(XML)
    require(RCurl)
    myAgent <- "Mozilla/5.0 (Windows NT 5.1; rv:31.0) Gecko/20100101 Firefox/31.0"
    doc <- getURL('http://www.lewisthomason.com/locations/', useragent = myAgent)
    doc <- htmlParse(doc)
    
    
    > xpathSApply(doc, "//div[@id = 'the_content']/div//p", xmlValue, trim = TRUE)
    [1] "One Centre Square, Fifth Floor\n620 Market Street\nPO Box 2425\nKnoxville, TN 37901\nPhone (865) 546-4646\nFax (865) 523-6529"
    [2] "40 S Main St #2900\nMemphis, TN 38103\nPhone (901) 525-8721\nFax (901) 525-6722"                                              
    [3] "424 Church Street, Suite 2500\nPO Box 198615\nNashville, TN 37219\nPhone (615) 259-1366\nFax (615) 259-1389"                  
    [4] "248 Bruce St, Suite 2\nSevierville, TN 37862\nPhone (865) 429-1999\nFax (865) 428-1612"                                       
    [5] ""                                                                                                                             
    > xpathSApply(doc, "//div[@class = 'one_fourth']//p", xmlValue, trim = TRUE)
    [1] "One Centre Square, Fifth Floor\n620 Market Street\nPO Box 2425\nKnoxville, TN 37901\nPhone (865) 546-4646\nFax (865) 523-6529"
    [2] "40 S Main St #2900\nMemphis, TN 38103\nPhone (901) 525-8721\nFax (901) 525-6722"                                              
    [3] "424 Church Street, Suite 2500\nPO Box 198615\nNashville, TN 37219\nPhone (615) 259-1366\nFax (615) 259-1389" 
    

    Otherwise it sends:

    > getURL('http://www.lewisthomason.com/locations/')
    [1] "<!DOCTYPE HTML PUBLIC \"-//IETF//DTD HTML 2.0//EN\">\n<html><head>\n<title>403 Forbidden</title>\n</head><body>\n<h1>Forbidden</h1>\n<p>You don't have permission to access /locations/\non this server.</p>\n</body></html>\n"
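
To see why the original queries came back empty, it can help to parse the error page itself. This is a small offline sketch, assuming the XML package; the `forbidden` string is copied from the output above, and the variable names are only illustrative. The 403 page contains no div with id "the_content", so the question's XPath has nothing to match, while the page title makes the real problem obvious.

```r
library(XML)

# Body the server returns when no browser-like user agent is sent
# (copied from the getURL() output above)
forbidden <- paste0(
  "<!DOCTYPE HTML PUBLIC \"-//IETF//DTD HTML 2.0//EN\">\n",
  "<html><head>\n<title>403 Forbidden</title>\n</head><body>\n",
  "<h1>Forbidden</h1>\n",
  "<p>You don't have permission to access /locations/\non this server.</p>\n",
  "</body></html>\n"
)

doc403 <- htmlParse(forbidden, asText = TRUE)

# Checking the title first is a cheap way to spot an error page
page_title <- xpathSApply(doc403, "//title", xmlValue)

# The XPath from the question matches nothing in this document
hits <- xpathSApply(doc403, "//div[@id = 'the_content']/div//p",
                    xmlValue, trim = TRUE)

print(page_title)
print(length(hits))
```

Inspecting the title (or the HTTP status code, if the client exposes it) before running any XPath separates a blocked request from a wrong expression.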
    

    【Comments】:
