【Question title】: Using rvest or xml to parse link to a next page?
【Posted】: 2016-11-06 15:22:08
【Question description】:

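I'm learning how to scrape data from websites. I've been playing with the rvest package and have figured out how to extract nodes using SelectorGadget and similar tools. For a quick project, I want to pull data from a flight-deals website and turn it into a data frame that I can later subset and email to myself with the useful flights. All the code I'm using is below.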

library(rvest)
reg = paste("http://www.secretflying.com/usa-deals/") 

#read the text from the flight deal-----------
fly_deals = read_html(reg)
fly_deals = html_nodes(fly_deals, ".entry-title a")
fly_deals = html_text(fly_deals)
fly_deals = as.data.frame(fly_deals)

#add link (not sure how to access the link)
fly_deals$correpsonding_link = 'corresponding_link'

#last step would filter out for NYC
fly_deals = fly_deals[grepl("NEW YORK", fly_deals$fly_deals),]

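What I want to do now is get at the page associated with each row (that is, each node), so that I can build another column holding the corresponding link, which I can then open directly from my email. The end product would therefore be a data frame with the deal text in one column and its link in the next.

Any help is appreciated!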

【Question discussion】:

    Tags: r web-scraping rvest


    【Solution 1】:

    Try:

    library(rvest)
    
    deals_link <- "http://www.secretflying.com/usa-deals/"
    deals_info <- deals_link %>% read_html() %>%
      html_nodes(".entry-title a")
    
    fly_deals <- data.frame(deals = html_text(deals_info), correpsonding_link = html_attr(deals_info,"href"))
    
    fly_deals[grepl("NEW YORK", fly_deals$deals),]
    

    Output:

     deals                                                                  
     NON-STOP FROM NEW YORK TO CARTAGENA, COLOMBIA FOR ONLY $328 ROUNDTRIP  
     XMAS & NEW YEAR: NEW YORK TO THE TURKS & CAICOS FOR ONLY $231 ROUNDTRIP
     NEW YORK TO BOSTON (& VICE VERSA) FOR ONLY $66 ROUNDTRIP               
     correpsonding_link                                                         
     http://www.secretflying.com/2016/new-york-cartagena-colombia-296-roundtrip/
     http://www.secretflying.com/2016/hot-new-york-turks-caicos-58-one-way/     
     http://www.secretflying.com/2016/new-york-boston-vice-versa-66-roundtrip/ 
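
The key step above is reading both the text and the `href` attribute from the same node set. A minimal sketch of that idea, runnable offline: the inline HTML is a hypothetical stand-in for the listing page, the relative hrefs and the `link` column name are assumptions made for illustration, and the base URL is the site root from the question. `xml2::url_absolute()` is only needed if a page uses relative links.

```r
library(rvest)

# Hypothetical stand-in for the listing page, so the example runs without network access.
page <- read_html('<html><body>
  <h2 class="entry-title"><a href="/2016/new-york-boston/">NEW YORK TO BOSTON FOR ONLY $66 ROUNDTRIP</a></h2>
  <h2 class="entry-title"><a href="/2016/chicago-paris/">CHICAGO TO PARIS FOR ONLY $399 ROUNDTRIP</a></h2>
</body></html>')

# Select the anchor nodes once, then read text and href from the same node set
# so the two columns stay aligned row by row.
anchors <- html_nodes(page, ".entry-title a")

fly_deals <- data.frame(
  deals = html_text(anchors),
  link  = html_attr(anchors, "href"),
  stringsAsFactors = FALSE
)

# Resolve relative hrefs against the site root (assumed base URL).
fly_deals$link <- xml2::url_absolute(fly_deals$link, "http://www.secretflying.com/")

# Keep only the deals that mention New York, matching the literal string.
nyc_deals <- fly_deals[grepl("NEW YORK", fly_deals$deals, fixed = TRUE), ]
print(nyc_deals)
```

Because `html_text()` and `html_attr()` each return one element per node, building both columns from `anchors` avoids any mismatch between a deal and its link.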
    

    I hope this helps.

    【Discussion】:

    • This is perfect. I see that html_attr was the essential piece I was missing. Thanks!!