Using rvest to scrape a webpage

【Question title】: Using rvest to scrape a webpage
【Posted】: 2019-02-17 14:08:43
【Question】:

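I am trying to scrape a table from Times Higher Education.

I used the following code, but the result is an empty table. What am I doing wrong?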

pacman::p_load(rvest)

webpage <- read_html(paste0('https://www.timeshighereducation.com/rankings/', 
                            'united-states/2018#!/page/0/length/-1/sort_by/', 
                            'stats_salary/sort_order/desc/cols/stats'))


d <- html_nodes(webpage, xpath = '//table') %>% 
  html_table()

d

[[1]]
 [1] rank order           Rank                  Name                  Node ID              
 [5] Overall                                     Resources                                  
 [9] Engagement                                  Outcomes                                   
[13] Environment                                                                            
[17]                                                                                        
[21] Tuition and Fees      Room and Board        Salary after 10 years
<0 rows> (or 0-length row.names)

【Comments】:

Tags: html r xpath rvest

【Solution 1】:

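I found the data! It turns out that timeshighereducation.com uses JavaScript to load the data, so the usual rvest routine does not work: the HTML that read_html() receives contains only the table header, and the rows are filled in later by a script running in the browser.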
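A minimal sketch of why the table comes back empty, assuming the server sends only the header row and a script adds the data rows in the browser. The HTML string below is a made-up stand-in, not the actual Times Higher Education markup:

```r
library(rvest)

# Hypothetical static HTML: the header is present, but the body is empty
# because a script would normally populate it in the browser.
static_html <- paste0(
  '<html><body>',
  '<table>',
  '<thead><tr><th>Rank</th><th>Name</th></tr></thead>',
  '<tbody></tbody>',
  '</table>',
  '<script>/* rows would be added here at run time */</script>',
  '</body></html>')

page <- read_html(static_html)

# rvest only sees what the server sent, so there are no data rows to parse
header_cells <- html_text(html_nodes(page, xpath = '//table/thead/tr/th'))
body_rows    <- html_nodes(page, xpath = '//table/tbody/tr')

header_cells
length(body_rows)
```

A header with no body rows matches the "<0 rows>" output shown in the question.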

I found the following link helpful for understanding how to handle pages that use JavaScript to display data: rvest and V8

My first thought was to find out which script node contains what I want. It appears to be the 9th in the list. I then converted it to text.

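# Take the 9th <script> node and extract its text
# (named script_text rather than t, which would mask base::t)
script_text <- html_nodes(webpage, 'script') %>% 
  .[9] %>% 
  html_text()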

On closer inspection of that text, I found that the script refers to a JSON file. If I enter its URL in Chrome, I can actually see the data.
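Relying on the script's position (the 9th node) may break if the page changes. One hedged alternative is to search the text of every script for a URL ending in .json. The HTML and the example.com URL below are invented for illustration, and the regular expression is only an assumption about how the URL appears in the script:

```r
library(rvest)

# Hypothetical page with two scripts; only the second mentions a JSON file
html <- paste0(
  '<html><head>',
  '<script>var a = 1;</script>',
  '<script>settings = {"dataUrl": ',
  '"https://example.com/files/rankings_2018.json"};</script>',
  '</head><body></body></html>')

page <- read_html(html)

# One character string per <script> node
script_text <- html_text(html_nodes(page, 'script'))

# Pull out anything that looks like an http(s) URL ending in .json
json_pattern <- 'https?://[^"\' ]+\\.json'
matches <- regmatches(script_text, gregexpr(json_pattern, script_text))
json_urls <- unique(unlist(matches))

json_urls
```

If the real script escapes the slashes in the URL (for example as \/), the pattern would need adjusting and the match would need unescaping before use.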

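So it seemed easy to get the data with one of the many R packages available for handling JSON. I chose jsonlite. It was simple: just 5 lines of code to get the data. I'm happy now :)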

library(jsonlite)
college_json <- fromJSON(paste0(
  'https://www.timeshighereducation.com/sites/default/files/the_data_rankings/', 
  'united_states_rankings_2018_limit0_efdb24148bae97278bbfe6ecfd71cdd9.json'))

college_dat <- college_json$data
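To illustrate what fromJSON() returns in a case like this, here is a small sketch with an inline JSON string. The field names (rank, name) are placeholders and not necessarily those in the real file; the point is that an array of records under data is simplified to a data frame:

```r
library(jsonlite)

# Hypothetical JSON with the same top-level shape: a "data" array of records
sample_json <- '{"data": [
  {"rank": "1", "name": "College A"},
  {"rank": "2", "name": "College B"}
]}'

parsed <- fromJSON(sample_json)

# fromJSON simplifies an array of same-shaped objects into a data frame
sample_dat <- parsed$data

str(sample_dat)
```

With the real file, str(college_dat) or names(college_dat) should show which columns are available before any further processing.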

【Comments】: