Vivino - Scraping with R

【Question title】: Vivino - Scraping with R
【Posted】: 2021-05-06 03:52:19
【Question description】:

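I want to scrape basic data about wines from Vivino. I have never done any scraping before, but based on some tutorials and lectures on Datacamp I tried some basic code with the rvest library. However, it does not seem to work and returns zero values. Can anyone help me and tell me where the problem is? Is the code completely wrong and should I use another approach, or am I just missing something and doing it wrong? Thanks in advance for any answer!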

library(rvest)
library(dplyr)

url <- 'https://www.vivino.com/explore?e=eJwNybEOQDAQBuC3ubkG4z-abMQkIqdO00RbuTbF2_OtX1A0FHyEocAPWmPIvhh7suimga5_3YHK6qXwSWmDcvHR5ZWrKDuhhF2ypbvMC5oP96QajA%3D%3D&cart_item_source=nav-explore'
web <- read_html(url)

winery_data <- web %>% html_nodes('.vintageTitle__winery--2YoIr') %>% html_text()
head(winery_data)
wine_name <- web %>% html_nodes('.vintageTitle__wine--U7t9G') %>% html_text()
wine_country <- web %>% html_nodes('.vintageLocation__anchor--T7J3k+ .vintageLocation__anchor--T7J3k') %>% html_text()
wine_region <- web %>% html_nodes('span+ .vintageLocation__anchor--T7J3k') %>% html_text()
wine_rating <- web %>% html_nodes('.vivinoRating__averageValue--3Navj') %>% html_text()
n_ratings <- web %>% html_nodes('.vivinoRating__caption--3tZeS') %>% html_text()

【Question comments】:

Tags: html r web-scraping screen-scraping rvest

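【Solution 1】:

The page loads its content dynamically, which is why rvest alone does not work: read_html() only sees the initial HTML, before the wine cards have been added. You also need RSelenium to drive a real browser.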
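To make the "returns zero values" symptom concrete, here is a minimal sketch that runs the question's selector against a static page whose content has not been rendered yet. The HTML string and the `explore-page-app` id are hypothetical stand-ins, not Vivino's actual markup; only the selector comes from the question.

```r
library(rvest)

# Hypothetical stand-in for the HTML the server sends before any JavaScript runs:
# the container is empty because the wine cards are added later by scripts.
static_html <- '<html><body><div id="explore-page-app"></div></body></html>'
static_page <- read_html(static_html)

# Same selector as in the question; nothing in the static markup matches it.
static_winery <- static_page %>%
  html_nodes(".vintageTitle__winery--2YoIr") %>%
  html_text()

# An empty character vector, which is what the question describes.
length(static_winery)
```

If the selector finds nothing in the downloaded HTML, the problem is usually not the selector itself but the fact that the elements do not exist until a browser has executed the page's scripts.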

Assuming you use Firefox, the code below should work:

# RSelenium with Firefox (`url` is the variable defined in the question)
rD <- RSelenium::rsDriver(browser = "firefox", port = 4546L, verbose = FALSE)
remDr <- rD[["client"]]
remDr$navigate(url)

# Scroll down a couple of times to reach the bottom of the page
# so that additional data load dynamically with each scroll.
# Here I scroll 4 times, but perhaps you will need much more than that.
for(i in 1:4){      
  remDr$executeScript(paste("scroll(0,",i*10000,");"))
  Sys.sleep(3)    
}

# get the page source
web <- remDr$getPageSource()
web <- xml2::read_html(web[[1]])

# close RSelenium
remDr$close()
gc()
rD$server$stop()
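# Windows only: force-kill any leftover Java process.
# taskkill does not exist on macOS or Linux, so skip it there.
if (.Platform$OS.type == "windows") {
  system("taskkill /im java.exe /f", intern = FALSE, ignore.stdout = FALSE)
}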

# now we can go on to our rvest code and scrape the data
winery_data <- web %>% html_nodes('.vintageTitle__winery--2YoIr') %>% html_text()
head(winery_data)
wine_name <- web %>% html_nodes('.vintageTitle__wine--U7t9G') %>% html_text()
wine_country <- web %>% html_nodes('.vintageLocation__anchor--T7J3k+ .vintageLocation__anchor--T7J3k') %>% html_text()
wine_region <- web %>% html_nodes('span+ .vintageLocation__anchor--T7J3k') %>% html_text()
wine_rating <- web %>% html_nodes('.vivinoRating__averageValue--3Navj') %>% html_text()
n_ratings <- web %>% html_nodes('.vivinoRating__caption--3tZeS') %>% html_text()
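Since html_text() returns character vectors, the ratings usually need converting before analysis. The sketch below uses hypothetical sample strings in place of the scraped values; the exact text Vivino renders (for example "1234 ratings") is an assumption, so adjust the cleaning step to what you actually get back.

```r
# Hypothetical sample values standing in for the scraped character vectors.
wine_rating_raw <- c("4.2", "3.9")
n_ratings_raw <- c("1234 ratings", "87 ratings")

wines <- data.frame(
  rating = as.numeric(wine_rating_raw),
  # Keep only the digits, then convert to integer.
  n_ratings = as.integer(gsub("[^0-9]", "", n_ratings_raw)),
  stringsAsFactors = FALSE
)

str(wines)
```

This only works if all vectors have the same length, so it is worth comparing length(wine_rating) and length(n_ratings) on the real data first.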

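【Comments】:

That helped a lot. It works perfectly. Thank you very much! One more question to be clear: you used a loop, for(i in 1:4). What exactly does 1:4 mean? Or put another way: what would change if I used 1:10, for example? What is the point of this loop? Thanks!
Glad it worked! Oh, it means the script scrolls down 4 times (pausing for 3 seconds after each scroll via Sys.sleep(3)). First with i equal to 1, then 2, then 3, and finally 4, so four times in total. If you write 1:10, it repeats the scroll-and-wait step 10 times (which may even be better in your case, since the page you are scraping seems very long... if it is not an infinite-scroll page altogether).
When I run your code on a Mac, I get this error: sh: taskkill: command not found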
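To illustrate what the range in the loop controls, this small sketch builds the JavaScript snippets the loop would send, without starting a browser. build_scroll_scripts() is a hypothetical helper written only for this illustration; it is not part of RSelenium, and the step of 10000 pixels is taken from the answer's code.

```r
# Build the scroll commands that for(i in 1:n) would send to the browser.
# build_scroll_scripts() is a hypothetical helper used only for illustration.
build_scroll_scripts <- function(n_scrolls, step = 10000) {
  vapply(
    seq_len(n_scrolls),
    function(i) sprintf("scroll(0, %d);", as.integer(i * step)),
    character(1)
  )
}

scripts_4 <- build_scroll_scripts(4)    # 1:4  -> four scroll commands
scripts_10 <- build_scroll_scripts(10)  # 1:10 -> ten scroll commands

length(scripts_4)
length(scripts_10)
scripts_4
```

Each element corresponds to one pass through the loop, so a larger range simply means more scroll-and-wait rounds and therefore more wines loaded before the page source is read.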