R - Using rvest to scrape Google + reviews

【Question title】: R - Using rvest to scrape Google + reviews
【Posted】: 2018-05-04 21:21:42
【Question】:

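As part of a project, I am trying to scrape complete reviews from Google + (in earlier attempts on other websites, the reviews I got were truncated by a "More" link, which hides the full review unless you click it).

I chose the rvest package for this. However, I don't seem to be getting the results I want.

Here are my steps: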

library(rvest)
library(xml2)
library(RSelenium)

queens <- read_html("https://www.google.co.uk/search?q=queen%27s+hospital+romford&oq=queen%27s+hospitql+&aqs=chrome.1.69i57j0l5.5843j0j4&sourceid=chrome&ie=UTF-8#lrd=0x47d8a4ce4aaaba81:0xf1185c71ae14d00,1,,,")

#Here I use the selectorgadget tool to identify the user review part that I wish to scrape

reviews=queens %>%
html_nodes(".review-snippet") %>%
html_text()

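However, this does not seem to work. I get no output here.

I am very new to this package and to web scraping, so any input on this would be much appreciated.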

【Comments】:

This violates Google's Terms of Service.

Tags: r web-scraping rvest rselenium

【Solution 1】:

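Here is a workflow using RSelenium and rvest:
1. Scroll down repeatedly to load as much content as possible, pausing every so often to let the content load.
2. Click every "click on more" button to expand the full reviews.
3. Get the page source and use rvest to extract all the reviews into a list.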
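
Step 1 can also be sketched as a loop that stops once scrolling no longer brings in new items, rather than a fixed number of scrolls. This is a minimal sketch, not part of the original answer: `scroll_until_stable`, `count_items`, and `scroll_once` are hypothetical names, and the two callbacks are supplied by the caller. It assumes one call to `scroll_once` is enough to trigger loading; if it is not, the loop may stop early.

```r
# Hypothetical helper: keep scrolling until the item count stops growing.
# count_items: function returning the number of items currently loaded.
# scroll_once: function that performs one scroll action.
scroll_until_stable <- function(count_items, scroll_once,
                                max_rounds = 50, pause = 1) {
  previous <- count_items()
  for (round in seq_len(max_rounds)) {
    scroll_once()
    Sys.sleep(pause)                  # give the page time to load new content
    current <- count_items()
    if (current <= previous) break    # nothing new appeared; assume we reached the end
    previous <- current
  }
  previous                            # number of items seen when the loop stopped
}

# Possible use with an RSelenium client named myclient (not run here):
# n <- scroll_until_stable(
#   count_items = function() length(
#     myclient$findElements(using = "css", value = ".review-snippet")),
#   scroll_once = function() myclient$sendKeysToActiveElement(
#     sendKeys = list(key = "page_down"))
# )
```

Passing the browser actions in as functions keeps the loop itself independent of RSelenium, so it can be tried out with stand-in functions before pointing it at a live session.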

The content you want to scrape is not static, so you need the help of RSelenium. This should work:

library(rvest)
library(xml2)
library(RSelenium)

rmDr=rsDriver(browser=c("chrome"), chromever="73.0.3683.68")
myclient= rmDr$client
myclient$navigate("https://www.google.co.uk/search?q=queen%27s+hospital+romford&oq=queen%27s+hospitql+&aqs=chrome.1.69i57j0l5.5843j0j4&sourceid=chrome&ie=UTF-8#lrd=0x47d8a4ce4aaaba81:0xf1185c71ae14d00,1,,,")
#click on the snippet to switch focus----------
webEle <- myclient$findElement(using = "css",value = ".review-snippet")
webEle$clickElement()
#simulate scroll down for several times-------------
scroll_down_times=20
for(i in 1 :scroll_down_times){
    webEle$sendKeysToActiveElement(sendKeys = list(key="page_down"))
    #the content needs time to load,wait 1 second every 5 scroll downs
    if(i%%5==0){
        Sys.sleep(1)
    }
}
#loop and simulate clicking on all "click on more" elements-------------
webEles <- myclient$findElements(using = "css",value = ".review-more-link")
for(webEle in webEles){
    tryCatch(webEle$clickElement(),error=function(e){print(e)}) # trycatch to prevent any error from stopping the loop
}
pagesource= myclient$getPageSource()[[1]]
#this should get you the full review, including translation and original text-------------
reviews=read_html(pagesource) %>%
    html_nodes(".review-full-text") %>%
    html_text()

#number of stars
stars <- read_html(pagesource) %>%
    html_node(".review-dialog-list") %>%
    html_nodes("g-review-stars > span") %>%
    html_attr("aria-label")


#time posted
post_time <- read_html(pagesource) %>%
    html_node(".review-dialog-list") %>%
    html_nodes(".dehysf") %>%
    html_text()
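
As a follow-up, the three vectors above can be combined into one data frame after parsing the page source a single time. The sketch below is an illustration only: the inline HTML is a made-up stand-in for `myclient$getPageSource()[[1]]` and the real Google markup differs; only the CSS selectors are taken from the code above. The length check is there because the vectors are extracted independently and are not guaranteed to line up.

```r
library(rvest)

# Made-up stand-in for the real page source (assumption, for illustration only)
pagesource <- '<div class="review-dialog-list">
  <div><g-review-stars><span aria-label="Rated 4.0 out of 5,"></span></g-review-stars>
    <span class="dehysf">a week ago</span>
    <span class="review-full-text">Staff were helpful.</span></div>
  <div><g-review-stars><span aria-label="Rated 2.0 out of 5,"></span></g-review-stars>
    <span class="dehysf">a month ago</span>
    <span class="review-full-text">Long waiting time.</span></div>
</div>'

# Parse once and reuse the parsed document
review_list <- read_html(pagesource) %>% html_node(".review-dialog-list")

reviews   <- review_list %>% html_nodes(".review-full-text") %>% html_text()
stars     <- review_list %>% html_nodes("g-review-stars > span") %>% html_attr("aria-label")
post_time <- review_list %>% html_nodes(".dehysf") %>% html_text()

# Only build the table when every review has all three fields
if (length(unique(lengths(list(reviews, stars, post_time)))) == 1) {
  review_df <- data.frame(review = reviews, stars = stars,
                          posted = post_time, stringsAsFactors = FALSE)
  print(review_df)
} else {
  warning("Vectors differ in length; extract the fields per review node instead.")
}
```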

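【Discussion】:

When running rmDr=rsDriver(browser = "chrome"), I seem to get an error message saying "Undefined error in httr call. httr output: Failed to connect to localhost port 4567: Connection refused".
This happens sometimes. Port 4567L may already be in use by another application; you can try a different port, such as 4444L or 4445L.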
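
The port suggestion above could look like the following sketch. `start_driver` is a hypothetical helper, not part of RSelenium; it only relies on the `port` and `browser` arguments of `rsDriver()`. It assumes that a busy port makes `rsDriver()` raise an R error, and it will also move on to the next port if startup fails for some unrelated reason.

```r
# Hypothetical helper: try each candidate port until rsDriver() starts.
start_driver <- function(ports = c(4567L, 4444L, 4445L), browser = "chrome") {
  for (port in ports) {
    driver <- tryCatch(
      RSelenium::rsDriver(port = port, browser = browser),
      error = function(e) {
        message("Port ", port, " failed: ", conditionMessage(e))
        NULL                          # signal failure so the next port is tried
      }
    )
    if (!is.null(driver)) return(driver)
  }
  stop("Could not start a Selenium server on any of the candidate ports.")
}

# Possible use (not run here):
# rmDr <- start_driver()
# myclient <- rmDr$client
```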
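Now when running pagesource= myclient$getPageSource()[[1]], I get another error message saying chrome not reachable\n (Session info: chrome=xxxxxx). Any idea how to fix this error?
I don't get this error. Which version of R are you using? Since the 3.5 update, RSelenium's dependency packages have recently become unavailable. If you are on the latest version, you could switch to an older version first.
Hello 严逸夫, I recently posted a question based on the one you answered here. I hope you can help. Here is the link to the question. Thanks stackoverflow.com/questions/50680985/…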