【Posted】: 2021-04-02 08:14:03
【Question】:
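I am facing a web scraping problem. I want to scrape a few comments from TripAdvisor. I would like to use rvest and get the comments in all languages. From this question I learned that one possible way is to append ?filterLang=ALL to the end of the URL. In a web browser, it does work: the page shows the comments with "All languages" selected (you can see many French comments).
Here is my problem. I try to get the titles of the reviews: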
library(rvest)
url <- "https://www.tripadvisor.com/Restaurant_Review-g187147-d2013853-Reviews-114_Faubourg-Paris_Ile_de_France.html?filterLang=ALL"
reviews_html <- read_html(url)
reviews_html %>%
html_nodes(xpath = "//span[@class='noQuotes']") %>%
html_text()
[1] "I've never visited this restaurant," "Perfect"
[3] "Memorable experience" "Tasty"
[5] "Absolutely spectacular" "Excellent"
[7] "Wonderfullll" "A Perfect Evening"
[9] "Dinner " "Perfect dinner and evening"
I only get the English ones. The strange part: if I try to get the number of pages:
reviews_html %>%
html_nodes(xpath = "//div[@data-tab='TABS_REVIEWS']//a[@data-page-number]")%>%
html_text()
[1] "Next" "1" "2" "3" "4" "5" "6" "176"
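I get the number of review pages that corresponds to the "All languages" selection! Compare this with the case where no language is selected: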
url <- "https://www.tripadvisor.com/Restaurant_Review-g187147-d2013853-Reviews-114_Faubourg-Paris_Ile_de_France.html"
reviews_html <- read_html(url)
reviews_html %>%
html_nodes(xpath = "//span[@class='noQuotes']") %>%
html_text()
[1] "I've never visited this restaurant," "Perfect"
[3] "Memorable experience" "Tasty"
[5] "Absolutely spectacular" "Excellent"
[7] "Wonderfullll" "A Perfect Evening"
[9] "Dinner " "Perfect dinner and evening"
I get the same comments, but:
reviews_html %>%
html_nodes(xpath = "//div[@data-tab='TABS_REVIEWS']//a[@data-page-number]")%>%
html_text()
[1] "Next" "1" "2" "3" "4" "5" "6" "61"
I get the number of pages that corresponds to the English selection. I also tried setting a cookie:
library(httr)
url <- "https://www.tripadvisor.com/Restaurant_Review-g187147-d2013853-Reviews-114_Faubourg-Paris_Ile_de_France.html?filterLang=ALL"
httr::GET(url,
set_cookies(`TALanguage` = "ALL",
`Domain` = ".tripadvisor.com"))%>%
read_html()%>%
html_nodes(xpath = "//span[@class='noQuotes']") %>%
html_text()
But that did not work either. Does anyone understand what is going on, and what I can do to actually get the comments in all languages using rvest?
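To make the page-count comparison concrete, here is a minimal sketch that works only on the pagination labels printed above, with no network access. The helper `last_page` is a hypothetical name introduced for illustration, not an rvest function, and it assumes the last page number is the largest purely numeric label.

```r
# Pagination labels copied from the two outputs shown in the question.
labels_all_lang <- c("Next", "1", "2", "3", "4", "5", "6", "176")  # with ?filterLang=ALL
labels_default  <- c("Next", "1", "2", "3", "4", "5", "6", "61")   # without the parameter

# Hypothetical helper (not part of rvest): keep only the purely numeric
# labels and return the largest one, taken here as the last page number.
last_page <- function(labels) {
  numeric_labels <- grep("^[0-9]+$", labels, value = TRUE)
  max(as.integer(numeric_labels))
}

last_page(labels_all_lang)
last_page(labels_default)
```

So the pagination changes with the URL parameter, while the review titles returned by `html_text()` are the same in both cases.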
【Discussion】:
Tags: r web-scraping rvest