【Question title】: rvest: language selection not working in tripadvisor
【Posted】: 2021-04-02 08:14:03
【Question】:

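I am facing a web scraping problem. I want to scrape some comments from tripadvisor using rvest, and I want the comments in all languages. From this question I learned that one possible way is to append ?filterLang=ALL to the end of the URL. In a web browser, it does work. Example: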

https://www.tripadvisor.com/Restaurant_Review-g187147-d2013853-Reviews-114_Faubourg-Paris_Ile_de_France.html?filterLang=ALL

This does show the comments with "All languages" selected (you can see many French comments). Here is my problem: I try to get the titles of the reviews:

library(rvest)
url <- "https://www.tripadvisor.com/Restaurant_Review-g187147-d2013853-Reviews-114_Faubourg-Paris_Ile_de_France.html?filterLang=ALL"

reviews_html <- read_html(url)

reviews_html %>%
  html_nodes(xpath = "//span[@class='noQuotes']") %>%
  html_text()

 [1] "I've never visited this restaurant," "Perfect"                            
 [3] "Memorable experience"                "Tasty"                              
 [5] "Absolutely spectacular"              "Excellent"                          
 [7] "Wonderfullll"                        "A Perfect Evening"                  
 [9] "Dinner "                             "Perfect dinner and evening" 

I only get the English ones. The strange thing is what happens if I try to get the number of pages:

reviews_html %>%
  html_nodes(xpath = "//div[@data-tab='TABS_REVIEWS']//a[@data-page-number]")%>%
  html_text()

[1] "Next" "1"    "2"    "3"    "4"    "5"    "6"    "176"

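I get the number of review pages that corresponds to the "All languages" selection! Compare this with the case without any language selection: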

url <- "https://www.tripadvisor.com/Restaurant_Review-g187147-d2013853-Reviews-114_Faubourg-Paris_Ile_de_France.html"

reviews_html <- read_html(url)

reviews_html %>%
  html_nodes(xpath = "//span[@class='noQuotes']") %>%
  html_text()

 [1] "I've never visited this restaurant," "Perfect"                            
 [3] "Memorable experience"                "Tasty"                              
 [5] "Absolutely spectacular"              "Excellent"                          
 [7] "Wonderfullll"                        "A Perfect Evening"                  
 [9] "Dinner "                             "Perfect dinner and evening" 

I get the same comments, but:

reviews_html %>%
  html_nodes(xpath = "//div[@data-tab='TABS_REVIEWS']//a[@data-page-number]")%>%
  html_text()

[1] "Next" "1"    "2"    "3"    "4"    "5"    "6"    "61" 

I get the number of pages that corresponds to the English selection. I also tried setting a cookie:

library(httr)

url <- "https://www.tripadvisor.com/Restaurant_Review-g187147-d2013853-Reviews-114_Faubourg-Paris_Ile_de_France.html?filterLang=ALL"
httr::GET(url, 
          set_cookies(`TALanguage` = "ALL",
                      `Domain` = ".tripadvisor.com"))%>%
  read_html()%>%
  html_nodes(xpath = "//span[@class='noQuotes']") %>%
  html_text()

But that did not work either. Does anyone understand what is happening, and what I can do to actually get the comments in all languages with rvest?

【Question discussion】:

    Tags: r web-scraping rvest


    【Solution 1】:

    When you select the filter manually, a POST request is sent to the same URL. Setting filterLang=ALL in the form body returns the data correctly:

    library(rvest)
    library(httr)
    
    reviews_html <- POST(
        "https://www.tripadvisor.com/Restaurant_Review-g187147-d2013853-Reviews-114_Faubourg-Paris_Ile_de_France.html",
        add_headers('x-requested-with'= 'XMLHttpRequest'),
        body = list(
          preferFriendReviews = "FALSE",
          t = "",
          q = "", # filter by mention, try "france"
          filterSeasons = "", # "1" is mar-may / "2" is jun-aug / "3" is sep-nov / "4" is dec-feb
          filterLang = "ALL", # try "zhCN" or "fr"
          filterSafety = "FALSE",
          filterSegment = "", # "3" is families / "2" is couples / "5" is solo / "1" is business / "4" is friends
          trating = "", # stars: "5" / "4" / "3" / "2" / "1" / "0"
          isLastPoll = "false",
          changeSet = "REVIEW_LIST"
        ), 
        encode = "form") %>%
        read_html()
    
    reviews <- reviews_html %>%
        html_nodes(xpath = "//span[@class='noQuotes']") %>%
        html_text()
    
    print(reviews)
    
    pages  <- reviews_html %>%
      html_nodes(xpath = "//div[@data-tab='TABS_REVIEWS']//a[@data-page-number]")%>%
      html_text()
    
    print(pages)
    

    In the code above, I have added short descriptions of the form fields in case you need those filters.
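
If several filter combinations are needed, the form body above can be wrapped in a small helper. This is only a sketch: the function names `build_review_filters` and `fetch_review_titles` and their argument names are hypothetical and not part of the original answer. The field names and default values are copied from the POST body shown above, and the example value "fr" comes from the comment in that code.

```r
# Hypothetical helper: builds the same form body as the POST request above,
# exposing the documented filters as arguments.
build_review_filters <- function(lang = "ALL", mention = "", season = "",
                                 segment = "", rating = "") {
  list(
    preferFriendReviews = "FALSE",
    t = "",
    q = mention,            # filter by mention
    filterSeasons = season, # "1" to "4", see the field descriptions above
    filterLang = lang,      # "ALL", "fr", "zhCN", ...
    filterSafety = "FALSE",
    filterSegment = segment,
    trating = rating,
    isLastPoll = "false",
    changeSet = "REVIEW_LIST"
  )
}

# Hypothetical wrapper: sends the POST and returns the review titles.
# Needs the httr, xml2 and rvest packages and network access when called.
fetch_review_titles <- function(url, ...) {
  resp <- httr::POST(
    url,
    httr::add_headers("x-requested-with" = "XMLHttpRequest"),
    body = build_review_filters(...),
    encode = "form"
  )
  page <- xml2::read_html(resp)
  rvest::html_text(rvest::html_nodes(page, xpath = "//span[@class='noQuotes']"))
}

# Example call (not run here):
# titles_fr <- fetch_review_titles(url, lang = "fr")
```

Keeping the body construction separate from the request makes it possible to inspect the filters without contacting the site.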

    kaggle link

    Output:

     [1] "I've never visited this restaurant," "Excellente expérience"              
     [3] "Du grand art"                        "Promesse tenue"                     
     [5] "Une soirée de rêve en famille"       "Délicieux !!! "                     
     [7] "Une expérience inoubliable"          "UN CERTAIN REGARD"                  
     [9] "Excellent soiree en couple"          "Une soirée magnifique"              
    [1] "Next" "1"    "2"    "3"    "4"    "5"    "6"    "176"
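
The total number of review pages can be read from this vector by dropping the non-numeric "Next" label. A minimal sketch, assuming the pagination links always look like the output above; the helper name `last_page_number` is hypothetical.

```r
# Hypothetical helper: returns the largest page number among the link labels.
last_page_number <- function(pages) {
  nums <- suppressWarnings(as.integer(pages)) # "Next" becomes NA
  max(nums, na.rm = TRUE)
}

pages <- c("Next", "1", "2", "3", "4", "5", "6", "176")
last_page_number(pages) # 176
```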
    

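    【Comments】:

    • Thanks! Perfect description of the filters. Do you know why the URL does not do the trick?
    • @denis Actually, the behavior in R is correct: when you open an incognito window and go to tripadvisor.com/…, the reviews look exactly like the ones in your post. If you then refresh the page, the reviews are correct. It is related to the cookie and session mechanism that is set up on the first visit. I did not investigate further, since there is a simpler way to proceed (and in a single request).
    • @denis If the behavior was different before, this may be a bug in the UI or on the server side.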
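
The "refresh" behavior described in the comments could be imitated by requesting the same URL twice, since httr keeps cookies between requests to the same domain within one R session. The sketch below has not been verified against the live site, and the helper name `fetch_after_warmup` and its `request_fun` argument are hypothetical. The POST approach from the answer remains the simpler route.

```r
# Hypothetical helper, not from the original answer: request the URL twice so
# that cookies set by the first response are sent along with the second one.
fetch_after_warmup <- function(url, request_fun = httr::GET) {
  request_fun(url) # first visit: response discarded, cookies kept by httr
  request_fun(url) # "refresh": this response is returned
}

# Example call (not run here, needs network access):
# resp <- fetch_after_warmup(url)
# page <- xml2::read_html(resp)
# rvest::html_text(rvest::html_nodes(page, xpath = "//span[@class='noQuotes']"))
```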