【Question title】: Separate strings returned by html_text in rvest
【Posted】: 2020-02-13 18:18:19
【Question】:

I am trying to extract the amenities for a hotel using rvest.

library(rvest)
hotel_url <- "https://www.tripadvisor.com/Hotel_Review-g187791-d13494726-Reviews-Palazzo_Caruso-Rome_Lazio.html"
# parse the page first; the original snippet used `hotel` without defining it
hotel <- read_html(hotel_url)
amenities <- hotel %>%
    html_node(".hotels-hr-about-amenities-AmenityGroup__amenitiesList--3MdFn") %>%
    html_text()

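The resulting text does not separate one amenity from the next:

[1] "Paid private parking nearbyFree High Speed Internet (WiFi)Coffee shopBicycle toursWalking toursCar hireFax / photocopyingBaggage storageFree internetWifiPublic wifiInternetBreakfast availableBreakfast in the roomConciergeExecutive lounge accessNon-smoking hotelSun terrace24-hour front deskPrivate check-in / check-outLaundry service"

Is there a way to add a separator (for example ";") between the amenities?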

【Comments】:

  • gsub("([a-z])([A-Z])", "\\1 \\2", string). I know "(WiFi)Coffee" would still be a problem, but maybe it gets you closer to an answer. I'm on my phone at the moment.
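A minimal sketch of the regex heuristic from the comment above, run on a shortened copy of the string from the question (the variable names `raw` and `spaced` are illustrative assumptions). It inserts a space wherever a lowercase letter is directly followed by an uppercase one, which shows the weakness the commenter mentions: "WiFi" is split, while the boundary after ")" is missed.

```r
# Shortened copy of the concatenated amenities string from the question
raw <- "Paid private parking nearbyFree High Speed Internet (WiFi)Coffee shop"

# Insert a space between a lowercase letter and an immediately following uppercase letter
spaced <- gsub("([a-z])([A-Z])", "\\1 \\2", raw)

print(spaced)
# [1] "Paid private parking nearby Free High Speed Internet (Wi Fi)Coffee shop"
```

Because amenity boundaries are not reliably marked by a case change, a regex like this can only approximate the split; reading the individual nodes avoids the problem.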

Tags: r web-scraping rvest stringr


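【Solution 1】:

You need to go one or two levels deeper into the HTML structure to pull the text out as a list. The html_children() function lets you do that.
See the code comments for details: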

library(rvest)
library(xml2)  # xml_length() is an xml2 function; load it explicitly in case rvest does not attach it

hotel_url <- "https://www.tripadvisor.com/Hotel_Review-g187791-d13494726-Reviews-Palazzo_Caruso-Rome_Lazio.html"
hotel <- read_html(hotel_url)

# take the child nodes of the amenities list container
amenities <- hotel %>%
  html_node(".hotels-hr-about-amenities-AmenityGroup__amenitiesList--3MdFn") %>%
  html_children()

# last child node is the unhighlighted amenities
# get text for highlighted amenities
highlighted <- amenities[xml_length(amenities) == 1] %>% html_text()
# drill down 1 more level for unhighlighted amenities
unhighlighted <- amenities[xml_length(amenities) > 1] %>% html_children() %>% html_text()



> highlighted
[1] "Paid private parking nearby"     "Free High Speed Internet (WiFi)" "Coffee shop"                     "Bicycle tours"                  
[5] "Walking tours"                   "Car hire"                        "Fax / photocopying"              "Baggage storage"                
> unhighlighted
 [1] "Free internet"                "Wifi"                         "Public wifi"                  "Internet"                    
 [5] "Breakfast available"          "Breakfast in the room"        "Concierge"                    "Executive lounge access"     
 [9] "Non-smoking hotel"            "Sun terrace"                  "24-hour front desk"           "Private check-in / check-out"
[13] "Laundry service" 
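
To get the single ";"-separated string the question asks for, the two character vectors can be joined with base R's paste(). The sketch below is only an illustration: it hard-codes a few of the values shown above so that it runs without fetching the page, and the name `amenities_text` is an assumption, not part of the original answer.

```r
# A few of the values printed above, hard-coded so the example runs offline
highlighted   <- c("Paid private parking nearby", "Free High Speed Internet (WiFi)", "Coffee shop")
unhighlighted <- c("Free internet", "Wifi")

# Combine both vectors and collapse them into one string with "; " as separator
amenities_text <- paste(c(highlighted, unhighlighted), collapse = "; ")

print(amenities_text)
# [1] "Paid private parking nearby; Free High Speed Internet (WiFi); Coffee shop; Free internet; Wifi"
```

Keeping the amenities as a vector until the last step means the separator can be changed, or the items counted and filtered, without re-parsing the text.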

【Comments】:
