Using rvest to scrape data that is not in a table

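【Question title】: Using rvest to scrape data that is not in a table
【Posted】: 2020-07-16 06:09:07
【Question】:

I am trying to scrape some data from a website. I thought I could use rvest, but I am having a lot of trouble getting data that is not in a table.

I don't know whether this is even possible, or whether I am using the wrong package.

I am trying to get the website, name, and address from the following html: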

<div class="info clearfix">
<i class="sprite icon title"></i>
<p class="title">
<a target="_blank" href="https://test.com/regions/Tennis_Court.html">
Tennis Court</a>
</p>
<p class="location"> 123 Page St, Charlestown</p>
<p class="excerpt" itemprop="description">A place to play tennis</p>
</div>

I was hoping I could use something like html_node("title"), but that doesn't seem to work. Am I completely on the wrong track?

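【Question comments】:

Could you share the URL you are trying to extract data from and say exactly which data you want to extract?
@RonakShah I am using concreteplayground.com/auckland/bars and trying to extract the name, address, and link to each venue's page (e.g. the first one is "Holy Hop", "498 New North Road, Kingsland", and "concreteplayground.com/auckland/bars/holy-hop").

Tags: r rvest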

【Solution 1】:

You can use html_nodes with CSS selectors to extract them:

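library(rvest)

url <- 'https://concreteplayground.com/auckland/bars'

# Download and parse the page
webpage <- url %>% read_html()

# Venue names are the link text inside <p class="name">
name <- webpage %>% html_nodes('p.name a') %>% html_text() %>% trimws()
# Addresses are in <p class="address">
address <- webpage %>% html_nodes('p.address') %>% html_text() %>% trimws()
# Each venue's page URL is the href of the same link
links <- webpage %>% html_nodes('p.name a') %>% html_attr('href')

data.frame(name, address, links)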

#                              name                                address
#1                         Holy Hop          498 New North Road, Kingsland
#2                              Sly          354A Karangahape Road, Newton
#...
#...

#                                                                 links
#1                         https://concreteplayground.com/auckland/bars/holy-hop
#2                              https://concreteplayground.com/auckland/bars/sly
#...
#...
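
As a side note on why html_node("title") found nothing: a bare "title" selector matches a <title> element, while ".title" (or "p.title") matches class="title". The sketch below illustrates this offline against the snippet from the question. The second entry (Squash Court, with no address) is a made-up addition, used only to show that selecting each entry's container first and then calling html_node inside it keeps the columns aligned when a field is missing. The selector "div.info" and the column names are assumptions based on the question's markup, not on the live site.

```r
library(rvest)

# Hypothetical stand-in for a page: the snippet from the question plus a
# second, invented entry that has no address.
html <- '
<div class="info clearfix">
  <p class="title">
    <a target="_blank" href="https://test.com/regions/Tennis_Court.html">
    Tennis Court</a>
  </p>
  <p class="location"> 123 Page St, Charlestown</p>
  <p class="excerpt" itemprop="description">A place to play tennis</p>
</div>
<div class="info clearfix">
  <p class="title">
    <a target="_blank" href="https://test.com/regions/Squash_Court.html">
    Squash Court</a>
  </p>
</div>'

page <- read_html(html)

# "title" looks for a <title> tag; "p.title" looks for <p class="title">.
n_title_tags  <- length(html_nodes(page, "title"))
n_title_class <- length(html_nodes(page, "p.title"))

# Select each entry's container, then query inside it with html_node(),
# which returns one result per container (NA where nothing matches).
entries <- html_nodes(page, "div.info")
result <- data.frame(
  name    = entries %>% html_node("p.title a") %>% html_text() %>% trimws(),
  address = entries %>% html_node("p.location") %>% html_text() %>% trimws(),
  link    = entries %>% html_node("p.title a") %>% html_attr("href"),
  stringsAsFactors = FALSE
)
result
```

Compared with building each column from a separate page-wide html_nodes call, this per-container approach avoids data.frame failing on vectors of different lengths when some entries lack a field.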

【Comments】: