data-anchor text - Web-scraping rvest question: answers

【Question title】: data-anchor text - Web-scraping rvest question
【Posted】: 2021-08-27 20:21:34
【Question description】:

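I am trying to scrape this page: https://www.scielo.br/j/rcf/a/M6Ck7FmWQvm8nTCWkLBXLhp/?lang=pt

I need to scrape many more pages like this one, but their layout is not identical. I can scrape the text with the XPath //*[@id="articleText"]/div[1], but what I actually want is to select the div by its attributes: class="articleSection" and data-anchor="Text".

The position of the div changes from link to link, but the data-anchor="Text" attribute does not.

I added this image to provide some context:

R code: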

library(dplyr)
library(rvest)

article <- "https://www.scielo.br/j/rcf/a/h9fbHLPbwgRVymxmtxNhKJR/?lang=pt&format=html" # link

article_text <- article %>%
  rvest::read_html() %>% 
  rvest::html_node(xpath='//*[@id="articleText"]/div[1]') %>% # here I would like to scrape from data-anchor name "Text", inside the div Article Section
  rvest::html_text()

【Question discussion】:

Tags: html r web-scraping rvest

【Solution 1】:

You can use an attribute=value CSS selector to match on the attribute:

library(magrittr)
library(rvest)

article <- "https://www.scielo.br/j/rcf/a/h9fbHLPbwgRVymxmtxNhKJR/?lang=pt&format=html" # link

article_text <- article %>%
  rvest::read_html() %>% 
  rvest::html_node('[data-anchor=Text]') %>% 
  rvest::html_text2()
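
To see why the attribute selector does not depend on the div's position, here is a small offline sketch. The HTML string is made up for illustration and is not the real SciELO markup; only the class `articleSection` and the `data-anchor` attribute come from the question. The second selector, which also requires the class and quotes the value, is an optional stricter variant and not part of the original answer.

```r
library(rvest)

# Hypothetical markup loosely modelled on the question's description.
# The "Text" section is NOT the first div, so div[1] would pick the wrong one.
html <- '<div id="articleText">
  <div class="articleSection" data-anchor="Resumo"><p>Abstract goes here.</p></div>
  <div class="articleSection" data-anchor="Text"><p>Body of the article.</p></div>
</div>'

page <- rvest::read_html(html)

# Attribute selector alone, as in the answer above
by_attr <- page %>%
  rvest::html_node('[data-anchor=Text]') %>%
  rvest::html_text2()

# Stricter variant (an assumption about what you may want):
# require the class as well, and quote the attribute value
by_class_attr <- page %>%
  rvest::html_node('div.articleSection[data-anchor="Text"]') %>%
  rvest::html_text2()

print(by_attr)
print(by_class_attr)
```

Both selectors should pick the same div here; the stricter form only matters if some other element on a page also carries data-anchor="Text".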

【Discussion】:

【Solution 2】:

I think this XPath solves your problem:

//*[contains(@class,'articleSection') and @data-anchor='Text']
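
As a rough sketch of how this XPath could be plugged into rvest and reused across several pages: the helper name `scrape_text_section` and the two inline HTML strings below are hypothetical and stand in for real article URLs, so the example runs without network access.

```r
library(rvest)

# XPath from the answer above
text_xpath <- "//*[contains(@class,'articleSection') and @data-anchor='Text']"

# Hypothetical helper: `x` is anything read_html() accepts
# (a URL, a file path, or a literal HTML string)
scrape_text_section <- function(x) {
  x %>%
    rvest::read_html() %>%
    rvest::html_node(xpath = text_xpath) %>%
    rvest::html_text2()
}

# Two made-up pages in which the "Text" div sits at different positions
pages <- c(
  '<div id="articleText"><div class="articleSection" data-anchor="Text"><p>First article.</p></div></div>',
  '<div id="articleText"><div class="articleSection" data-anchor="Resumo"><p>Summary.</p></div><div class="articleSection" data-anchor="Text"><p>Second article.</p></div></div>'
)

# One character string per page
texts <- vapply(pages, scrape_text_section, character(1), USE.NAMES = FALSE)
print(texts)
```

With real data, a character vector of article URLs would take the place of `pages`.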

【Discussion】: