Parse CDATA with R

【Question title】: Parse CDATA with R
【Posted】: 2018-03-29 19:53:02
【Question】:

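I am scraping and analyzing data from a car auction site. My goal is to build skills in date-time handling and sentiment analysis, and I like old cars. The site is Bring A Trailer; they don't offer API access (I asked), but robots.txt permits crawling.

SO user "42" pointed out that BAT's terms do not allow this, so I have removed their base URL. I may delete this question. On reflection, I can do what I want by saving a few web pages from my browser and analyzing that data. I don't need every auction; I was just following a tutorial, and here I am reading the TOS instead of doing what I wanted to do...

Some of the data is easy to access, but the best parts are hard to reach, and that is where I am stuck. I am really looking for advice on my approach.

My first steps work: I can find the web pages and cache them locally: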

library(tidyverse)
library(rvest)

data_dir <- "bat_data-html/"

# Step 1: Create list of links to listings ----------------------------
base_url <- "https://"
pages <- read_html(file.path(base_url,"/auctions/")) %>%
  html_nodes(".auctions-item-title a") %>%
  html_attr("href") %>%
  file.path

pages <- head(pages, 3) # use a subset for testing code

# Step 2 : Save auction pages locally ---------------------------------
dir.create(data_dir, showWarnings = FALSE)
p <- progress_estimated(length(pages))

# Download each auction page
walk(pages, function(url){
  download.file(url, destfile = file.path(data_dir, basename(url)), quiet = TRUE)
  p$tick()$print()
})

I can also process the auction metadata from these cached pages, using SelectorGadget to identify the CSS selectors and passing them to rvest:

# Step 3: Process each auction info into df ----------------------------
files <- dir(data_dir, pattern = "*", full.names = TRUE)

# Function: get_auction_details, to be applied to each auction page
get_auction_details <- function(file) {
  pagename <- basename(file) # the filename of the page (trailing index for multiples)
  page <- read_html(file)   # read the html into R ( consider , options = "NOCDATA")
  # Grab the title of the auction stored in the ".listing-post-title" tag on the page
  title <- page %>% html_nodes(".listing-post-title") %>% html_text()
  # Grab the "BAT essentials" of the auction stored in the ".listing-essentials-item" tag on the page
  essence <- page %>% html_nodes(".listing-essentials-item") %>% html_text()
  # Assemble into a data frame
  info_tbl0 <- as_tibble(essence)
  info_tbl <- add_row(info_tbl0, value = title, .before = 1)
  names(info_tbl) [1] <- pagename
  return(info_tbl)
} 

# Apply the get_auction_details function to each element of files

bat0 <- map_df(files, get_auction_details)         # run function
bat <- gather(bat0) %>% subset(value != "NA")      # serialize results

# Save as csv
write_csv(bat, path = "data-csv/bat04.csv") # this table contains the expected metadata:

key,value
1931-ford-model-a-12,Modified 1931 Ford Model A Pickup
1931-ford-model-a-12,Lot #8576
1931-ford-model-a-12,Seller: TargaEng

But the auction data (bids, comments) is in a CDATA section:

<script type='text/javascript'>
/* <![CDATA[ */
var BAT_VMS = { ...bids, comments, results  
/* ]]> */
</script>

I have tried selecting elements in this section using paths I found with SelectorGadget, but they are not found; this returns an empty list:

tmp <- page %>% html_nodes(".comments-list") %>% html_text()

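Looking at the text in this CDATA section, I see some XML tags, but in the cached file it is not structured the way it is when I inspect the auction section of the live web page.

To extract this information, should I try to parse the contents of this CDATA section "as is", or can I transform it so that it can be parsed like XML? Or am I barking up the wrong tree?

Any advice is appreciated!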

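【Comments on the question】:

I think you also need to check what the terms of service say about automated downloading of content. robots.txt is really just the flag for whether they allow spiders/crawlers to index the site. The "Terms" page says no scraping.
That is disappointing; I guess I need to find another auction site.
Thanks for your reply. I think I should take this question down.

Tags: r xml web-scraping rvest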

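【Solution 1】:

It is buried in the xml2 documentation, but you can pass this parser option to keep the CDATA content: with NOCDATA, CDATA sections are merged into ordinary text nodes.

# Instead of rvest::read_html()
# `file` is the path to a cached page, as in get_auction_details() above
page <- xml2::read_xml(file, options = "NOCDATA")

Once the page has been read this way, you should be able to access the comments list the way you wanted.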

tmp <- page %>% html_nodes(".comments-list") %>% html_text()
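
If the selector still returns nothing, a different technique may be worth trying: treat the script body as plain text and parse the object assigned to BAT_VMS as JSON. The sketch below is only an illustration under stated assumptions. The miniature page, the field names `author` and `text`, and the user names are all hypothetical (the real object is elided in the question), it assumes the assigned value is valid JSON terminated by a semicolon, and it uses the jsonlite package, which does not appear in the original post.

```r
library(xml2)
library(jsonlite)

# Hypothetical miniature page standing in for a cached auction file.
# The real BAT_VMS object is not shown in the question.
html <- "<html><body><script type='text/javascript'>
/* <![CDATA[ */
var BAT_VMS = {\"comments\":[{\"author\":\"user_a\",\"text\":\"Nice truck\"},{\"author\":\"user_b\",\"text\":\"Good luck with the sale\"}]};
/* ]]> */
</script></body></html>"

page <- read_html(html)

# Collect the raw text of every <script> element
scripts <- xml_text(xml_find_all(page, "//script"))

# Keep the first script that defines BAT_VMS
target <- scripts[grepl("var BAT_VMS", scripts, fixed = TRUE)][1]

# Capture the object literal between "var BAT_VMS =" and the closing ";"
# (assumes the value is valid JSON, which may not hold for the real page)
json <- sub("(?s).*var BAT_VMS\\s*=\\s*(\\{.*\\})\\s*;.*", "\\1",
            target, perl = TRUE)

# Parse the JSON; an array of objects becomes a data frame
vms <- fromJSON(json)
comments <- vms$comments
print(comments)
```

For a real cached page, `read_html(file)` would replace the inline string. If the payload contains markup inside its string values, those values could then be passed to `read_html()` individually.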

【Discussion】: