HTML table scraping in R: answers

【Question title】: HTML table scraping in R
【Posted】: 2019-03-01 18:07:31
【Question】:

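I am trying to get the table at the following URL: https://www.cenace.gob.mx/DocsMEM/OpeMdo/OfertaCompVent/OferVenta/MDA/Termicas/OfeVtaTermicaHor%20SIN%20MDA%20Hor%202018-12-31%20v2019%2003%2001_01%2000%2001.html

The problem is that the table is not a plain HTML table, so html_table() does not work.

So far I have tried extracting the table's nodes, but that returns nothing: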

library(rvest)

url <- "https://www.cenace.gob.mx/DocsMEM/OpeMdo/OfertaCompVent/OferVenta/MDA/Termicas/OfeVtaTermicaHor%20BCS%20MDA%20Hor%202018-12-26%20v2019%2002%2024_01%2000%2001.html"
webpage <- read_html(url)
table_html <- html_nodes(webpage, 'table#Tabc')
table <- html_table(table_html)

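【Comments】:

Tags: html r web web-scraping

【Solution 1】:

The problem here is that the page is rendered by JavaScript, so rvest on its own will not work. One of the simplest ways around this is a headless web browser. We can use PhantomJS.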
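Before setting up a browser, the diagnosis can be checked against the raw HTML. The sketch below is a toy stand-in rather than the real page: it assumes the server sends an empty table shell with the data sitting in a script block. The HTML string is made up; only the id Tabc (from the question) is taken from this thread.

```r
library(xml2)

# Made-up stand-in for the HTML the server sends before any script runs.
# The id Tabc comes from the question; the script body is invented.
static_html <- paste0(
  '<html><body><table id="Tabc"></table>',
  '<script>var vdatrep = [["A1","ECO",1]];</script>',
  '</body></html>'
)

page <- read_html(static_html)

# The table node exists but has no rows until a browser runs the script.
rows <- xml_find_all(page, "//table[@id='Tabc']//tr")
length(rows)

# The data is still present, just as JavaScript source inside <script>.
scripts <- xml_text(xml_find_all(page, "//script"))
any(grepl("vdatrep", scripts, fixed = TRUE))
```

If the same two checks on the real page show no rows but data in the scripts, a rendering step (or parsing the script text directly) is needed.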

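First, download the appropriate version of PhantomJS and put the executable in your working directory. Assuming Windows, that means phantomjs.exe sits in the working directory of your R script.

Create a scrape.js file: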

// scrape.js

var webPage = require('webpage');
var page = webPage.create();

var fs = require('fs');
var path = 'page.html';

page.open('https://www.cenace.gob.mx/DocsMEM/OpeMdo/OfertaCompVent/OferVenta/MDA/Termicas/OfeVtaTermicaHor%20BCS%20MDA%20Hor%202018-12-26%20v2019%2002%2024_01%2000%2001.html', function (status) {
  var content = page.content;
  fs.write(path,content,'w');
  phantom.exit();
});

Once it has run, this scrape.js file creates a page.html file in your working directory. Back in R or RStudio, you can then do the following:

library(tidyverse)
library(rvest)

# Run scrape.js with PhantomJS to create the file page.html
system("./phantomjs scrape.js")

# Now we should be in business as usual:
read_html('page.html') %>%
  html_nodes("table#Tabc") %>%
  html_table(header = TRUE) %>%
  .[[1]] %>%
  as_tibble()

# A tibble: 504 x 38
   Codigo `Estatus asigna~  Hora `Limite de desp~ `Limite de desp~ `Costo de Opera~ `Bloque de Pote~ `Costo Incremen~ `Bloque de Pote~
   <chr>  <chr>            <int>            <dbl>            <dbl>            <dbl>            <dbl>            <dbl>            <dbl>
 1 BTY5W~ ECO                  1               35               20           43212.              1.5            1762.              1.5
 2 BTY5W~ ECO                  2               35               20           43212.              1.5            1762.              1.5
 3 BTY5W~ ECO                  3               35               20           43212.              1.5            1762.              1.5
 4 BTY5W~ ECO                  4               35               20           43212.              1.5            1762.              1.5
 5 BTY5W~ ECO                  5               35               20           43212.              1.5            1762.              1.5
 6 BTY5W~ ECO                  6               35               20           43212.              1.5            1762.              1.5
 7 BTY5W~ ECO                  7               35               20           43212.              1.5            1762.              1.5
 8 BTY5W~ ECO                  8               35               20           43212.              1.5            1762.              1.5
 9 BTY5W~ ECO                  9               35               20           43212.              1.5            1762.              1.5
10 BTY5W~ ECO                 10               35               20           43212.              1.5            1762.              1.5
# ... with 494 more rows, and 29 more variables: `Costo Incremental de generacion Bloque 02 ($/MWh)` <dbl>, `Bloque de Potencia 03 (MW)` <dbl>,
#   `Costo Incremental de generacion Bloque 03 ($/MWh)` <dbl>, `Bloque de Potencia 04 (MW)` <dbl>, `Costo Incremental de generacion Bloque 04
#   ($/MWh)` <dbl>, `Bloque de Potencia 05 (MW)` <dbl>, `Costo Incremental de generacion Bloque 05 ($/MWh)` <dbl>, `Bloque de Potencia 06
#   (MW)` <dbl>, `Costo Incremental de generacion Bloque 06 ($/MWh)` <dbl>, `Bloque de Potencia 07 (MW)` <dbl>, `Costo Incremental de generacion
#   Bloque 07 ($/MWh)` <dbl>, `Bloque de Potencia 08 (MW)` <dbl>, `Costo Incremental de generacion Bloque 08 ($/MWh)` <dbl>, `Bloque de Potencia
#   09 (MW)` <dbl>, `Costo Incremental de generacion Bloque 09 ($/MWh)` <dbl>, `Bloque de Potencia 10 (MW)` <dbl>, `Costo Incremental de
#   generacion Bloque 10 ($/MWh)` <dbl>, `Bloque de Potencia 11 (MW)` <dbl>, `Costo Incremental de generacion Bloque 11 ($/MWh)` <dbl>, `Reserva
#   rodante 10 min (MW)` <dbl>, `Costo Reserva rodante 10 min ($/MW)` <dbl>, `Reserva no rodante 10 min (MW)` <dbl>, `Costo Reserva no rodante 10
#   min ($/MW)` <dbl>, `Reserva rodante suplementaria (MW)` <dbl>, `Costo Reserva rodante suplementaria ($/MW)` <dbl>, `Reserva no rodante
#   suplementaria (MW)` <dbl>, `Costo Reserva no rodante suplementaria ($/MW)` <dbl>, `Reserva regulacion secundaria (MW)` <dbl>, `Costo Reserva
#   regulacion secundaria ($/MW` <dbl>

Update: extending to multiple URLs

First, change the scrape.js file to accept arguments:

// scrape2.js

var webPage = require('webpage');
var page = webPage.create();
var system = require('system');
var args = system.args;

var fs = require('fs');
var path = args[2];

page.open(args[1], function (status) {
  var content = page.content;
  fs.write(path,content,'w');
  phantom.exit();
});

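Next, create the lists to loop/iterate/map over (obviously this could be cleaned up and abstracted to be more maintainable and to need less typing):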

urls <- list(
  'https://www.cenace.gob.mx/DocsMEM/OpeMdo/OfertaCompVent/OferVenta/MDA/Termicas/OfeVtaTermicaHor%20BCS%20MDA%20Hor%202018-12-26%20v2019%2002%2024_01%2000%2001.html',
  'https://www.cenace.gob.mx/DocsMEM/OpeMdo/OfertaCompVent/OferVenta/MDA/Termicas/OfeVtaTermicaHor%20SIN%20MDA%20Hor%202018-12-29%20v2019%2002%2027_01%2000%2001.html',
  'https://www.cenace.gob.mx/DocsMEM/OpeMdo/OfertaCompVent/OferVenta/MDA/Termicas/OfeVtaTermicaHor%20SIN%20MDA%20Hor%202018-12-29%20v2019%2002%2027_01%2000%2001.html'
)

paths <- list(
  'page1.html',
  'page2.html',
  'page3.html'
)

args_list <- map2(urls, paths, paste)

# We are only using this function for the file creation side-effects,
# so we can use walk instead of map. 
# This creates the files: page1.html, page2.html, and page3.html 
walk(args_list, ~ system(paste("./phantomjs scrape2.js", .)))
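The hand-typed lists above could instead be generated from dates. The following is a minimal sketch, not a documented CENACE scheme: build_url, fetch_page and the region argument are hypothetical names. The URL pattern is inferred only from the example URLs in this thread, where the "v" date is always the operating date plus 60 days and the "_01 00 01" suffix never changes. Both are assumptions that may not hold for other dates.

```r
# Hypothetical helper; the pattern is inferred from the example URLs only.
# Assumptions: "v" date = operating date + 60 days, suffix "_01 00 01" is fixed.
build_url <- function(date, region = "BCS") {
  date <- as.Date(date)
  file <- sprintf(
    "OfeVtaTermicaHor %s MDA Hor %s v%s_01 00 01.html",
    region, format(date, "%Y-%m-%d"), format(date + 60, "%Y %m %d")
  )
  base <- "https://www.cenace.gob.mx/DocsMEM/OpeMdo/OfertaCompVent/OferVenta/MDA/Termicas/"
  paste0(base, gsub(" ", "%20", file, fixed = TRUE))
}

# Hypothetical wrapper around the PhantomJS call. shQuote() protects the URL
# from the shell, and the checks fail loudly when no file was written.
fetch_page <- function(url, path, phantom = "./phantomjs") {
  status <- system2(phantom, c("scrape2.js", shQuote(url), shQuote(path)))
  if (status != 0 || !file.exists(path)) stop("PhantomJS failed for ", url)
  invisible(path)
}

dates <- seq(as.Date("2018-12-26"), by = "day", length.out = 3)
urls  <- build_url(dates)
paths <- sprintf("page%d.html", seq_along(urls))

# Not run here, since it needs PhantomJS and network access:
# invisible(Map(fetch_page, urls, paths))
```

Because the URLs are now built in R, looping over many dates only requires changing the dates vector; scrape2.js stays untouched.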

At this point you will probably want to wrap the scraping in a function:

read_page <- function(page) {
  read_html(page) %>%
    html_nodes("table#Tabc") %>%
    html_table(header = TRUE) %>%
    .[[1]] %>%
    as_tibble()
}

From there you can reuse the paths list and map your new function over it:

paths %>%
  map(~ read_page(.)) %>%
  bind_rows()

# A tibble: 9,000 x 38
   Codigo `Estatus asigna~  Hora `Limite de desp~ `Limite de desp~ `Costo de Opera~ `Bloque de Pote~ `Costo Incremen~ `Bloque de Pote~
   <chr>  <chr>            <int>            <dbl>            <dbl>            <dbl>            <dbl>            <dbl>            <dbl>
 1 BTY5W~ ECO                  1               35               20           43212.              1.5            1762.              1.5
 2 BTY5W~ ECO                  2               35               20           43212.              1.5            1762.              1.5
 3 BTY5W~ ECO                  3               35               20           43212.              1.5            1762.              1.5
 4 BTY5W~ ECO                  4               35               20           43212.              1.5            1762.              1.5
 5 BTY5W~ ECO                  5               35               20           43212.              1.5            1762.              1.5
 6 BTY5W~ ECO                  6               35               20           43212.              1.5            1762.              1.5
 7 BTY5W~ ECO                  7               35               20           43212.              1.5            1762.              1.5
 8 BTY5W~ ECO                  8               35               20           43212.              1.5            1762.              1.5
 9 BTY5W~ ECO                  9               35               20           43212.              1.5            1762.              1.5
10 BTY5W~ ECO                 10               35               20           43212.              1.5            1762.              1.5
# ... with 8,990 more rows, and 29 more variables: `Costo Incremental de generacion Bloque 02 ($/MWh)` <dbl>, `Bloque de Potencia 03 (MW)` <dbl>,
#   `Costo Incremental de generacion Bloque 03 ($/MWh)` <dbl>, `Bloque de Potencia 04 (MW)` <dbl>, `Costo Incremental de generacion Bloque 04
#   ($/MWh)` <dbl>, `Bloque de Potencia 05 (MW)` <dbl>, `Costo Incremental de generacion Bloque 05 ($/MWh)` <dbl>, `Bloque de Potencia 06
#   (MW)` <dbl>, `Costo Incremental de generacion Bloque 06 ($/MWh)` <dbl>, `Bloque de Potencia 07 (MW)` <dbl>, `Costo Incremental de generacion
#   Bloque 07 ($/MWh)` <dbl>, `Bloque de Potencia 08 (MW)` <dbl>, `Costo Incremental de generacion Bloque 08 ($/MWh)` <dbl>, `Bloque de Potencia
#   09 (MW)` <dbl>, `Costo Incremental de generacion Bloque 09 ($/MWh)` <dbl>, `Bloque de Potencia 10 (MW)` <dbl>, `Costo Incremental de
#   generacion Bloque 10 ($/MWh)` <dbl>, `Bloque de Potencia 11 (MW)` <dbl>, `Costo Incremental de generacion Bloque 11 ($/MWh)` <dbl>, `Reserva
#   rodante 10 min (MW)` <dbl>, `Costo Reserva rodante 10 min ($/MW)` <dbl>, `Reserva no rodante 10 min (MW)` <dbl>, `Costo Reserva no rodante 10
#   min ($/MW)` <dbl>, `Reserva rodante suplementaria (MW)` <dbl>, `Costo Reserva rodante suplementaria ($/MW)` <dbl>, `Reserva no rodante
#   suplementaria (MW)` <dbl>, `Costo Reserva no rodante suplementaria ($/MW)` <dbl>, `Reserva regulacion secundaria (MW)` <dbl>, `Costo Reserva
#   regulacion secundaria ($/MW` <dbl>

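【Comments】:

This works perfectly, with just one problem. My main goal is to build a loop over many dates, but since the URL, and with it the date parameter, is an input at the phantom prompt rather than in R, I would like to know whether I can create something like a dummy url html.page in phantom and then supply the correct URL from R.
@Garcher Ah, I see. I am sure there is a way to pass arguments to the system call, I just do not know it off the top of my head. It might be worth asking a PhantomJS-specific question. I will take a few minutes to see whether I can figure it out. Do you have some dates/URLs you could add to your question as examples?
Perfect, I will take your advice. Maybe these examples will help: cenace.gob.mx/DocsMEM/OpeMdo/OfertaCompVent/OferVenta/MDA/… cenace.gob.mx/DocsMEM/OpeMdo/OfertaCompVent/OferVenta/MDA/… cenace.gob.mx/DocsMEM/OpeMdo/OfertaCompVent/OferVenta/MDA/…

【Solution 2】:

The following is not very elegant, but it should work!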

library(curl)
library(xml2)

url <- "https://www.cenace.gob.mx/DocsMEM/OpeMdo/OfertaCompVent/OferVenta/MDA/Termicas/OfeVtaTermicaHor%20BCS%20MDA%20Hor%202018-12-26%20v2019%2002%2024_01%2000%2001.html"

# Fetch the raw page. Note that ssl_verifypeer = FALSE turns off certificate verification.
h <- new_handle(ssl_verifypeer = FALSE)
str_page <- rawToChar(curl_fetch_memory(url, h)$content)
xml_page <- read_html(str_page)

# The data sits in the <script> blocks; split their text into JavaScript statements.
txt <- xml_text(xml_find_all(xml_page, "//script"))
txt <- unlist(strsplit(txt, ";", fixed = TRUE))
str(as.list(txt))

# Helper: drop double quotes and surrounding whitespace.
clean <- function(x) trimws(gsub('"', "", x))

# Column names come from the vnctab array.
cnames <- txt[grep("vnctab\\s*=", txt)]
cnames <- gsub("(^.*?\\[|\\]\\s*$)", "", cnames)
cnames <- clean(unlist(strsplit(cnames, ",")))

# Table rows come from the vdatrep array of arrays.
tab <- txt[grep("vdatrep\\s*=", txt)]
substr(tab, 1, 1000)                      # inspect the start of the statement
substr(tab, nchar(tab)-1000, nchar(tab))  # and its end
tab <- gsub("^.*?\\[\\s*\\[", "", tab)    # strip everything up to the opening "[["
tab <- gsub("\\],*\\s*\\]$", "", tab)     # strip the closing "]]"
tab_rows <- unlist(strsplit(tab, "\\]\\s*,*\\s*\\["))
tab <- strsplit(tab_rows, ",")

# The first two columns are text; the remaining ones are numeric.
M <- do.call(rbind, lapply(tab, clean))
d1 <- as.data.frame(M[,1:2], stringsAsFactors = FALSE)
d2 <- as.data.frame(apply(M[,-(1:2)], 2, as.double), stringsAsFactors = FALSE)
d <-  cbind(d1, d2)
dim(d); length(cnames)
colnames(d) <- cnames
sapply(d, class)
str(d)
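To see what the parsing steps do without hitting the network, here is a small base R sketch on a made-up script string. Only the variable names vnctab and vdatrep come from the answer; the values are invented, and the regular expressions are slightly simplified variants (run with perl = TRUE) rather than the exact ones above.

```r
# Made-up stand-in for the page's <script> text. Only the variable names
# vnctab and vdatrep come from the answer; the values are invented.
js <- paste0(
  'var vnctab = ["Codigo","Estatus","Hora","Limite (MW)"];',
  'var vdatrep = [["A1","ECO",1,35],["A1","ECO",2,20]];'
)

clean <- function(x) trimws(gsub('"', "", x))
txt <- unlist(strsplit(js, ";", fixed = TRUE))

# Column names: keep what sits between the outer brackets, then split on commas.
cnames <- grep("vnctab\\s*=", txt, value = TRUE)
cnames <- gsub("^.*?\\[|\\]\\s*$", "", cnames, perl = TRUE)
cnames <- clean(strsplit(cnames, ",", fixed = TRUE)[[1]])

# Rows: strip the outer "[[" and "]]", split on "],[" and then on commas.
tab <- grep("vdatrep\\s*=", txt, value = TRUE)
tab <- gsub("^.*?\\[\\s*\\[|\\]\\s*\\]\\s*$", "", tab, perl = TRUE)
rows <- strsplit(strsplit(tab, "\\]\\s*,\\s*\\[")[[1]], ",", fixed = TRUE)

d <- as.data.frame(do.call(rbind, lapply(rows, clean)), stringsAsFactors = FALSE)
d[-(1:2)] <- lapply(d[-(1:2)], as.double)  # everything after column 2 is numeric
names(d) <- cnames
str(d)
```

Splitting on commas is the fragile part: it breaks if any quoted value itself contains a comma, so it is worth comparing the number of columns with length(cnames), as the answer does.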

【Comments】: