【Question title】: HTML table scraping in R
【Posted】: 2019-03-01 18:07:31
【Question】:

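I am trying to scrape the table at the following URL: https://www.cenace.gob.mx/DocsMEM/OpeMdo/OfertaCompVent/OferVenta/MDA/Termicas/OfeVtaTermicaHor%20SIN%20MDA%20Hor%202018-12-31%20v2019%2003%2001_01%2000%2001.html

The problem is that the table is not a plain HTML table, so html_table() does not work.

So far I have tried extracting the table node, but it returns nothing: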

library(rvest)

url <- "https://www.cenace.gob.mx/DocsMEM/OpeMdo/OfertaCompVent/OferVenta/MDA/Termicas/OfeVtaTermicaHor%20BCS%20MDA%20Hor%202018-12-26%20v2019%2002%2024_01%2000%2001.html"
webpage <- read_html(url)
table_html <- html_nodes(webpage, 'table#Tabc')
table <- html_table(table_html)

【Comments】:

    Tags: html r web web-scraping


    【Solution 1】:

    The issue here is that the page is rendered via JavaScript, so rvest on its own will not work. One of the simplest workarounds is to use a headless web browser. We can use PhantomJS.
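
To illustrate why a static parser comes back empty-handed, here is a minimal sketch. It assumes the server sends an empty `table#Tabc` shell and keeps the rows in a JavaScript array. The HTML string and its values are hypothetical stand-ins, not the real page. The array name `vdatrep` is borrowed from Solution 2.

```r
library(xml2)

# Hypothetical stand-in for the static HTML the server sends: the table shell
# is empty and the rows live in a JavaScript array.
static_html <- '<html><body>
  <table id="Tabc"></table>
  <script>var vdatrep = [["UNIT1", "ECO", 1]];</script>
</body></html>'

page <- read_html(static_html)

# The table node exists, but it has no rows for a table parser to read.
table_node <- xml_find_all(page, "//table[@id='Tabc']")
row_nodes  <- xml_find_all(page, "//table[@id='Tabc']//tr")

length(table_node)
length(row_nodes)

# The data is only visible as raw script text until a browser executes it.
script_text <- xml_text(xml_find_all(page, "//script"))
has_data <- grepl("vdatrep", script_text, fixed = TRUE)
has_data
```

A headless browser runs the script first, so the saved HTML contains the filled-in rows.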

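    First, download the appropriate version of PhantomJS and put the executable in your working directory (this assumes Windows). That is, phantomjs.exe should sit in the working directory of your R script.

    Create a scrape.js file: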

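    // scrape.js
    
    var webPage = require('webpage');
    var page = webPage.create();
    
    var fs = require('fs');
    var path = 'page.html';
    
    page.open('https://www.cenace.gob.mx/DocsMEM/OpeMdo/OfertaCompVent/OferVenta/MDA/Termicas/OfeVtaTermicaHor%20BCS%20MDA%20Hor%202018-12-26%20v2019%2002%2024_01%2000%2001.html', function (status) {
      // Exit with a non-zero code if the page failed to load.
      if (status !== 'success') {
        console.log('Failed to load page');
        phantom.exit(1);
        return;
      }
      // page.content is the current page source, including script-inserted rows.
      var content = page.content;
      fs.write(path, content, 'w');
      phantom.exit();
    });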
    

    Once run, this scrape.js script creates a page.html file in your working directory. Back in R/RStudio, you can do the following:

    library(tidyverse)
    library(rvest)
    
    # Run scrape.js with PhantomJS to create the file page.html
    system("./phantomjs scrape.js")
    
    # Now we should be in business as usual:
    read_html('page.html') %>%
      html_nodes("table#Tabc") %>%
      html_table(header = TRUE) %>%
      .[[1]] %>%
      as_tibble()
    
    # A tibble: 504 x 38
       Codigo `Estatus asigna~  Hora `Limite de desp~ `Limite de desp~ `Costo de Opera~ `Bloque de Pote~ `Costo Incremen~ `Bloque de Pote~
       <chr>  <chr>            <int>            <dbl>            <dbl>            <dbl>            <dbl>            <dbl>            <dbl>
     1 BTY5W~ ECO                  1               35               20           43212.              1.5            1762.              1.5
     2 BTY5W~ ECO                  2               35               20           43212.              1.5            1762.              1.5
     3 BTY5W~ ECO                  3               35               20           43212.              1.5            1762.              1.5
     4 BTY5W~ ECO                  4               35               20           43212.              1.5            1762.              1.5
     5 BTY5W~ ECO                  5               35               20           43212.              1.5            1762.              1.5
     6 BTY5W~ ECO                  6               35               20           43212.              1.5            1762.              1.5
     7 BTY5W~ ECO                  7               35               20           43212.              1.5            1762.              1.5
     8 BTY5W~ ECO                  8               35               20           43212.              1.5            1762.              1.5
     9 BTY5W~ ECO                  9               35               20           43212.              1.5            1762.              1.5
    10 BTY5W~ ECO                 10               35               20           43212.              1.5            1762.              1.5
    # ... with 494 more rows, and 29 more variables: `Costo Incremental de generacion Bloque 02 ($/MWh)` <dbl>, `Bloque de Potencia 03 (MW)` <dbl>,
    #   `Costo Incremental de generacion Bloque 03 ($/MWh)` <dbl>, `Bloque de Potencia 04 (MW)` <dbl>, `Costo Incremental de generacion Bloque 04
    #   ($/MWh)` <dbl>, `Bloque de Potencia 05 (MW)` <dbl>, `Costo Incremental de generacion Bloque 05 ($/MWh)` <dbl>, `Bloque de Potencia 06
    #   (MW)` <dbl>, `Costo Incremental de generacion Bloque 06 ($/MWh)` <dbl>, `Bloque de Potencia 07 (MW)` <dbl>, `Costo Incremental de generacion
    #   Bloque 07 ($/MWh)` <dbl>, `Bloque de Potencia 08 (MW)` <dbl>, `Costo Incremental de generacion Bloque 08 ($/MWh)` <dbl>, `Bloque de Potencia
    #   09 (MW)` <dbl>, `Costo Incremental de generacion Bloque 09 ($/MWh)` <dbl>, `Bloque de Potencia 10 (MW)` <dbl>, `Costo Incremental de
    #   generacion Bloque 10 ($/MWh)` <dbl>, `Bloque de Potencia 11 (MW)` <dbl>, `Costo Incremental de generacion Bloque 11 ($/MWh)` <dbl>, `Reserva
    #   rodante 10 min (MW)` <dbl>, `Costo Reserva rodante 10 min ($/MW)` <dbl>, `Reserva no rodante 10 min (MW)` <dbl>, `Costo Reserva no rodante 10
    #   min ($/MW)` <dbl>, `Reserva rodante suplementaria (MW)` <dbl>, `Costo Reserva rodante suplementaria ($/MW)` <dbl>, `Reserva no rodante
    #   suplementaria (MW)` <dbl>, `Costo Reserva no rodante suplementaria ($/MW)` <dbl>, `Reserva regulacion secundaria (MW)` <dbl>, `Costo Reserva
    #   regulacion secundaria ($/MW` <dbl>
    

    Update: extending to multiple URLs

    First, change the scrape.js file to accept arguments:

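    // scrape2.js
    
    var webPage = require('webpage');
    var page = webPage.create();
    var system = require('system');
    var args = system.args;
    
    var fs = require('fs');
    // args[0] is the script name, args[1] the URL, args[2] the output file.
    var path = args[2];
    
    page.open(args[1], function (status) {
      // Exit with a non-zero code if the page failed to load.
      if (status !== 'success') {
        console.log('Failed to load ' + args[1]);
        phantom.exit(1);
        return;
      }
      var content = page.content;
      fs.write(path, content, 'w');
      phantom.exit();
    });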
    

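    Next, create the lists to loop/iterate/map over (obviously this could be cleaned up or abstracted to be more maintainable and to require less typing):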

    urls <- list(
      'https://www.cenace.gob.mx/DocsMEM/OpeMdo/OfertaCompVent/OferVenta/MDA/Termicas/OfeVtaTermicaHor%20BCS%20MDA%20Hor%202018-12-26%20v2019%2002%2024_01%2000%2001.html',
      'https://www.cenace.gob.mx/DocsMEM/OpeMdo/OfertaCompVent/OferVenta/MDA/Termicas/OfeVtaTermicaHor%20SIN%20MDA%20Hor%202018-12-29%20v2019%2002%2027_01%2000%2001.html',
      'https://www.cenace.gob.mx/DocsMEM/OpeMdo/OfertaCompVent/OferVenta/MDA/Termicas/OfeVtaTermicaHor%20SIN%20MDA%20Hor%202018-12-29%20v2019%2002%2027_01%2000%2001.html'
    )
    
    paths <- list(
      'page1.html',
      'page2.html',
      'page3.html'
    )
    
    args_list <- map2(urls, paths, paste)
    
    # We are only using this function for the file creation side-effects,
    # so we can use walk instead of map. 
    # This creates the files: page1.html, page2.html, and page3.html 
    walk(args_list, ~ system(paste("./phantomjs scrape2.js", .)))
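
As one possible way to cut down the typing mentioned above, the output file names can be derived from the number of URLs and the command strings built in a single vectorized call. This is a sketch in base R, not part of the original answer: `build_commands` is a hypothetical helper and the example.com URLs are placeholders. It only builds the strings and does not run PhantomJS.

```r
# Hypothetical inputs; substitute the real report URLs.
urls <- c(
  "https://example.com/report%20one.html",
  "https://example.com/report%20two.html"
)

# Derive one output file per URL instead of typing the names by hand.
paths <- sprintf("page%d.html", seq_along(urls))

# Hypothetical helper: returns one shell command per URL/path pair.
build_commands <- function(urls, paths,
                           exe = "./phantomjs", script = "scrape2.js") {
  stopifnot(length(urls) == length(paths))
  # shQuote() quotes each argument so the shell passes it through unchanged.
  paste(exe, script, shQuote(urls), shQuote(paths))
}

cmds <- build_commands(urls, paths)
print(cmds)

# To actually run them (requires PhantomJS and scrape2.js to be present):
# for (cmd in cmds) system(cmd)
```

Quoting matters less for these particular URLs, which contain no spaces, but it keeps the command safe if a URL ever includes characters such as `&`.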
    

    At this point, you may want to wrap the scraping step in a function:

    read_page <- function(page) {
      read_html(page) %>%
        html_nodes("table#Tabc") %>%
        html_table(header = TRUE) %>%
        .[[1]] %>%
        as_tibble()
    }
    

    From there, you can reuse the paths list to map your new function over each file:

    paths %>%
      map(~ read_page(.)) %>%
      bind_rows()
    
    # A tibble: 9,000 x 38
       Codigo `Estatus asigna~  Hora `Limite de desp~ `Limite de desp~ `Costo de Opera~ `Bloque de Pote~ `Costo Incremen~ `Bloque de Pote~
       <chr>  <chr>            <int>            <dbl>            <dbl>            <dbl>            <dbl>            <dbl>            <dbl>
     1 BTY5W~ ECO                  1               35               20           43212.              1.5            1762.              1.5
     2 BTY5W~ ECO                  2               35               20           43212.              1.5            1762.              1.5
     3 BTY5W~ ECO                  3               35               20           43212.              1.5            1762.              1.5
     4 BTY5W~ ECO                  4               35               20           43212.              1.5            1762.              1.5
     5 BTY5W~ ECO                  5               35               20           43212.              1.5            1762.              1.5
     6 BTY5W~ ECO                  6               35               20           43212.              1.5            1762.              1.5
     7 BTY5W~ ECO                  7               35               20           43212.              1.5            1762.              1.5
     8 BTY5W~ ECO                  8               35               20           43212.              1.5            1762.              1.5
     9 BTY5W~ ECO                  9               35               20           43212.              1.5            1762.              1.5
    10 BTY5W~ ECO                 10               35               20           43212.              1.5            1762.              1.5
    # ... with 8,990 more rows, and 29 more variables: `Costo Incremental de generacion Bloque 02 ($/MWh)` <dbl>, `Bloque de Potencia 03 (MW)` <dbl>,
    #   `Costo Incremental de generacion Bloque 03 ($/MWh)` <dbl>, `Bloque de Potencia 04 (MW)` <dbl>, `Costo Incremental de generacion Bloque 04
    #   ($/MWh)` <dbl>, `Bloque de Potencia 05 (MW)` <dbl>, `Costo Incremental de generacion Bloque 05 ($/MWh)` <dbl>, `Bloque de Potencia 06
    #   (MW)` <dbl>, `Costo Incremental de generacion Bloque 06 ($/MWh)` <dbl>, `Bloque de Potencia 07 (MW)` <dbl>, `Costo Incremental de generacion
    #   Bloque 07 ($/MWh)` <dbl>, `Bloque de Potencia 08 (MW)` <dbl>, `Costo Incremental de generacion Bloque 08 ($/MWh)` <dbl>, `Bloque de Potencia
    #   09 (MW)` <dbl>, `Costo Incremental de generacion Bloque 09 ($/MWh)` <dbl>, `Bloque de Potencia 10 (MW)` <dbl>, `Costo Incremental de
    #   generacion Bloque 10 ($/MWh)` <dbl>, `Bloque de Potencia 11 (MW)` <dbl>, `Costo Incremental de generacion Bloque 11 ($/MWh)` <dbl>, `Reserva
    #   rodante 10 min (MW)` <dbl>, `Costo Reserva rodante 10 min ($/MW)` <dbl>, `Reserva no rodante 10 min (MW)` <dbl>, `Costo Reserva no rodante 10
    #   min ($/MW)` <dbl>, `Reserva rodante suplementaria (MW)` <dbl>, `Costo Reserva rodante suplementaria ($/MW)` <dbl>, `Reserva no rodante
    #   suplementaria (MW)` <dbl>, `Costo Reserva no rodante suplementaria ($/MW)` <dbl>, `Reserva regulacion secundaria (MW)` <dbl>, `Costo Reserva
    #   regulacion secundaria ($/MW` <dbl>
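
Once several files are stacked, it can be useful to know which file each row came from. One option, sketched below, is to name the list by file path and let `bind_rows()` record that name in an id column. `fake_read_page` is a hypothetical stand-in for `read_page()` so the example runs without the scraped files, and the column name `source` is arbitrary.

```r
library(purrr)
library(dplyr)

paths <- c("page1.html", "page2.html")

# Hypothetical stand-in for read_page(): returns a tiny tibble per file.
fake_read_page <- function(path) {
  tibble(Codigo = c("A", "B"), Hora = 1:2)
}

combined <- paths %>%
  set_names() %>%              # name each element by its own file path
  map(fake_read_page) %>%      # one tibble per file
  bind_rows(.id = "source")    # stack them, storing the name in `source`

combined
```

With the real function, replacing `fake_read_page` with `read_page` should give the same shape plus the report columns.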
    

    【Discussion】:

    【Solution 2】:

    The following is not very elegant, but it should work!

    library(curl)
    library(xml2)
    
    url <- "https://www.cenace.gob.mx/DocsMEM/OpeMdo/OfertaCompVent/OferVenta/MDA/Termicas/OfeVtaTermicaHor%20BCS%20MDA%20Hor%202018-12-26%20v2019%2002%2024_01%2000%2001.html"
    
    # Fetch the raw page. Note that ssl_verifypeer = FALSE turns off
    # certificate verification.
    h <- new_handle(ssl_verifypeer = FALSE)
    str_page <- rawToChar(curl_fetch_memory(url, h)$content)
    xml_page <- read_html(str_page)
    
    # The data lives in the <script> text; split it into JavaScript statements.
    txt <- xml_text(xml_find_all(xml_page, "//script"))
    txt <- unlist(strsplit(txt, ";", fixed = TRUE))
    str(as.list(txt))
    
    # Strip double quotes and surrounding whitespace from each cell.
    clean <- function(x) trimws(gsub('"', "", x))
    
    # Column names come from the `vnctab` array.
    cnames <- txt[grep("vnctab\\s*=", txt)]
    cnames <- gsub("(^.*?\\[|\\]\\s*$)", "", cnames)
    cnames <- clean(unlist(strsplit(cnames, ",")))
    
    # Table rows come from the `vdatrep` array of arrays.
    tab <- txt[grep("vdatrep\\s*=", txt)]
    substr(tab, 1, 1000)                      # inspect the start of the statement
    substr(tab, nchar(tab)-1000, nchar(tab))  # inspect the end of the statement
    tab <- gsub("^.*?\\[\\s*\\[", "", tab)    # drop everything up to the first "[["
    tab <- gsub("\\],*\\s*\\]$", "", tab)     # drop the closing "]]"
    tab_rows <- unlist(strsplit(tab, "\\]\\s*,*\\s*\\["))
    tab <- strsplit(tab_rows, ",")
    
    # Build a character matrix; keep the first two columns as text and
    # convert the rest to numeric.
    M <- do.call(rbind, lapply(tab, clean))
    d1 <- as.data.frame(M[,1:2], stringsAsFactors = FALSE)
    d2 <- as.data.frame(apply(M[,-(1:2)], 2, as.double), stringsAsFactors = FALSE)
    d <-  cbind(d1, d2)
    dim(d); length(cnames)
    colnames(d) <- cnames
    sapply(d, class)
    str(d)
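
The parsing steps above are easier to follow on a small input. The sketch below applies the same regular expressions to a hypothetical one-line script string. The array names `vnctab` and `vdatrep` come from the code above; the values and the three-column layout are invented for illustration. Here `perl = TRUE` is passed so the non-greedy `.*?` is handled by the PCRE engine.

```r
# Hypothetical miniature of the page's script text.
js <- 'var vnctab = ["Codigo", "Hora", "Limite"]; var vdatrep = [["A1", 1, 35.5], ["B2", 2, 20]];'

statements <- unlist(strsplit(js, ";", fixed = TRUE))
clean <- function(x) trimws(gsub('"', "", x))

# Header: keep what sits between the outer brackets, then split on commas.
header_stmt <- statements[grep("vnctab\\s*=", statements)]
cnames <- gsub("^.*?\\[|\\]\\s*$", "", header_stmt, perl = TRUE)
cnames <- clean(unlist(strsplit(cnames, ",")))

# Data: strip the outer "[[" and "]]", split rows on "],[", then split cells.
data_stmt <- statements[grep("vdatrep\\s*=", statements)]
body <- gsub("^.*?\\[\\s*\\[", "", data_stmt, perl = TRUE)
body <- gsub("\\],*\\s*\\]$", "", body, perl = TRUE)
rows <- unlist(strsplit(body, "\\]\\s*,*\\s*\\[", perl = TRUE))
cells <- lapply(strsplit(rows, ","), clean)

# Assemble: the first column stays text, the rest become numeric.
M <- do.call(rbind, cells)
d <- data.frame(M[, 1],
                apply(M[, -1, drop = FALSE], 2, as.double),
                stringsAsFactors = FALSE)
colnames(d) <- cnames
str(d)
```

Like the original, this splits cells on every comma, so it would break if a text cell itself contained a comma.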
    

    【Discussion】:
