【Question title】: R web scraping: RCurl and httr content
【Posted】: 2018-06-01 20:20:17
【Question】:

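I'm learning a bit about web scraping and I have a question about two packages, httr and RCurl. I'm trying to get the ISSN of a journal from the ResearchGate site, but I've run into a problem. When I fetch the page with each package and extract the content, the RCurl version gives me the ISSN, while the httr version returns NULL. Can anyone tell me why? As far as I can tell, both requests succeed. The code follows.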

library(rvest)
library(httr)
library(RCurl)
library(stringr)  # provides str_to_title() and str_split() used below

url <- "https://www.researchgate.net/journal/0730-0301_Acm_Transactions_On_Graphics"

########
# httr #
########

conexao <- GET(url)
conexao_status <- http_status(conexao)
conexao_status

content(conexao, as = "text", encoding = "utf-8") %>% read_html() -> webpage1

ISSN <- webpage1 %>%
  html_nodes(xpath = '//*/div/div[2]/div[1]/div[1]/table[2]/tbody/tr[7]/td') %>%
  html_text %>%
  str_to_title() %>%
  str_split(" ") %>%
  unlist
ISSN

########
# RCurl #
########

options(RCurlOptions = list(verbose = FALSE, 
                            capath = system.file("CurlSSL", "cacert.pem", package = "RCurl"), 
                            ssl.verifypeer = FALSE))

webpage <- getURLContent(url) %>% read_html()

ISSN <- webpage %>%
  html_nodes(xpath = '//*/div/div[2]/div[1]/div[1]/table[2]/tbody/tr[7]/td') %>%
  html_text %>%
  str_to_title() %>%
  str_split(" ") %>%
  unlist
ISSN

sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=Portuguese_Brazil.1252 LC_CTYPE=Portuguese_Brazil.1252 LC_MONETARY=Portuguese_Brazil.1252
[4] LC_NUMERIC=C LC_TIME=Portuguese_Brazil.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] testit_0.7 dplyr_0.7.4 progress_1.1.2 readxl_1.1.0 stringr_1.3.0 RCurl_1.95-4.10 bitops_1.0-6
[8] httr_1.3.1 rvest_0.3.2 xml2_1.2.0 jsonlite_1.5

loaded via a namespace (and not attached):
[1] Rcpp_0.12.16 bindr_0.1.1 magrittr_1.5 R6_2.2.2 rlang_0.2.0 tools_3.5.0
[7] yaml_2.1.19 assertthat_0.2.0 tibble_1.4.2 bindrcpp_0.2.2 curl_3.2 glue_1.2.0
[13] stringi_1.1.7 pillar_1.2.2 compiler_3.5.0 cellranger_1.1.0 prettyunits_1.0.2 pkgconfig_2.0.1


    Tags: html web-scraping rvest rcurl httr


    【Solution 1】:

    The response that httr receives has a JSON content type, not HTML, so you can't use read_html() on it:

    > conexao
    Response [https://www.researchgate.net/journal/0730-0301_Acm_Transactions_On_Graphics]
    Date: 2018-06-02 03:15
    Status: 200
    Content-Type: application/json; charset=utf-8
    Size: 328 kB
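
    Since the parser has to match what the server actually sent, it can help to check the media type before parsing. The following is a minimal sketch, not part of the original answer: parse_body() is a hypothetical helper, and the inline sample strings stand in for a real response body so the example runs without network access.

    ```r
    library(jsonlite)
    library(xml2)

    # Hypothetical helper: pick a parser from the media type of the body.
    parse_body <- function(body, type) {
      switch(type,
        "application/json" = fromJSON(body),
        "text/html"        = read_html(body),
        stop("Unexpected content type: ", type)
      )
    }

    # Illustrative inputs only (not the real ResearchGate payload).
    json_sample <- '{"issn": "0730-0301"}'
    html_sample <- "<html><body><p>0730-0301</p></body></html>"

    parsed_json <- parse_body(json_sample, "application/json")
    parsed_html <- parse_body(html_sample, "text/html")

    parsed_json$issn
    class(parsed_html)
    ```

    With a real httr response, the two arguments would come from content(conexao, as = "text", encoding = "UTF-8") and http_type(conexao).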
    

    Use fromJSON() instead to extract the ISSN:

    library(jsonlite)
    result <- fromJSON(content(conexao, as = "text", encoding = "utf-8") )
    result$result$data$journalFullInfo$data$issn
    

    Result:

    > result$result$data$journalFullInfo$data$issn
    [1] "0730-0301"
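
    One possible explanation for the difference between the two packages, which this thread does not confirm, is content negotiation: httr sends an Accept header that lists application/json first by default, whereas RCurl does not. If the XPath approach from the question is preferred, a sketch like the one below asks for HTML explicitly. fetch_html() is a hypothetical helper, and whether the server honors the header is an assumption.

    ```r
    library(httr)
    library(xml2)

    # Request configuration that asks the server for HTML.
    html_only <- accept("text/html")

    # Hypothetical helper: fetch a page with httr and parse it as HTML.
    fetch_html <- function(url) {
      resp <- GET(url, html_only)
      stop_for_status(resp)  # fail early on 4xx/5xx responses
      if (http_type(resp) != "text/html") {
        stop("Server returned ", http_type(resp), " instead of text/html")
      }
      read_html(content(resp, as = "text", encoding = "UTF-8"))
    }

    # Only runs in an interactive session, since it needs network access.
    if (interactive()) {
      url <- "https://www.researchgate.net/journal/0730-0301_Acm_Transactions_On_Graphics"
      webpage1 <- fetch_html(url)
    }
    ```

    If the server still answers with JSON, the fromJSON() route above remains the simpler option.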
    
