【Question title】: xml2::read_html with proper character encoding crashes on Ubuntu
【Posted】: 2017-05-18 08:09:35
【Question】:

xml2::read_html crashes on Ubuntu (but not on Mac) when I try to use the correct character encoding.

   library(xml2)
   library(httr) 
   # GET webpage that is encoded using Big5 (Chinese)
   pg <- GET("http://chinesenews.net.au")
   # Identify encoding using rvest package function, which returns 
   # incorrect encoding as ISO-8859-1
   enc1 <- rvest::guess_encoding(httr::content(pg, "raw"))$encoding[1]
   # Workaround: identify the right encoding with stringi's detector,
   # taking the top-ranked candidate
   enc2 <- stringi::stri_enc_detect(httr::content(pg, "raw"))[[1]]$Encoding[1]
   # So far so good. 
   # Let's try to read_html with both encodings
   # Using the ISO-8859-1 encoding, there is no problem
   ht1 <- xml2::read_html(pg, encoding=enc1) # Reads, but characters are distorted
   # However, using the correct (Big5) encoding crashes on Ubuntu
   ht2 <- xml2::read_html(pg, encoding=enc2) 

The error is:

Error in doc_parse_raw(x, encoding = encoding, base_url = base_url, as_html = as_html, : basic_string::_M_replace_aux

Since the problem occurs on Ubuntu but not on Mac, I tried installing the latest version of the xml2 library:

devtools::install_github("hadley/xml2")

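There is still an error, although a different one:

Error in doc_parse_raw(x, encoding = encoding, base_url = base_url, as_html = as_html, : input conversion failed due to input error, bytes 0xFB 0x7C 0xB7 0x51 [6003]

I am not sure why passing the correct encoding makes libxml2 fail. Any ideas about what can be done?

Here is my Ubuntu sessionInfo():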

R version 3.2.3 (2015-12-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.1 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods  
[7] base     

other attached packages:
[1] httr_1.2.1   xml2_1.0.0   magrittr_1.5

loaded via a namespace (and not attached):
 [1] selectr_0.3-1   R6_2.2.0        tools_3.2.3     curl_2.3       
 [5] urltools_1.6.0  Rcpp_0.12.8     triebeard_0.3.0 stringi_1.1.2  
 [9] stringr_1.1.0   rvest_0.3.2     purrr_0.2.2   
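
The second error above says that the converter gave up on specific bytes of the page. One thing that might be worth trying, offered only as a sketch and not as a confirmed fix, is to decode the raw bytes to UTF-8 in R first and hand libxml2 the already-converted text. The helper name `read_html_via_utf8` and its `from` and `sub` parameters are hypothetical, introduced here for illustration; the demo uses a tiny hand-built Big5 page instead of the real site.

```r
library(xml2)

# Hypothetical helper (not part of xml2 or httr): decode raw bytes to UTF-8
# in R first, so that the HTML parser only receives UTF-8 input.
read_html_via_utf8 <- function(raw_bytes, from = "BIG5", sub = "?") {
  # iconv() accepts a list of raw vectors; `sub` replaces any byte that
  # cannot be converted instead of turning the whole result into NA.
  txt <- iconv(list(raw_bytes), from = from, to = "UTF-8", sub = sub)
  xml2::read_html(txt)
}

# Self-contained demo: the Big5 encoding of "中文" is A4 A4 A4 E5.
big5_page <- c(
  charToRaw("<html><body><p>"),
  as.raw(c(0xA4, 0xA4, 0xA4, 0xE5)),
  charToRaw("</p></body></html>")
)
doc <- read_html_via_utf8(big5_page)
p_text <- xml2::xml_text(xml2::xml_find_first(doc, "//p"))
```

With the objects from the question, the call would look like `read_html_via_utf8(httr::content(pg, "raw"), from = enc2)`. If the page actually uses a Big5 extension (an assumption I have not verified), passing a different `from` name supported by the local iconv might convert more of the text; any bytes that still fail are replaced by `sub` rather than aborting the parse.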

【Comments】:

    Tags: r ubuntu character-encoding rvest xml2


    【Answer 1】:

    Try encoding = "latin1"
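
A note on what this suggestion does, as a sketch in base R: decoding as latin1 cannot fail, because every byte value maps to some character, but Big5 text then comes out as mojibake (the same distortion the question reports for ISO-8859-1). Assuming the bytes are plain Big5, the wrong decoding can in principle be undone by converting back to the original bytes and decoding them again. All variable names below are illustrative, and the four bytes stand in for the page content.

```r
# Big5 bytes for "中文" (A4 A4 A4 E5), standing in for the page content.
big5_bytes <- as.raw(c(0xA4, 0xA4, 0xA4, 0xE5))

# Decoding as latin1 always succeeds, since every byte maps to a character,
# but the result is mojibake rather than Chinese text.
as_latin1 <- iconv(list(big5_bytes), from = "latin1", to = "UTF-8")

# Undo the wrong decoding: recover the original bytes, then decode as Big5.
recovered_bytes <- iconv(as_latin1, from = "UTF-8", to = "latin1",
                         toRaw = TRUE)[[1]]
as_big5 <- iconv(list(recovered_bytes), from = "BIG5", to = "UTF-8")
```

Whether this round trip works on text extracted from a document parsed with encoding = "latin1" depends on the page containing only bytes the local Big5 converter accepts, which the second error message in the question suggests may not hold for every byte.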

    【Comments】:
