[Question title]: How to read file.rar directly from a website in R
[Posted]: 2018-03-19 14:58:32
[Question]:

I want to download a file compressed in open-plaques-all-2017-06-19.rar, but I have not managed to do it in R. Please see my code below.

temp <- tempfile()

download.file("https://github.com/tuyenhavan/Statistics/blob/master/open-plaques-all-2017-06-19.rar", temp)

df<- fread(unzip(temp, files = "open-plaques-all-2017-06-19.csv"))
head(df)

[Comments]:

    Tags: r import zip data-extraction


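    [Solution 1]:

    You need libarchive installed through the package manager for your platform:

    • deb: libarchive-dev (Debian, Ubuntu, etc.)
    • rpm: libarchive-devel (Fedora, CentOS, RHEL)
    • csw: libarchive_dev (Solaris)
    • brew: libarchive (Mac OSX)

    For Windows users, pre-compiled binaries are downloaded automatically.

    Then run: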

    devtools::install_github("jimhester/archive") 
    

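    Here is a workflow. Note that the URL you specified is incorrect: it points to the GitHub HTML page, not to the archive. You need to use the "raw" URL to get the actual file.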

    library(archive)
    
    tf1 <- tempfile(fileext = ".rar")
    # mode = "wb" keeps the binary archive intact on Windows
    download.file("https://github.com/tuyenhavan/Statistics/blob/master/open-plaques-all-2017-06-19.rar?raw=true", tf1, mode = "wb")
    
    tf2 <- tempfile()
    archive_extract(tf1, tf2)
    
    list.files(tf2)
    ## [1] "open-plaques-all-2017-06-19.csv"
    
    file.size(file.path(tf2, list.files(tf2)))
    ## [1] 26942816
    
    xdf <- readr::read_csv(file.path(tf2, list.files(tf2)))
    dplyr::glimpse(xdf)
    ## Observations: 38,436
    ## Variables: 27
    ## $ id                     <int> 29923, 42945, 42944, 42943, 42942, 42941, 42940, ...
    ## $ title                  <chr> "Jon Pertwee blue plaque", "Apsley Cherry-Garrard...
    ## $ inscription            <chr> "Jon Pertwee 1919-1996 Doctor Who 1970-1974", "Ap...
    ## $ latitude               <dbl> NA, NA, NA, NA, NA, NA, 54.14910, 45.76330, NA, 4...
    ## $ longitude              <dbl> NA, NA, NA, NA, NA, NA, -4.46938, 4.83157, NA, 4....
    ## $ country                <chr> "United Kingdom", "United Kingdom", "United Kingd...
    ## $ area                   <chr> "London", "Bedford", "Harlow", "Bozen", "Adro", "...
    ## $ address                <chr> "BBC Television Centre", "Lansdowne Road", "The W...
    ## $ erected                <int> NA, NA, NA, NA, NA, 2016, NA, NA, NA, NA, NA, NA,...
    ## $ main_photo             <chr> NA, "https://commons.wikimedia.org/wiki/Special:F...
    ## $ colour                 <chr> "blue", "blue", "blue", "brass", "brass", "brass"...
    ## $ organisations          <chr> "[]", "[]", "[\"Harlow Civic Society\"]", "[\"Gun...
    ## $ language               <chr> "English", "English", "English", "Italian", "Ital...
    ## $ series                 <chr> NA, NA, NA, "Stolpersteine Italiano", "Stolperste...
    ## $ series_ref             <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
    ## $ `geolocated?`          <chr> "false", "false", "false", "false", "false", "fal...
    ## $ `photographed?`        <chr> "false", "true", "false", "true", "true", "true",...
    ## $ number_of_subjects     <int> 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0...
    ## $ lead_subject_name      <chr> "Jon Pertwee", "Apsley Cherry-Garrard", NA, NA, "...
    ## $ lead_subject_born_in   <int> 1919, 1886, NA, NA, 1911, 1913, NA, 1888, 1832, 1...
    ## $ lead_subject_died_in   <int> 1996, 1959, NA, NA, 1945, 1945, NA, 1967, 1898, 1...
    ## $ lead_subject_type      <chr> "man", "man", NA, NA, "man", "man", NA, "man", "m...
    ## $ lead_subject_roles     <chr> "[\"Doctor Who\", \"actor\", \"entertainer\", \"t...
    ## $ lead_subject_wikipedia <chr> "https://en.wikipedia.org/wiki/Jon_Pertwee", "htt...
    ## $ lead_subject_dbpedia   <chr> "http://dbpedia.org/resource/Jon_Pertwee", "http:...
    ## $ lead_subject_image     <chr> "https://commons.wikimedia.org/wiki/Special:FileP...
    ## $ subjects               <chr> "[\"Jon Pertwee|(1919-1996)|man|Doctor Who, actor...
    

    Consider unlink()ing tf1, copying the file from tf2 to a more permanent location, and then unlink()ing tf2 once the work is done.
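
The cleanup described above could be sketched roughly as follows. `keep_and_clean()` and its arguments are hypothetical names introduced only for illustration, and the sketch assumes the archive extracts to a flat set of files (no subdirectories). Note that `unlink()` needs `recursive = TRUE` to remove a directory.

```r
# Hypothetical helper (not part of the archive package): copy the extracted
# files to a permanent directory, then remove the temporary archive and the
# temporary extraction directory.
keep_and_clean <- function(archive_path, extract_dir, dest_dir) {
  dir.create(dest_dir, recursive = TRUE, showWarnings = FALSE)

  # Assumes a flat archive: file.copy() does not copy subdirectories here.
  extracted <- list.files(extract_dir, full.names = TRUE)
  copied <- file.copy(extracted, dest_dir, overwrite = TRUE)
  if (!all(copied)) stop("some extracted files could not be copied")

  unlink(archive_path)                   # the downloaded .rar
  unlink(extract_dir, recursive = TRUE)  # a directory needs recursive = TRUE

  # Return the new, permanent paths of the copied files.
  file.path(dest_dir, basename(extracted))
}
```

With the objects from the workflow above, a call might look like `keep_and_clean(tf1, tf2, "data")`, where `"data"` is an assumed destination directory.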

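    [Comments]:

      [Solution 2]:

      I don't know whether there is an R library for extracting RAR archives, but if you have unrar, unar, p7zip or something similar installed, you can call them through system calls and have them extract the file.
      You also need to append ?raw=true to the end of the URL to get the raw data (rather than the HTML code).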
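
One way to tell whether a download returned the archive or an HTML page is to look at the first bytes of the file. The sketch below relies on the fact that RAR files begin with the bytes `Rar!` followed by 0x1A 0x07; `is_rar()` is a hypothetical helper name, not a function from any package.

```r
# Hypothetical helper: TRUE if the file starts with the RAR signature
# ("Rar!" followed by 0x1A 0x07), FALSE otherwise (e.g. for an HTML page).
is_rar <- function(path) {
  magic <- as.raw(c(0x52, 0x61, 0x72, 0x21, 0x1a, 0x07))
  header <- readBin(path, what = "raw", n = length(magic))
  identical(header, magic)
}
```

Calling `is_rar(temp)` right after `download.file()` should return `FALSE` if the URL without `?raw=true` was used, since the file would then contain HTML.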

      This uses p7zip and unar on a Mac; other utilities and systems may need different syntax.

      temp <- tempfile()
      
      # mode = "wb" is not needed on a Mac but avoids a corrupted file on Windows
      download.file(paste0("https://github.com/tuyenhavan/Statistics/blob/master/", 
                          "open-plaques-all-2017-06-19.rar?raw=true"), temp, mode = "wb")
      
      #list all csv-files in current working directory
      csv_files <- list.files(pattern="\\.csv")
      
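      #extract RAR to current working directory using p7zip
      #(paths are quoted so that spaces in them do not break the command)
      system(paste("7z x", shQuote(temp), shQuote(paste0("-o", getwd()))))
      
      #alternatively, extract RAR to current working directory using unar
      #(only one of the two extraction calls is needed)
      system(paste("unar", "-f", "-o", shQuote(getwd()), shQuote(temp)))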
      
      #find the name of the extracted csv file
      csv_new <- setdiff(list.files(pattern="\\.csv"), csv_files)
      
      #read in the csv as a data.frame
      csv.dtf <- read.csv(csv_new)
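
Since which extraction utility is installed varies by system, it may help to check before calling `system()`. The following is a minimal sketch using base R's `Sys.which()`; `find_extractor()` and its default candidate list are assumptions made for illustration.

```r
# Hypothetical helper: return the name of the first extraction utility found
# on the PATH, or NA if none of the candidates is installed.
find_extractor <- function(candidates = c("7z", "unar", "unrar")) {
  paths <- Sys.which(candidates)  # "" for commands that are not found
  found <- paths[!is.na(paths) & nzchar(paths)]
  if (length(found) == 0) return(NA_character_)
  names(found)[1]
}
```

The result could then be used to choose between the `7z` and `unar` commands shown above.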
      

      You can also read the csv in directly, without writing it to disk, but this is slower.

      csv <- system(paste("7z x -so", shQuote(temp)), intern=TRUE)
      csv.dtf <- read.csv(text=csv)
      

      [Comments]:
