【Question title】: How do I use R to download LEHD data from the website?
【Posted】: 2017-06-04 21:09:02
【Question】:

I would like to know how to download the LEHD files from their FTP site.

https://lehd.ces.census.gov/data/lodes/LODES7/

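I need to download several years of data, for both workplace and residence locations. The files are named systematically, and the technical documentation can be found here:

https://lehd.ces.census.gov/data/lodes/LODES7/LODESTechDoc7.2.pdf

S000 refers to all workforce segments. JT00 refers to all job types.

So a typical file name is ca_wac_S000_JT00_2008.csv.gz, in the "directory"/URL https://lehd.ces.census.gov/data/lodes/LODES7/ca/wac/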
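
The naming pattern above can be turned into a list of URLs for several years and for both the workplace (wac) and residence (rac) files. The following is a minimal sketch, not a tested downloader: `lodes_url` is a hypothetical helper name, and it assumes the rac files follow the same pattern as the wac example (the same pattern appears in one of the answers to this question).

```r
# Build LODES7 URLs from the pattern <state>_<type>_<segment>_<jobtype>_<year>.csv.gz.
# `lodes_url` is a hypothetical helper, not part of any package.
lodes_url <- function(state, type, year,
                      segment = "S000", job_type = "JT00",
                      base = "https://lehd.ces.census.gov/data/lodes/LODES7") {
  state <- tolower(state)
  file <- sprintf("%s_%s_%s_%s_%d.csv.gz",
                  state, type, segment, job_type, as.integer(year))
  paste(base, state, type, file, sep = "/")
}

# One URL per (type, year) combination for California.
grid <- expand.grid(type = c("wac", "rac"), year = c(2002, 2004, 2014),
                    stringsAsFactors = FALSE)
urls <- mapply(lodes_url, state = "ca", type = grid$type, year = grid$year,
               USE.NAMES = FALSE)
print(urls)
```

Building the URLs separately from downloading them makes it easy to inspect the list before any request is sent.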

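This bit of GitHub code seems relevant. The Harvard tutorial was also useful: it gave me a way to build a list of all the files. But I run into SSL problems and cannot get the actual download to work; RCurl hasn't worked for me.

The extended code does not seem to work either: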

install.packages("RCurl")
library(RCurl)
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))   
URL <- "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv"
x <- getURL(URL)
x
#the above code works.

#my implementation...fails
URL <- "https://lehd.ces.census.gov/data/lodes/LODES7/ca/wac/ca_wac_S000_JT00_2002.csv.gz"
x <- getURL(URL)
#results in following error:
#Error in function (type, msg, asError = TRUE)  : 
# error:14077410:SSL routines:SSL23_GET_SERVER_HELLO:sslv3 alert handshake failure

devtools::session_info()

Session info -------------------------------------------------------------
 setting  value
 version  R version 3.4.3 (2017-11-30)
 system   x86_64, mingw32
 ui       RStudio (1.1.383)
 language (EN)
 collate  English_United States.1252
 tz       America/Denver
 date     2017-12-17

Packages -----------------------------------------------------------------
 package    * version  date       source
 acs        * 2.1.2    2017-10-10 CRAN (R 3.4.3)
 assertthat   0.2.0    2017-04-11 CRAN (R 3.4.3)
 base       * 3.4.3    2017-12-06 local
 bindr        0.1      2016-11-13 CRAN (R 3.4.3)
 bindrcpp     0.2      2017-06-17 CRAN (R 3.4.3)
 class        7.3-14   2015-08-30 CRAN (R 3.4.3)
 classInt     0.1-24   2017-04-16 CRAN (R 3.4.3)
 compiler     3.4.3    2017-12-06 local
 curl       * 3.1      2017-12-12 CRAN (R 3.4.3)
 datasets   * 3.4.3    2017-12-06 local
 DBI          0.7      2017-06-18 CRAN (R 3.4.3)
 devtools   * 1.13.4   2017-11-09 CRAN (R 3.4.3)
 digest       0.6.13   2017-12-14 CRAN (R 3.4.3)
 dplyr      * 0.7.4    2017-09-28 CRAN (R 3.4.3)
 e1071        1.6-8    2017-02-02 CRAN (R 3.4.3)
 foreign      0.8-69   2017-06-22 CRAN (R 3.4.3)
 gdtools    * 0.1.6    2017-09-01 CRAN (R 3.4.3)
 git2r        0.19.0   2017-07-19 CRAN (R 3.4.3)
 glue         1.2.0    2017-10-29 CRAN (R 3.4.3)
 graphics   * 3.4.3    2017-12-06 local
 grDevices  * 3.4.3    2017-12-06 local
 grid         3.4.3    2017-12-06 local
 hms          0.4.0    2017-11-23 CRAN (R 3.4.3)
 httr         1.3.1    2017-08-20 CRAN (R 3.4.3)
 lattice      0.20-35  2017-03-25 CRAN (R 3.4.3)
 lodes      * 0.1.0    2017-12-17 git (@8cca008)
 magrittr     1.5      2014-11-22 CRAN (R 3.4.3)
 maptools     0.9-2    2017-03-25 CRAN (R 3.4.3)
 memoise      1.1.0    2017-04-21 CRAN (R 3.4.3)
 methods    * 3.4.3    2017-12-06 local
 pkgconfig    2.0.1    2017-03-21 CRAN (R 3.4.3)
 plyr         1.8.4    2016-06-08 CRAN (R 3.4.3)
 purrr        0.2.4    2017-10-18 CRAN (R 3.4.3)
 R6           2.2.2    2017-06-17 CRAN (R 3.4.3)
 rappdirs     0.3.1    2016-03-28 CRAN (R 3.4.3)
 Rcpp         0.12.14  2017-11-23 CRAN (R 3.4.3)
 readr        1.1.1    2017-05-16 CRAN (R 3.4.3)
 rgdal        1.2-16   2017-11-21 CRAN (R 3.4.3)
 rgeos        0.3-26   2017-10-31 CRAN (R 3.4.3)
 rlang        0.1.4    2017-11-05 CRAN (R 3.4.3)
 sf           0.5-5    2017-10-31 CRAN (R 3.4.3)
 sp         * 1.2-5    2017-06-29 CRAN (R 3.4.3)
 stats      * 3.4.3    2017-12-06 local
 stringi      1.1.6    2017-11-17 CRAN (R 3.4.2)
 stringr    * 1.2.0    2017-02-18 CRAN (R 3.4.3)
 tibble       1.3.4    2017-08-22 CRAN (R 3.4.3)
 tigris     * 0.5.3    2017-05-26 CRAN (R 3.4.3)
 tools        3.4.3    2017-12-06 local
 udunits2     0.13     2016-11-17 CRAN (R 3.4.1)
 units        0.4-6    2017-08-27 CRAN (R 3.4.3)
 utils      * 3.4.3    2017-12-06 local
 uuid         0.1-2    2015-07-28 CRAN (R 3.4.1)
 withr        2.1.0    2017-11-01 CRAN (R 3.4.3)
 XML        * 3.98-1.9 2017-06-19 CRAN (R 3.4.1)

【Comments】:

    Tags: r rcurl census


    【Answer 1】:

    If you can use a GitHub-installable package (it will be a little while before I get this onto CRAN), then you can try https://github.com/hrbrmstr/lodes

    devtools::install_git("https://github.com/hrbrmstr/lodes.git")
    
    library(lodes)
    library(dplyr)
    
    de <- read_lodes("de", "od", "aux", "JT00", "2006", "~/Data/lodes")
    
    glimpse(de)
    ## Observations: 68,284
    ## Variables: 13
    ## $ w_geocode  <dbl> 1.000104e+14, 1.000104e+14, 1.000104e+14, 1.000104e+14, 1.000104e+14, 1.000104e+14, 1.000104e+14...
    ## $ h_geocode  <chr> "240119550001006", "240119550001040", "240299501002080", "240299501003088", "240299503002017", "...
    ## $ S000       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
    ## $ SA01       <int> 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, ...
    ## $ SA02       <int> 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, ...
    ## $ SA03       <int> 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...
    ## $ SE01       <int> 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, ...
    ## $ SE02       <int> 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, ...
    ## $ SE03       <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, ...
    ## $ SI01       <int> 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, ...
    ## $ SI02       <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
    ## $ SI03       <int> 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, ...
    ## $ createdate <int> 20160228, 20160228, 20160228, 20160228, 20160228, 20160228, 20160228, 20160228, 20160228, 201602...
    

    It has functions to read and cache the crosswalk files as well as functions to read and cache individual data files.
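
The caching idea can be sketched in a few lines of base R: download a file only when it is not already present in a local cache directory. This is an illustration of the general technique, not the lodes package's own implementation; `cached_download` and its `fetch` argument are hypothetical names.

```r
# Download `url` into `cache_dir` unless a file with that name is already there.
# `cached_download` is a hypothetical helper; `fetch` defaults to base R's
# download.file() and can be swapped for another client.
cached_download <- function(url, cache_dir, fetch = utils::download.file) {
  dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)
  dest <- file.path(cache_dir, basename(url))
  if (!file.exists(dest)) {
    # mode = "wb" keeps the gzip bytes intact on Windows
    fetch(url, dest, mode = "wb")
  }
  dest  # local path, whether freshly downloaded or cached
}

# Example call (not run here because it needs network access):
# path <- cached_download(
#   "https://lehd.ces.census.gov/data/lodes/LODES7/ca/wac/ca_wac_S000_JT00_2002.csv.gz",
#   "~/Data/lodes"
# )
```

Passing the download function in as an argument makes it simple to try a different HTTP client if one of them fails the SSL handshake.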

    Let me know if you still get SSL failures and, if so, please add the output of devtools::session_info() or sessionInfo() to your question.
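
When reporting a handshake failure, it may also help to state which SSL/TLS library each HTTP client package was built against, since the session info lists package versions but not that. A small sketch, assuming the curl and/or RCurl packages may or may not be installed; `ssl_backends` is a hypothetical helper name.

```r
# Report the SSL/TLS library behind each installed HTTP client package.
# `ssl_backends` is a hypothetical helper; packages that are not installed,
# or that report no SSL library, yield NA.
ssl_backends <- function() {
  first_or_na <- function(x) {
    if (length(x) > 0) as.character(x)[1] else NA_character_
  }
  c(
    curl = if (requireNamespace("curl", quietly = TRUE)) {
      first_or_na(curl::curl_version()$ssl_version)
    } else {
      NA_character_
    },
    RCurl = if (requireNamespace("RCurl", quietly = TRUE)) {
      first_or_na(RCurl::curlVersion()$ssl_version)
    } else {
      NA_character_
    }
  )
}

print(ssl_backends())
```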

    【Comments】:

    • library(devtools) devtools::install_git("github.com/hrbrmstr/lodes.git") library(lodes) library(dplyr) getwd() setwd("N:/Dropbox/_BonesFirst/LODES") de
    【Answer 2】:

    I found a solution here. It is not perfect, because it loads the files into memory rather than saving them to disk. But it did work for me.

    years.to.download <- c(2002,2004,2014)
    options(scipen = 999) # Suppress scientific notation so we can see census geocodes
    library(plyr); library(dplyr)
    library(downloader) # downloads and then runs the source() function on scripts from github
    library(R.utils) # load the R.utils package (counts the number of lines in a file quickly)
    
    
    # Program start ----------------------------------------------------------------
    tf <- tempfile(); td <- tempdir() # Create a temporary file and a temporary directory
    # Load the download.cache and related functions
    # to prevent re-downloading of files once they've been downloaded.
    source_url(
      "https://raw.github.com/ajdamico/asdfree/master/Download%20Cache/download%20cache.R",
      prompt = FALSE,
      echo = FALSE
    )
    # Loop through and download each year specified by the user
    for(year in years.to.download) {
      cat("now loading", year, "...", '\n\r')
    #-----------Data import: workplace area characteristics---------------------
      # Data import: workplace area characteristics (i.e. job location data)
      # Download each year of data
      # Zipped file to the temporary file on your local disk
      # S000 references all workforce segments
      # JT00 references all job types
      download_cached(
        url = paste0("http://lehd.ces.census.gov/data/lodes/LODES7/ca/wac/ca_wac_S000_JT00_", year, ".csv.gz"),
        destfile = tf,
        mode = 'wb'
      )
    
    # Create a variable to store the wac file for each year
      assign(paste0("wac.", year), read.table(gzfile(tf), header = TRUE, sep = ",",
                                              colClasses = "numeric", stringsAsFactors = FALSE))
      # Remove the temporary file from the local disk
      file.remove(tf)
      # And free up RAM
      gc()
    
    #-----------Data import: residence area characteristics---------------------
      download_cached(
        url = paste0("http://lehd.ces.census.gov/data/lodes/LODES7/ca/rac/ca_rac_S000_JT00_", year, ".csv.gz"),
        destfile = tf,
        mode = 'wb'
      )
      # Create a variable to store the rac file for each year
      assign(paste0("rac.", year), read.table(gzfile(tf), header = TRUE, sep = ",",
                                              colClasses = "numeric", stringsAsFactors = FALSE))
      # Remove the temporary file from the local disk
      file.remove(tf)
      # And free up RAM
      gc()
    }
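
If the .csv.gz files are instead kept on disk, they can be read back into one named list rather than into separate variables created with assign(). The sketch below is only one possible approach; `read_lodes_csv` and `read_lodes_dir` are hypothetical helper names. It reads the geocode columns as character, because reading them with colClasses = "numeric" drops the leading zero of California block codes (state FIPS 06).

```r
# Read one LODES .csv.gz file, keeping *_geocode columns as character.
# read.csv() opens gzip-compressed files directly.
read_lodes_csv <- function(path) {
  # Peek at the header to find the geocode columns (w_geocode / h_geocode).
  header <- names(read.csv(path, nrows = 1))
  classes <- ifelse(grepl("geocode$", header), "character", NA)
  read.csv(path, colClasses = classes, stringsAsFactors = FALSE)
}

# Read every .csv.gz file in `dir` into a list named after the files.
read_lodes_dir <- function(dir) {
  paths <- list.files(dir, pattern = "\\.csv\\.gz$", full.names = TRUE)
  names(paths) <- sub("\\.csv\\.gz$", "", basename(paths))
  lapply(paths, read_lodes_csv)
}

# Example call (assumes files were saved to this directory beforehand):
# lodes <- read_lodes_dir("~/Data/lodes")
# str(lodes[["ca_wac_S000_JT00_2002"]])
```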
    

    【Comments】:
