【Question title】: Send SQL string through POST with httr package in R
【Posted】: 2018-10-14 10:34:03
【Question】:

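I am trying to download a file from https://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=289. The form on that page generates a POST request that is submitted to their server, which creates a temporary file stored under https://transtats.bts.gov/ftproot/TranStatsData/

In the form data I can see the following: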

UserTableName: DB1BCoupon
DBShortName: 
RawDataTable: T_DB1B_COUPON
sqlstr: +SELECT+ORIGIN_AIRPORT_ID%2CORIGIN_AIRPORT_SEQ_ID%2CORIGIN_CITY_MARKET_ID%2CDEST_AIRPORT_ID%2CDEST_AIRPORT_SEQ_ID%2CDEST_CITY_MARKET_ID+FROM++T_DB1B_COUPON+WHERE+Quarter+%3D1+AND+YEAR%3D2017
varlist: ORIGIN_AIRPORT_ID%2CORIGIN_AIRPORT_SEQ_ID%2CORIGIN_CITY_MARKET_ID%2CDEST_AIRPORT_ID%2CDEST_AIRPORT_SEQ_ID%2CDEST_CITY_MARKET_ID

Based on the above, and using the httr package, I have been trying the following:

library(httr)

web <- "https://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=289"

POST(web, body = "+SELECT+ORIGIN_AIRPORT_ID%2CORIGIN_AIRPORT_SEQ_ID%2CORIGIN_CITY_MARKET_ID%2CDEST_AIRPORT_ID%2CDEST_AIRPORT_SEQ_ID%2CDEST_CITY_MARKET_ID+FROM++T_DB1B_COUPON+WHERE+Quarter+%3D1+AND+YEAR%3D2017", encode = "form")

I expect to get back a response header containing the following:

Location: https://transtats.bts.gov/ftproot/TranStatsData/847324776_T_DB1B_COUPON.zip

However, for some reason I can't seem to get it. I'm sure my POST code is wrong, but I'm not sure where or what I'm doing wrong.

【Comments】:

    Tags: r post web-scraping httr


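    【Solution 1】:

    Better late than never (for an answer)?

    The POST is fairly complex, and after processing it the site redirects to a separate GET that retrieves the ZIP content.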
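Because the site answers the POST with a redirect, the `Location` header the question asks about lives in an intermediate hop rather than in the final response. The following is a minimal sketch, assuming the response was produced by httr (which records each hop in `res$all_headers`); `redirect_locations` is a hypothetical helper name, not part of httr.

```r
# Hypothetical helper: collect every Location header from a redirect chain.
# `all_headers` is expected to look like httr's `res$all_headers`: a list with
# one element per hop, each carrying a `headers` list.
redirect_locations <- function(all_headers) {
  locs <- lapply(all_headers, function(hop) {
    hdrs <- unclass(hop$headers)
    # Header names are case-insensitive, so normalise before looking one up.
    names(hdrs) <- tolower(names(hdrs))
    hdrs[["location"]]  # NULL when this hop did not redirect
  })
  unlist(locs)
}
```

With a real response this would be called as `redirect_locations(res$all_headers)`; hops that carry no `Location` header are simply dropped from the result.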

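    Open the Network tab of your browser's developer tools, submit the form, then right-click the POST line and choose "Copy as cURL". Do not touch the clipboard after that; use curlconverter to turn it into an R function: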

    library(curlconverter)
    
    straighten() %>% make_req() -> tmp # it automagically uses the clipboard contents
    

    Press "paste" in your OS and you will get a longer version of this:

    httr::POST(
      url = "https://www.transtats.bts.gov/DownLoad_Table.asp",
      httr::add_headers(
        Referer = "https://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=289"
      ),
      body = list(
        UserTableName = "DB1BCoupon",
        DBShortName = "", 
        RawDataTable = "T_DB1B_COUPON",
        sqlstr = " SELECT ORIGIN_AIRPORT_ID,ORIGIN_AIRPORT_SEQ_ID,ORIGIN_CITY_MARKET_ID,DEST_AIRPORT_ID,DEST_AIRPORT_SEQ_ID,DEST_CITY_MARKET_ID FROM T_DB1B_COUPON WHERE Quarter=1 AND YEAR=2018",
        varlist = "ORIGIN_AIRPORT_ID,ORIGIN_AIRPORT_SEQ_ID,ORIGIN_CITY_MARKET_ID,DEST_AIRPORT_ID,DEST_AIRPORT_SEQ_ID,DEST_CITY_MARKET_ID",
        grouplist = "", suml = "",
        sumRegion = "", filter1 = "title=",
        filter2 = "title=", geo = "All\xa0",
        time = "Q+1", timename = "Quarter",
        GEOGRAPHY = "All", XYEAR = "2018",
        FREQUENCY = "1", 
        VarDesc = "ItinID", VarType = "Num", 
        VarDesc = "MktID", VarType = "Num", 
        VarDesc = "SeqNum", VarType = "Num", 
        VarDesc = "Coupons", VarType = "Num", 
        VarDesc = "Year", VarType = "Num", VarName = "ORIGIN_AIRPORT_ID", 
        VarDesc = "OriginAirportID", VarType = "Num", VarName = "ORIGIN_AIRPORT_SEQ_ID", 
        VarDesc = "OriginAirportSeqID", VarType = "Num", VarName = "ORIGIN_CITY_MARKET_ID", 
        VarDesc = "OriginCityMarketID", VarType = "Num", 
        VarDesc = "Quarter", VarType = "Num", 
        VarDesc = "Origin", VarType = "Char", 
        VarDesc = "OriginCountry", VarType = "Char", 
        VarDesc = "OriginStateFips", VarType = "Char", 
        VarDesc = "OriginState", VarType = "Char", 
        VarDesc = "OriginStateName", VarType = "Char", 
        VarDesc = "OriginWac", VarType = "Num", VarName = "DEST_AIRPORT_ID", 
        VarDesc = "DestAirportID", VarType = "Num", VarName = "DEST_AIRPORT_SEQ_ID", 
        VarDesc = "DestAirportSeqID", VarType = "Num", VarName = "DEST_CITY_MARKET_ID", 
        VarDesc = "DestCityMarketID", VarType = "Num", 
        VarDesc = "Dest", VarType = "Char", 
        VarDesc = "DestCountry", VarType = "Char", 
        VarDesc = "DestStateFips", VarType = "Char", 
        VarDesc = "DestState", VarType = "Char", 
        VarDesc = "DestStateName", VarType = "Char", 
        VarDesc = "DestWac", VarType = "Num", 
        VarDesc = "Break", VarType = "Char", 
        VarDesc = "CouponType", VarType = "Char", 
        VarDesc = "TkCarrier", VarType = "Char", 
        VarDesc = "OpCarrier", VarType = "Char", 
        VarDesc = "RPCarrier", VarType = "Char", 
        VarDesc = "Passengers", VarType = "Num", 
        VarDesc = "FareClass", VarType = "Char", 
        VarDesc = "Distance", VarType = "Num", 
        VarDesc = "DistanceGroup", VarType = "Num", 
        VarDesc = "Gateway", VarType = "Num", 
        VarDesc = "ItinGeoType", VarType = "Num", 
        VarDesc = "CouponGeoType", VarType = "Num"
      ), 
      encode = "form",
      query = list(
        Table_ID = "289",
        Has_Group = "0", 
        Is_Zipped = "0"
      )
    ) -> res
    

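    (Yes, the real one is even longer.)

    The sqlstr parameter holds the SQL query. I'm not sure how much of the POST body is actually "required", but it "works for me".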
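Since `sqlstr` and `varlist` are both derived from the same column list, they can be generated instead of typed by hand. The sketch below assumes the string layout seen in the request above; `build_fields` is a hypothetical helper, and whether the server accepts a different column set, or needs other fields such as `XYEAR` to agree with it, is not something the source establishes. Note that the values are plain strings: with `encode = "form"`, httr does the URL-encoding itself, so the `%2C` and `+` escapes shown in the browser's form data should not be passed in.

```r
# Hypothetical helper: build the `sqlstr` and `varlist` form fields from a
# vector of column names, mirroring the layout of the captured request.
build_fields <- function(columns, table, quarter, year) {
  varlist <- paste(columns, collapse = ",")
  sqlstr <- sprintf(
    " SELECT %s FROM %s WHERE Quarter=%d AND YEAR=%d",
    varlist, table, as.integer(quarter), as.integer(year)
  )
  list(sqlstr = sqlstr, varlist = varlist)
}

fields <- build_fields(
  c("ORIGIN_AIRPORT_ID", "DEST_AIRPORT_ID"),
  table = "T_DB1B_COUPON", quarter = 1, year = 2018
)
```

The two elements of `fields` could then replace the literal `sqlstr` and `varlist` entries in the `body` list.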

    res is definitely large and holds binary zip data:

    res
    ## Response [https://transtats.bts.gov/ftproot/TranStatsData/351117019_T_DB1B_COUPON.zip]
    ##   Date: 2018-10-14 02:18
    ##   Status: 200
    ##   Content-Type: application/x-zip-compressed
    ##   Size: 14.6 MB
    ## <BINARY BODY>
    

    We can save it to disk and make sure it is valid:

    (save_to <- file.path("~/Data", basename(grep("\\.zip", unlist(res$all_headers), value=TRUE))))
    ## [1] "~/Data/351117019_T_DB1B_COUPON.zip"
    
    writeBin(httr::content(res, as="raw"), save_to)
    
    unzip(save_to, list = TRUE)
    ##                          Name    Length                Date
    ## 1 351117019_T_DB1B_COUPON.csv 378311108 2018-10-13 22:18:00
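A cheap extra check before writing anything to disk is to look at the first bytes of the body: a non-empty ZIP archive normally starts with the local file header signature `PK\003\004`. This is a sketch only; `looks_like_zip` is a hypothetical helper, and it complements `unzip(list = TRUE)` rather than replacing it.

```r
# Hypothetical helper: TRUE when a raw vector starts with the ZIP local file
# header signature (0x50 0x4B 0x03 0x04). An archive with no entries starts
# with a different signature and is reported as FALSE here.
looks_like_zip <- function(bytes) {
  zip_magic <- as.raw(c(0x50, 0x4b, 0x03, 0x04))
  length(bytes) >= 4 && identical(bytes[1:4], zip_magic)
}
```

With the response above it would be called as `looks_like_zip(httr::content(res, as = "raw"))`; a `FALSE` result usually means the server returned an HTML error page instead of the archive.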
    

    【Comments】:
