How do I get local data into a read-only database using dplyr?

【Question title】: How do I get local data into a read-only database using dplyr?
【Posted】: 2021-04-01 19:01:23
【Question】:

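WRDS is a leading provider of research data to academics and other researchers in business and related fields. WRDS offers a PostgreSQL database, but it is read-only.

For some tasks, not being able to write data to the database is very limiting. For example, if I want to run an event study using daily stock returns, I need to merge my (relatively small) local data set events with crsp.dsf, which is about 18GB of data.

One option is to maintain my own database with a copy of crsp.dsf, write events to that database, and merge there. But I am looking for an option that lets me use the WRDS database for this purpose. Unfortunately, there is no way to use copy_to or dbWriteTable, because the WRDS database is read-only.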

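【Question comments】:

Tags: postgresql dbplyr wrds

【Solution 1】:

One option is to use something like the following function, which uses SQL to turn a local data frame into a remote data frame, even over a read-only connection.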

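# Turn a local data frame into a remote (lazy) table without writing to the database.
# Uses %>%, so dplyr (or magrittr) must be attached.
df_to_pg <- function(df, conn) {

    # Wrap one row's values in parentheses: (v1, v2, ...)
    collapse <- function(x) paste0("(", paste(x, collapse = ", "), ")")

    # Quoted, comma-separated column names
    names <- paste(DBI::dbQuoteIdentifier(conn, names(df)), collapse = ", ")

    # Quote each column as SQL literals, regroup by row, one "(...)" per row
    values <-
        df %>%
        lapply(DBI::dbQuoteLiteral, conn = conn) %>%
        purrr::transpose() %>%
        lapply(collapse) %>%
        paste(collapse = ",\n")

    # Inline the data as a VALUES list inside a subquery
    the_sql <- paste("SELECT * FROM (VALUES", values, ") AS t (", names, ")")

    # Lazy table backed by that query; nothing is written to the database
    temp_df_sql <- dplyr::tbl(conn, dplyr::sql(the_sql))

    return(temp_df_sql)
}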
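
To see what kind of query the function hands to the database, the sketch below rebuilds only the SQL string for a two-row data frame. The helper name values_sql is hypothetical (it is not part of the answer), and DBI::ANSI() is used as a stand-in connection so no database is needed; a real PostgreSQL connection may quote literals differently, for example by adding type casts.

```r
library(DBI)

# Hypothetical helper: builds the same kind of
# "SELECT * FROM (VALUES ...) AS t (...)" text that df_to_pg() wraps in dplyr::tbl().
values_sql <- function(df, conn) {
  # Quote every column as SQL literals (one character vector per column)
  cols <- lapply(df, function(col) as.character(dbQuoteLiteral(conn, col)))
  # Paste the columns together element-wise, giving one "(v1, v2)" per row
  rows <- paste0("(", do.call(paste, c(unname(cols), sep = ", ")), ")")
  # Quoted column names for the alias list
  col_names <- paste(dbQuoteIdentifier(conn, names(df)), collapse = ", ")
  paste0("SELECT * FROM (VALUES ", paste(rows, collapse = ", "),
         ") AS t (", col_names, ")")
}

small <- data.frame(firm_ids = 10000:10001,
                    date = as.Date(c("2020-03-14", "2020-03-15")))

# ANSI() is a dummy connection; it only supplies generic quoting rules
cat(values_sql(small, ANSI()), "\n")
```

Because the data travels inside the query text itself, the approach needs no write permission, but the query grows with the number of rows, so it suits small local tables such as events.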

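Here is an illustration of the function in use. It has been tested on PostgreSQL and SQL Server, but it does not work on SQLite (which lacks a VALUES keyword that works this way). I believe it should work on MySQL or Oracle, since they have the VALUES keyword.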

library(dplyr, warn.conflicts = FALSE)
library(DBI)
   
pg <- dbConnect(RPostgres::Postgres())     

events <- tibble(firm_ids = 10000:10024L,
                 date = seq(from = as.Date("2020-03-14"), 
                            length = length(firm_ids), 
                            by = 1))
events
#> # A tibble: 25 x 2
#>    firm_ids date      
#>       <int> <date>    
#>  1    10000 2020-03-14
#>  2    10001 2020-03-15
#>  3    10002 2020-03-16
#>  4    10003 2020-03-17
#>  5    10004 2020-03-18
#>  6    10005 2020-03-19
#>  7    10006 2020-03-20
#>  8    10007 2020-03-21
#>  9    10008 2020-03-22
#> 10    10009 2020-03-23
#> # … with 15 more rows

events_pg <- df_to_pg(events, pg)
events_pg
#> # Source:   SQL [?? x 2]
#> # Database: postgres [iangow@/tmp:5432/crsp]
#>    firm_ids date      
#>       <int> <date>    
#>  1    10000 2020-03-14
#>  2    10001 2020-03-15
#>  3    10002 2020-03-16
#>  4    10003 2020-03-17
#>  5    10004 2020-03-18
#>  6    10005 2020-03-19
#>  7    10006 2020-03-20
#>  8    10007 2020-03-21
#>  9    10008 2020-03-22
#> 10    10009 2020-03-23
#> # … with more rows

Created on 2021-04-01 by the reprex package (v1.0.0)
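
With events_pg available as a remote table, the merge described in the question can be written as an ordinary dplyr join that runs on the server. The following is a minimal sketch, not taken from the thread. It assumes that df_to_pg() and events are defined as above, that the connection can read crsp.dsf, that firm_ids holds CRSP PERMNOs, and that crsp.dsf has columns named permno, date and ret.

```r
library(dplyr, warn.conflicts = FALSE)
library(DBI)

# Assumes df_to_pg() and events from the answer are already defined and that
# environment variables (PGHOST, PGUSER, ...) point at the WRDS server.
pg <- dbConnect(RPostgres::Postgres())

events_pg <- df_to_pg(events, pg)

# Remote reference to the large table; nothing is downloaded at this point.
# The column names permno, date and ret are assumptions about crsp.dsf.
dsf <- tbl(pg, sql("SELECT permno, date, ret FROM crsp.dsf"))

event_rets <-
    events_pg %>%
    rename(permno = firm_ids) %>%                  # assumes firm_ids are PERMNOs
    inner_join(dsf, by = c("permno", "date")) %>%  # join is executed by the database
    collect()                                      # only matched rows are transferred

dbDisconnect(pg)
```

The join happens inside the database, so only the rows matching the 25 events are pulled into R rather than the full 18GB table.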

【Comments】:

This is great! But I can't reproduce your example in Oracle. Do you know how to rewrite the function so that it also works in Oracle? Here is the error message I get (shortened to fit the comment): Error: nanodbc/nanodbc.cpp:1617: 42S02: [Oracle][ODBC][Ora]ORA-00903: invalid table name 'SELECT * FROM (SELECT * FROM (VALUES (10000, '2020-03-14 UTC'), (10001, '2020-03-15 UTC'), ... (10024, '2020-04-07 UTC') ) AS t ( "firm_ids", "date" )) "q01" WHERE (0 = 1)'
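@CAJ I don't know. My guess is that it might make sense to change the function so that it returns the SQL and then (using a small table) adjust the SQL to fit the requirements of Oracle's syntax. It looks like the "AS t (" part may be getting parsed by Oracle as a table name.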
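
One way to act on that suggestion is to have a helper return the SQL as a string and avoid the "(VALUES ...) AS t (...)" form altogether. The sketch below emits one SELECT ... FROM dual per row and stacks them with UNION ALL. It is only an assumption about what Oracle would accept: the helper name df_to_dual_sql is hypothetical, the code has not been run against Oracle, and DBI::ANSI() stands in for a real connection.

```r
library(DBI)

# Hypothetical helper, not from the thread and not tested against Oracle.
# Returns SQL text only, so it can be inspected and adjusted by hand.
df_to_dual_sql <- function(df, conn) {
  ids <- as.character(dbQuoteIdentifier(conn, names(df)))
  cols <- Map(function(col, id) {
    lit <- if (inherits(col, "Date")) {
      # ANSI date literal (DATE 'YYYY-MM-DD'), which Oracle also accepts
      ifelse(is.na(col), "NULL", paste0("DATE '", format(col), "'"))
    } else {
      as.character(dbQuoteLiteral(conn, col))
    }
    # Give every value its column alias, e.g. 10000 AS "firm_ids"
    paste(lit, "AS", id)
  }, df, ids)
  # Combine the columns element-wise into one select list per row
  rows <- do.call(paste, c(unname(cols), sep = ", "))
  paste(paste("SELECT", rows, "FROM dual"), collapse = "\nUNION ALL\n")
}

small <- data.frame(firm_ids = 10000:10001,
                    date = as.Date(c("2020-03-14", "2020-03-15")))

# ANSI() is a dummy connection used here only for quoting
cat(df_to_dual_sql(small, ANSI()), "\n")
```

The resulting string could then be passed to dplyr::tbl(conn, dplyr::sql(...)) in the same way df_to_pg() does. Whether the subquery that dbplyr wraps around it is accepted by Oracle through odbc would still need to be checked on a small table first.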