【问题标题】：Improve speed of queries using complex survey design in R使用 R 中的复杂调查设计提高查询速度
【发布时间】：2015-11-29 14:03:09
【问题描述】：

我有一个大型数据集（超过 2000 万个 obs），我使用 survey 包进行分析，运行简单查询需要很长时间。我试图找到一种方法来加速我的代码，但我想知道是否有更好的方法来提高效率。

在我的基准测试中，我使用svyby/svytotal比较了三个命令的速度：

简单命令svyby/svytotal
foreachdopar 7 核并行计算
选项 2 的编译版本

剧透：选项 3 的速度是第一个选项的两倍多，但它不适合大型数据集，因为它依赖于并行计算，在处理大型数据时会很快达到内存限制套。尽管我有 16GB 的 RAM，我也面临这个问题。有几个solutions to this memory limitation，但都不适用于勘测设计对象。

关于如何使它更快并且不会因为内存限制而崩溃的任何想法？

我的代码带有可重现的示例：

# Load Packages
library(survey)
library(data.table)
library(compiler)
library(foreach) 
library(doParallel)
options(digits=3)

# Load Data
data(api)

# Convert data to data.table format (mostly to increase speed of the process)
apiclus1 <- as.data.table(apiclus1)

# Multiplicate data observations by 1000 
apiclus1 <- apiclus1[rep(seq_len(nrow(apiclus1)), 1000), ]

# create a count variable
apiclus1[, Vcount := 1]

# create survey design
dclus1 <- svydesign(id=~dnum, weights=~pw, data=apiclus1, fpc=~fpc)

1) 简单代码

t1 <- Sys.time()
table1 <- svyby(~Vcount,
                ~stype+dnum+cname,
                design = dclus1,
                svytotal)
T1 <- Sys.time() - t1

2) 使用 7 核的 foreach dopar 并行计算

# in this option, I create a list with different subsets of the survey design
# that will be passed to different CPU cores to work at the same time

subdesign <- function(i){ subset(dclus1, dnum==i)}
groups <- unique(apiclus1$dnum)
list_subsets <- lapply(groups[], subdesign) # apply function and get all     subsets in a list
i <- NULL

# Start Parallel
registerDoParallel(cores=7)

t2 <- Sys.time()
table2 <- foreach (i = list_subsets,  .combine= rbind, .packages="survey")     %dopar% {
  options( survey.lonely.psu = "remove" )
  svyby(~Vcount,
        ~stype+dnum+cname,
        design = i,
        svytotal)}
T2 <- Sys.time() - t2

3。选项 2 的编译版本

# make a function of the previous query
query2 <- function (list_subsets) { foreach (i = list_subsets,  .combine=     rbind, .packages="survey") %dopar% {
  svyby(~Vcount,
        ~stype+dnum+cname,
        design = i,
        svytotal)}}

# Compile the function to increase speed
query3 <- cmpfun(query2 )

t3 <- Sys.time()
table3 <- query3 (list_subsets)
T3 <- Sys.time() - t3

结果

>T1: 1.9 secs
>T2: 1.13 secs
>T3  0.58 secs

barplot(c(T1, T2, T3),  
        names.arg = c("1) simple table", "2) parallel", "3) compiled parallel"),
        ylab="Seconds")

【问题讨论】：

请参阅包 ref 中的 refdata 以获取在不为并行处理创建副本的情况下对数据进行子集化的选项。
我试过 refdata @A.Webb 但没有用。代码变慢了，它仍然达到内存限制。我可能做错了groups <- unique(apiclus1$dnum) subdesign <- function(i){ refdata(subset(dclus1, dnum==i))} list_subsets <- lapply(groups[], subdesign) i <- NULL table3 <- foreach (i = 1:length(groups), .combine= rbind, .packages=c("survey","ref")) %dopar% { options( survey.lonely.psu = "remove" ) svyby(~Vcount, ~stype+dnum+cname, design = derefdata(list_subsets[[i]]), svytotal)}
@RafaelPereira 一起使用 MonetDB.R 和 survey。例如，请参阅github.com/ajdamico/asdfree/…
嗨@RafaelPereira，你有没有在这方面取得任何进展？对 4000 万个观测调查数据集 (stackoverflow.com/questions/35210712/…) 有类似问题。尝试实施您描述的“foreach”方法并遇到相同的内存限制问题。
嗨@charlie。我还没有找到其他选择。但是，Anthony Damico 在下面提供了一个非常令人满意的答案。

标签： r performance memory survey

【解决方案1】：

感谢您将这个问题提出得这么好。在 R 中高效地处理大型调查数据集可能需要一些基本的 SQL 语法（这比 R 更容易学习）。 MonetDB 是唯一与survey 包兼容的大数据选项，探索其他高性能包（可能）不会有成果。通常，当我探索一个巨大的数据集时，我直接在 SQL 查询中编写而不是使用调查包，因为标准误差计算是计算密集型的（并且在交互式数据探索期间方差不那么有用）。注意最终的 SQL 时间戳是如何排除所有其他选项的。要计算快速加权平均值，请使用 "SELECT by_column , SUM( your_column * the_weight ) / SUM( the_weight ) FROM yourdata GROUP BY by_column"

当您确实需要交互式标准错误时，线性化 (svydesign) 通常比复制 (svrepdesign) 计算密集度更高，但有时创建复制设计（就像我在下面使用 jk1w_dclus1 所做的那样）需要比某些用户更熟悉的调查方法。

# Load Packages
library(MonetDB.R)
library(MonetDBLite)
library(DBI)   # suggested in comments and needed on OSX
library(survey)

# Load Data
data(api)

# Multiplicate data observations by 10000 
apiclus1 <- apiclus1[rep(seq_len(nrow(apiclus1)), 10000), ]

# create a count variable
apiclus1$vcount <- 1

# create survey design
dclus1 <- svydesign(id=~dnum, weights=~pw, data=apiclus1, fpc=~fpc)


dbfolder <- tempdir()

db <- dbConnect( MonetDBLite() , dbfolder )
dbWriteTable( db , 'apiclus1' , apiclus1 )


db_dclus1 <-
    svydesign(
        weight = ~pw ,
        id = ~dnum ,
        data = "apiclus1" , 
        dbtype = "MonetDBLite" ,
        dbname = dbfolder ,
        fpc = ~fpc
    )

# you provided a design without strata,
# so type="JK1" matches that most closely.
# but see survey:::as.svrepdesign for other linearization-to-replication options
jk1w <- jk1weights( psu = apiclus1$dnum , fpc = apiclus1$fpc )

# after the replicate-weights have been constructed,
# here's the `svrepdesign` call..
jk1w_dclus1 <-
    svrepdesign(
        weight = ~pw ,
        type = "JK1" ,
        repweights = jk1w$repweights ,
        combined.weights = FALSE ,
        scale = jk1w$scale ,
        rscales = jk1w$rscales ,
        data = 'apiclus1' ,
        dbtype = "MonetDBLite" ,
        dbname = dbfolder
    )

# slow
system.time(res1 <- svyby(~vcount,~stype+dnum+cname,design = dclus1,svytotal))
# > system.time(res1 <- svyby(~vcount,~stype+dnum+cname,design = dclus1,svytotal))
   # user  system elapsed 
  # 17.40    2.86   20.27 


# faster
system.time(res2 <- svyby(~vcount,~stype+dnum+cname,design = db_dclus1,svytotal))
# > system.time(res2 <- svyby(~vcount,~stype+dnum+cname,design = db_dclus1,svytotal))
   # user  system elapsed 
  # 13.00    1.20   14.18 


# fastest
system.time(res3 <- svyby(~vcount,~stype+dnum+cname,design = jk1w_dclus1,svytotal))
# > system.time(res3 <- svyby(~vcount,~stype+dnum+cname,design = jk1w_dclus1,svytotal))
   # user  system elapsed 
  # 10.75    1.19   11.96 

# same standard errors across the board
all.equal( SE( res1 ) , SE( res2 ) )
all.equal( SE( res2 ) , SE( res3 ) )
# NOTE: the replicate-weighted design will be slightly different
# for certain designs.  however this technique is defensible
# and gets used in 
# https://github.com/ajdamico/asdfree/tree/master/Censo%20Demografico


# at the point you do not care about standard errors,
# learn some sql:
system.time( res4 <- dbGetQuery( db , "SELECT stype , dnum , cname , SUM( pw ) FROM apiclus1 GROUP BY stype , dnum , cname" ) )
# because this is near-instantaneous, no matter how much data you have.

# same numbers as res1:
all.equal( as.numeric( sort( coef( res1 ) ) ) , sort( res4$L1 ) )
# > system.time( res4 <- dbGetQuery( db , "SELECT stype , dnum , cname , SUM( pw ) FROM apiclus1 GROUP BY stype , dnum , cname" ) )
   # user  system elapsed 
   # 0.15    0.20    0.23

【讨论】：

嗨。我无法复制你的答案。当我运行 db <- dbConnect( MonetDBLite() , dbfolder ) 行时，我收到以下错误：Error in MonetDBLite::monetdb_embedded_startup(embedded, !getOption("monetdb.debug.embedded", : unused argument (getOption("monetdb.sequential", TRUE))。对正在发生的事情有任何想法吗？我正在使用在最新的 Rstudio 0.99.893 和 Windows 10 中修订的 R 3.2.4
@RafaelPereira 尝试使用library(DBI)，如果这仍然不起作用，请打开一个单独的 stackoverflow 问题，其中包含一个最小的可重现示例——无法启动 monetdblite 是另一回事，谢谢
我刚刚创建了这个问题（但不确定它是否框架良好）Have a look
@AnthonyDamico 如果您需要运行svyglm，该怎么办？