[Question title]: Is sample_n really a random sample when used with sparklyr?
[Posted]: 2019-01-01 07:14:27
[Question description]:

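I have 500 million rows in a Spark data frame. I'm interested in using sample_n from dplyr because it lets me explicitly specify the sample size I want. If I were to use sparklyr::sdf_sample(), I would first have to compute sdf_nrow(), then work out the fraction of the data as sample_size / nrow, and then pass that fraction to sdf_sample(). This isn't a big deal, but sdf_nrow() can take a while to complete.

So it would be ideal to use dplyr::sample_n() directly. However, after some testing, sample_n() does not appear to be random. In fact, the results are identical to head()! If the function simply returns the first n rows instead of sampling rows at random, that would be a major problem.

Can anyone else confirm this? Is sdf_sample() my best option?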

# install.packages("gapminder")

library(gapminder)
library(sparklyr)
library(purrr)

sc <- spark_connect(master = "yarn-client")

spark_data <- sdf_import(gapminder, sc, "gapminder")


> # Appears to be random
> spark_data %>% sdf_sample(fraction = 0.20, replace = FALSE) %>% summarise(sample_mean = mean(lifeExp))
# Source:   lazy query [?? x 1]
# Database: spark_connection
  sample_mean
        <dbl>
1    58.83397


> spark_data %>% sdf_sample(fraction = 0.20, replace = FALSE) %>% summarise(sample_mean = mean(lifeExp))
# Source:   lazy query [?? x 1]
# Database: spark_connection
  sample_mean
        <dbl>
1    60.31693


> spark_data %>% sdf_sample(fraction = 0.20, replace = FALSE) %>% summarise(sample_mean = mean(lifeExp))
# Source:   lazy query [?? x 1]
# Database: spark_connection
  sample_mean
        <dbl>
1    59.38692
> 
> 
> # Appears to be random
> spark_data %>% sample_frac(0.20) %>% summarise(sample_mean = mean(lifeExp))
# Source:   lazy query [?? x 1]
# Database: spark_connection
  sample_mean
        <dbl>
1    60.48903


> spark_data %>% sample_frac(0.20) %>% summarise(sample_mean = mean(lifeExp))
# Source:   lazy query [?? x 1]
# Database: spark_connection
  sample_mean
        <dbl>
1    59.44187


> spark_data %>% sample_frac(0.20) %>% summarise(sample_mean = mean(lifeExp))
# Source:   lazy query [?? x 1]
# Database: spark_connection
  sample_mean
        <dbl>
1    59.27986
> 
> 
> # Does not appear to be random
> spark_data %>% sample_n(300) %>% summarise(sample_mean = mean(lifeExp))
# Source:   lazy query [?? x 1]
# Database: spark_connection
  sample_mean
        <dbl>
1    57.78434


> spark_data %>% sample_n(300) %>% summarise(sample_mean = mean(lifeExp))
# Source:   lazy query [?? x 1]
# Database: spark_connection
  sample_mean
        <dbl>
1    57.78434


> spark_data %>% sample_n(300) %>% summarise(sample_mean = mean(lifeExp))
# Source:   lazy query [?? x 1]
# Database: spark_connection
  sample_mean
        <dbl>
1    57.78434
> 
> 
> 
> # === Test sample_n() ===
> sample_mean <- list()
> 
> for(i in 1:20){
+   
+   sample_mean[i] <- spark_data %>% sample_n(300) %>% summarise(sample_mean = mean(lifeExp)) %>% collect() %>% pull()
+   
+ }
> 
> 
> sample_mean %>% flatten_dbl() %>% mean()
[1] 57.78434
> sample_mean %>% flatten_dbl() %>% sd()
[1] 0
> 
> 
> # === Test head() ===
> spark_data %>% 
+   head(300) %>% 
+   pull(lifeExp) %>% 
+   mean()
[1] 57.78434

[Comments]:

    Tags: r apache-spark random dplyr sparklyr


    [Solution 1]:

    It is not. If you check the execution plan (using the optimizedPlan function defined here), you'll see that it is just a limit:

    spark_data %>% sample_n(300) %>% optimizedPlan()
    
    <jobj[168]>
      org.apache.spark.sql.catalyst.plans.logical.GlobalLimit
      GlobalLimit 300
    +- LocalLimit 300
       +- InMemoryRelation [country#151, continent#152, year#153, lifeExp#154, pop#155, gdpPercap#156], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas), `gapminder`
             +- Scan ExistingRDD[country#151,continent#152,year#153,lifeExp#154,pop#155,gdpPercap#156] 
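
The optimizedPlan helper called above is defined elsewhere and is not reproduced in this post. A minimal sketch of what such a helper might look like, assuming it simply asks the underlying Spark Dataset for its optimized logical plan through sparklyr's invoke(); this is an illustration and not necessarily the linked definition:

```r
# Hypothetical reconstruction; the original definition is only linked, not shown.
# Returns the optimized logical plan of the Spark DataFrame behind a tbl_spark.
optimizedPlan <- function(df) {
  sdf <- sparklyr::spark_dataframe(df)            # Java reference to the Dataset
  qe  <- sparklyr::invoke(sdf, "queryExecution")  # Dataset.queryExecution
  sparklyr::invoke(qe, "optimizedPlan")           # QueryExecution.optimizedPlan
}
```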
    

    show_query confirms this further:

    spark_data %>% sample_n(300) %>% show_query()
    
    <SQL>
    SELECT *
    FROM (SELECT *
    FROM `gapminder` TABLESAMPLE (300 rows) ) `hntcybtgns`
    

    as does the visualized execution plan:

    Finally, if you check the Spark source, you'll see that this case is implemented with a simple LIMIT:

    case ctx: SampleByRowsContext =>
      Limit(expression(ctx.expression), query)
    

    I believe these semantics were inherited from Hive, where the equivalent query takes the first n rows from each input split.

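    In practice, getting a sample of an exact size is very expensive, and you should avoid it unless strictly necessary (the same goes for large LIMITs).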
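
For the asker's use case, one way to avoid an exact-size sample is to turn the target size into a fraction and accept an approximate row count. The sketch below is an illustration under stated assumptions, not part of the original answer: sample_fraction, sample_approx_n and sample_exact_n are hypothetical helper names, total_rows is assumed to be known or estimated by the caller (for example from a cached sdf_nrow() result), and the exact variant assumes that rand() is passed through to Spark SQL untranslated.

```r
# Hypothetical helpers; names and structure are illustrative assumptions.

# Convert a target sample size into a sampling fraction, capped at 1.
sample_fraction <- function(n, total_rows) {
  stopifnot(n > 0, total_rows > 0)
  min(1, n / total_rows)
}

# Approximate-size sample: each row is kept independently, so the result
# has roughly (not exactly) n rows. No ordering or global limit is involved.
sample_approx_n <- function(tbl, n, total_rows, seed = NULL) {
  sparklyr::sdf_sample(
    tbl,
    fraction = sample_fraction(n, total_rows),
    replacement = FALSE,
    seed = seed
  )
}

# Exact-size sample: tag every row with a random key, order by it, keep the
# first n rows. Every row has to be scanned and ranked, which is the costly part.
# Assumes rand() reaches Spark SQL as-is.
sample_exact_n <- function(tbl, n) {
  keyed  <- dplyr::mutate(tbl, sample_key = rand())
  sorted <- dplyr::arrange(keyed, sample_key)
  dplyr::select(utils::head(sorted, n), -sample_key)
}
```

With the gapminder table from the question (1704 rows), sample_approx_n(spark_data, 300, 1704) would request a fraction of about 0.176, so the returned sample would hover around 300 rows rather than hit it exactly.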

    [Comments]:
