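【Posted】: 2019-01-01 07:14:27
【Question】:
I have 500 million rows in a Spark data frame. I would like to use sample_n from dplyr because it lets me specify the sample size explicitly. If I were to use sparklyr::sdf_sample(), I would first have to compute sdf_nrow(), then compute the fraction of the data to keep as sample_size / nrow, and then pass that fraction to sdf_sample. This isn't a big deal, but sdf_nrow() can take a while to complete.
So it would be ideal to use dplyr::sample_n() directly. However, after some testing, sample_n() does not appear to be random. In fact, the results are identical to head()! If the function simply returns the first n rows instead of sampling rows at random, that would be a major problem.
Can anyone else confirm this? Is sdf_sample() my best option?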
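For reference, the sdf_nrow() plus sdf_sample() workaround described above could be sketched roughly as follows. The helper names sample_fraction and sample_about_n are hypothetical and introduced only for illustration. The sketch assumes that sdf_sample() returns approximately, not exactly, fraction * nrow rows, so the resulting sample size will vary slightly from run to run.

```r
# Hypothetical helper: fraction of rows to keep for a target sample size.
# Capped at 1 so that asking for more rows than exist keeps everything.
sample_fraction <- function(n_wanted, n_total) {
  stopifnot(n_wanted >= 0, n_total > 0)
  min(1, n_wanted / n_total)
}

# Hypothetical wrapper around the sdf_nrow() + sdf_sample() workflow.
# Note that sdf_nrow() triggers a full count, which may be slow on large tables.
sample_about_n <- function(tbl, n_wanted, seed = NULL) {
  n_total <- sparklyr::sdf_nrow(tbl)
  sparklyr::sdf_sample(
    tbl,
    fraction = sample_fraction(n_wanted, n_total),
    replacement = FALSE,
    seed = seed
  )
}
```

With the spark_data table created below, a call such as sample_about_n(spark_data, 300) would then stand in for sample_n(300).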
# install.packages("gapminder")
library(gapminder)
library(sparklyr)
library(dplyr)  # summarise(), sample_n(), sample_frac(), collect(), pull()
library(purrr)
sc <- spark_connect(master = "yarn-client")
spark_data <- sdf_import(gapminder, sc, "gapminder")
> # Appears to be random
> spark_data %>% sdf_sample(fraction = 0.20, replace = FALSE) %>% summarise(sample_mean = mean(lifeExp))
# Source: lazy query [?? x 1]
# Database: spark_connection
sample_mean
<dbl>
1 58.83397
> spark_data %>% sdf_sample(fraction = 0.20, replace = FALSE) %>% summarise(sample_mean = mean(lifeExp))
# Source: lazy query [?? x 1]
# Database: spark_connection
sample_mean
<dbl>
1 60.31693
> spark_data %>% sdf_sample(fraction = 0.20, replace = FALSE) %>% summarise(sample_mean = mean(lifeExp))
# Source: lazy query [?? x 1]
# Database: spark_connection
sample_mean
<dbl>
1 59.38692
>
>
> # Appears to be random
> spark_data %>% sample_frac(0.20) %>% summarise(sample_mean = mean(lifeExp))
# Source: lazy query [?? x 1]
# Database: spark_connection
sample_mean
<dbl>
1 60.48903
> spark_data %>% sample_frac(0.20) %>% summarise(sample_mean = mean(lifeExp))
# Source: lazy query [?? x 1]
# Database: spark_connection
sample_mean
<dbl>
1 59.44187
> spark_data %>% sample_frac(0.20) %>% summarise(sample_mean = mean(lifeExp))
# Source: lazy query [?? x 1]
# Database: spark_connection
sample_mean
<dbl>
1 59.27986
>
>
> # Does not appear to be random
> spark_data %>% sample_n(300) %>% summarise(sample_mean = mean(lifeExp))
# Source: lazy query [?? x 1]
# Database: spark_connection
sample_mean
<dbl>
1 57.78434
> spark_data %>% sample_n(300) %>% summarise(sample_mean = mean(lifeExp))
# Source: lazy query [?? x 1]
# Database: spark_connection
sample_mean
<dbl>
1 57.78434
> spark_data %>% sample_n(300) %>% summarise(sample_mean = mean(lifeExp))
# Source: lazy query [?? x 1]
# Database: spark_connection
sample_mean
<dbl>
1 57.78434
>
>
>
> # === Test sample_n() ===
> sample_mean <- list()
>
> for(i in 1:20){
+
+ sample_mean[i] <- spark_data %>% sample_n(300) %>% summarise(sample_mean = mean(lifeExp)) %>% collect() %>% pull()
+
+ }
>
>
> sample_mean %>% flatten_dbl() %>% mean()
[1] 57.78434
> sample_mean %>% flatten_dbl() %>% sd()
[1] 0
>
>
> # === Test head() ===
> spark_data %>%
+ head(300) %>%
+ pull(lifeExp) %>%
+ mean()
[1] 57.78434
【Comments】:
Tags: r apache-spark random dplyr sparklyr