【发布时间】:2018-07-26 02:09:18
【问题描述】:
我想按特定列对 sparklyr 表的行进行分组,并对满足特定条件的行进行计数。
例如,在下面的钻石表中,我想按颜色分组,并计算价格>400的行数。
> library(sparklyr)
> library(tidyverse)
> con = spark_connect(....)
> diamonds = copy_to(con, diamonds)
> diamonds
# Source: table<diamonds> [?? x 10]
# Database: spark_connection
carat cut color clarity depth table price x y z
<dbl> <chr> <chr> <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.230 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43
2 0.210 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31
3 0.230 Good E VS1 56.9 65.0 327 4.05 4.07 2.31
4 0.290 Premium I VS2 62.4 58.0 334 4.20 4.23 2.63
5 0.310 Good J SI2 63.3 58.0 335 4.34 4.35 2.75
6 0.240 Very Good J VVS2 62.8 57.0 336 3.94 3.96 2.48
7 0.240 Very Good I VVS1 62.3 57.0 336 3.95 3.98 2.47
8 0.260 Very Good H SI1 61.9 55.0 337 4.07 4.11 2.53
9 0.220 Fair E VS2 65.1 61.0 337 3.87 3.78 2.49
10 0.230 Very Good H VS1 59.4 61.0 338 4.00 4.05 2.39
这是一项我会在普通 R 中以多种方式完成的任务。但是在 sparklyr 中没有一个工作。
例如:
> diamonds_sdl %>% group_by(color) %>% summarise(n=n(), n_expensive=sum(price>400))
> diamonds_sdl %>% group_by(color) %>% summarise(n=n(), n_expensive=length(price[price>400]))
这适用于传统数据框:
# A tibble: 7 x 3
color n n_expensive
<ord> <int> <int>
1 D 6775 6756
2 E 9797 9758
3 F 9542 9517
4 G 11292 11257
5 H 8304 8274
6 I 5422 5379
7 J 2808 2748
但不是在火花中:
diamonds_sdl %>% group_by(color) %>% summarise(n=n(), n_expensive=sum(price>400))
Error: org.apache.spark.sql.AnalysisException: cannot resolve 'sum((CAST(diamonds.`price` AS BIGINT) > 400L))' due to data type mismatch: function sum requires numeric types, not BooleanType; l
ine 1 pos 33;
Error in eval_bare(call, env) : object 'price' not found
【问题讨论】:
标签: r apache-spark group-by tidyverse sparklyr