计算组中满足条件的元素数答案

【问题标题】：Count number of elements in a group fulfilling a criteria计算组中满足条件的元素数
【发布时间】：2018-07-26 02:09:18
【问题描述】：

我想按特定列对 sparklyr 表的行进行分组，并对满足特定条件的行进行计数。

例如，在下面的钻石表中，我想按颜色分组，并计算价格>400的行数。

> library(sparklyr)
> library(tidyverse)
> con = spark_connect(....)

> diamonds = copy_to(con, diamonds)
> diamonds
# Source:   table<diamonds> [?? x 10]
# Database: spark_connection
   carat cut       color clarity depth table price     x     y     z
   <dbl> <chr>     <chr> <chr>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1 0.230 Ideal     E     SI2      61.5  55.0   326  3.95  3.98  2.43
 2 0.210 Premium   E     SI1      59.8  61.0   326  3.89  3.84  2.31
 3 0.230 Good      E     VS1      56.9  65.0   327  4.05  4.07  2.31
 4 0.290 Premium   I     VS2      62.4  58.0   334  4.20  4.23  2.63
 5 0.310 Good      J     SI2      63.3  58.0   335  4.34  4.35  2.75
 6 0.240 Very Good J     VVS2     62.8  57.0   336  3.94  3.96  2.48
 7 0.240 Very Good I     VVS1     62.3  57.0   336  3.95  3.98  2.47
 8 0.260 Very Good H     SI1      61.9  55.0   337  4.07  4.11  2.53
 9 0.220 Fair      E     VS2      65.1  61.0   337  3.87  3.78  2.49
10 0.230 Very Good H     VS1      59.4  61.0   338  4.00  4.05  2.39

这是一项我会在普通 R 中以多种方式完成的任务。但是在 sparklyr 中没有一个工作。

例如：

 > diamonds_sdl %>% group_by(color) %>% summarise(n=n(), n_expensive=sum(price>400))
 > diamonds_sdl %>% group_by(color) %>% summarise(n=n(), n_expensive=length(price[price>400]))

这适用于传统数据框：

# A tibble: 7 x 3
  color     n n_expensive
  <ord> <int>       <int>
1 D      6775        6756
2 E      9797        9758
3 F      9542        9517
4 G     11292       11257
5 H      8304        8274
6 I      5422        5379
7 J      2808        2748

但不是在火花中：

diamonds_sdl %>% group_by(color) %>% summarise(n=n(), n_expensive=sum(price>400))
Error: org.apache.spark.sql.AnalysisException: cannot resolve 'sum((CAST(diamonds.`price` AS BIGINT) > 400L))' due to data type mismatch: function sum requires numeric types, not BooleanType; l
ine 1 pos 33;

Error in eval_bare(call, env) : object 'price' not found

【问题讨论】：

标签： r apache-spark group-by tidyverse sparklyr

【解决方案1】：

类型可能有冲突。将逻辑转换为integer 即可解决问题

library(sparklyr)
library(dplyr)
con <- spark_connect(master = "local")
library(ggplot2)
data(diamonds)
diamonds1 = copy_to(con, diamonds)
diamonds1 %>% 
         group_by(color) %>%
         summarise(n=n(), n_expensive = sum(as.integer(price > 400)))

-输出

【讨论】：

【解决方案2】：

您必须在这里考虑 SQL 表达式，例如if_else：

diamonds_sdl %>% group_by(color) %>% 
  summarise(n=n(), n_expensive=sum(if_else(price > 400, 1, 0)))

sum 有演员表：

diamonds_sdl %>% group_by(color) %>%
   summarise(n=n(), n_expensive=sum(as.numeric(price > 400)))

【讨论】：