【问题标题】:Group by and find count before doing pivot spark在进行枢轴火花之前分组并查找计数
【发布时间】:2019-03-17 17:06:12
【问题描述】:

我有一个如下所示的数据框

A   B   C       D
foo one small   1
foo one large   2
foo one large   2
foo two small   3

我需要 groupBy 基于 A 和 B pivot 在 C 列和 sum D 列

我可以使用

来做到这一点
df.groupBy("A", "B").pivot("C").sum("D") 

但是,如果我尝试类似的方法,我还需要在 groupBy 之后找到 count

df.groupBy("A", "B").pivot("C").agg(sum("D"), count)

我得到像

这样的输出
A   B   large   small large_count small_count

有没有办法在groupBy之后只得到一个count,然后再做pivot

【问题讨论】:

    标签: scala apache-spark databricks


    【解决方案1】:

    输出试试

    output.withColumn("count", $"large_count"+$"small_count").show

    如果您愿意,可以删除两个计数列

    在pivot try之前做 df.groupBy("A", "B").agg(count("C"))

    【讨论】:

    • 我不能做df.groupBy("A", "B").agg(count("C")) 因为数据透视仅适用于按数据分组
    • 您能发布您希望输出的样子吗?不知道你到底想要什么
    【解决方案2】:

    这是你所期待的吗?

    val df = Seq(("foo", "one", "small",   1),
    ("foo", "one", "large",   2),
    ("foo", "one", "large",   2),
    ("foo", "two", "small",   3)).toDF("A","B","C","D")
    
    scala> df.show
    +---+---+-----+---+
    |  A|  B|    C|  D|
    +---+---+-----+---+
    |foo|one|small|  1|
    |foo|one|large|  2|
    |foo|one|large|  2|
    |foo|two|small|  3|
    +---+---+-----+---+
    
    scala> val df2 = df.groupBy('A,'B).pivot("C").sum("D")
    df2: org.apache.spark.sql.DataFrame = [A: string, B: string ... 2 more fields]
    
    scala> val df3 = df.groupBy('A as "A1",'B as "B1").agg(sum('D) as "sumd")
    df3: org.apache.spark.sql.DataFrame = [A1: string, B1: string ... 1 more field]
    
    scala> df3.join(df2,'A==='A1 and 'B==='B1,"inner").select("A","B","sumd","large","small").show
    +---+---+----+-----+-----+
    |  A|  B|sumd|large|small|
    +---+---+----+-----+-----+
    |foo|one|   5|    4|    1|
    |foo|two|   3| null|    3|
    +---+---+----+-----+-----+
    
    
    scala>
    

    【讨论】:

    • 如果我的数据框很大怎么办,分组和加入会对性能产生巨大影响
    • 这取决于列 A 和 B 的唯一性。试一试!
    【解决方案3】:

    这不需要加入。这是你要找的吗?

    val df = Seq(("foo", "one", "small",   1),
    ("foo", "one", "large",   2),
    ("foo", "one", "large",   2),
    ("foo", "two", "small",   3)).toDF("A","B","C","D")
    
    scala> df.show
    +---+---+-----+---+
    |  A|  B|    C|  D|
    +---+---+-----+---+
    |foo|one|small|  1|
    |foo|one|large|  2|
    |foo|one|large|  2|
    |foo|two|small|  3|
    +---+---+-----+---+
    
    df.registerTempTable("dummy")
    
    spark.sql("SELECT * FROM (SELECT A , B , C , sum(D) as D from dummy group by A,B,C grouping sets ((A,B,C) ,(A,B)) order by A nulls last , B nulls last , C nulls last) dummy pivot (first(D) for C in ('large' large ,'small' small , null total))").show
    
    +---+---+-----+-----+-----+
    |  A|  B|large|small|total|
    +---+---+-----+-----+-----+
    |foo|one|    4|    1|    5|
    |foo|two| null|    3|    3|
    +---+---+-----+-----+-----+
    

    【讨论】:

      猜你喜欢
      • 2020-07-22
      • 1970-01-01
      • 1970-01-01
      • 1970-01-01
      • 2021-05-15
      • 2018-03-08
      • 1970-01-01
      • 1970-01-01
      • 1970-01-01
      相关资源
      最近更新 更多