【Question title】: How to use Dataset to group by, but keep entire rows
【Posted】: 2018-08-14 11:42:51
【Question】:

Reading this post, I am wondering how we can group a Dataset by a key while keeping multiple columns of each row.

For example:

val test = Seq(("New York", "Jack", "jdhj"),
    ("Los Angeles", "Tom", "ff"),
    ("Chicago", "David", "ff"),
    ("Houston", "John", "dd"),
    ("Detroit", "Michael", "fff"),
    ("Chicago", "Andrew", "ddd"),
    ("Detroit", "Peter", "dd"),
    ("Detroit", "George", "dkdjkd")
  )

I would like to get

Chicago, [("David", "ff"), ("Andrew", "ddd")]

【Comments】:

    Tags: apache-spark dataset


    【Answer 1】:

    In the link you provided in the question, I suggested the case class approach. This answer takes a different route.

    RDD way

    You can simply do the following

    val rdd = sc.parallelize(test)                  // create an RDD from test
    val resultRdd = rdd.groupBy(x => x._1)          // group by the first element
      .mapValues(x => x.map(y => (y._2, y._3)))     // keep the second and third elements of each grouped row
    

    resultRdd.foreach(println) should give you (the order of the lines may vary)

    (New York,List((Jack,jdhj)))
    (Houston,List((John,dd)))
    (Chicago,List((David,ff), (Andrew,ddd)))
    (Detroit,List((Michael,fff), (Peter,dd), (George,dkdjkd)))
    (Los Angeles,List((Tom,ff)))
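The RDD methods used here mirror the Scala collections API, so the grouping logic can be sanity-checked on the local `test` sequence before running it on a cluster. The following is a minimal sketch with plain Scala collections and no Spark involved; the object name `LocalGroupingSketch` and the helper `groupRows` are hypothetical names chosen for illustration.

```scala
object LocalGroupingSketch {
  // Same sample data as in the question.
  val test: Seq[(String, String, String)] = Seq(
    ("New York", "Jack", "jdhj"),
    ("Los Angeles", "Tom", "ff"),
    ("Chicago", "David", "ff"),
    ("Houston", "John", "dd"),
    ("Detroit", "Michael", "fff"),
    ("Chicago", "Andrew", "ddd"),
    ("Detroit", "Peter", "dd"),
    ("Detroit", "George", "dkdjkd")
  )

  // Group by the first element and keep the remaining two elements of each row.
  def groupRows(rows: Seq[(String, String, String)]): Map[String, Seq[(String, String)]] =
    rows.groupBy(_._1).map { case (city, grouped) =>
      city -> grouped.map { case (_, name, value) => (name, value) }
    }

  def main(args: Array[String]): Unit = {
    println(groupRows(test)("Chicago")) // prints List((David,ff), (Andrew,ddd))
  }
}
```

On a local collection the order of rows inside each group follows the input order; on an RDD that order is not guaranteed.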
    

    Converting the RDD to a DataFrame

    If you need the output in tabular format, you can call .toDF() after converting each group's values to an array

    val df = resultRdd.map(x => (x._1, x._2.toArray)).toDF()
    

    df.show(false) should give you

    +-----------+--------------------------------------------+
    |_1         |_2                                          |
    +-----------+--------------------------------------------+
    |New York   |[[Jack,jdhj]]                               |
    |Houston    |[[John,dd]]                                 |
    |Chicago    |[[David,ff], [Andrew,ddd]]                  |
    |Detroit    |[[Michael,fff], [Peter,dd], [George,dkdjkd]]|
    |Los Angeles|[[Tom,ff]]                                  |
    +-----------+--------------------------------------------+
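The default column names `_1` and `_2` come from the tuple fields. A minimal sketch of the same conversion with explicit column names follows. It assumes a local `SparkSession` created with `master("local[*]")`; the object name `NamedColumnsSketch`, the method `groupedDf`, and the column names `location` and `data` are illustrative choices, not part of the original answer.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object NamedColumnsSketch {
  // Groups the rows by their first element and returns a DataFrame
  // with named columns instead of the default _1 and _2.
  def groupedDf(spark: SparkSession, rows: Seq[(String, String, String)]): DataFrame = {
    import spark.implicits._ // brings .toDF into scope
    spark.sparkContext.parallelize(rows)
      .groupBy(_._1)
      .mapValues(_.map(r => (r._2, r._3)).toSeq)
      .toDF("location", "data")
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("named-columns-sketch")
      .getOrCreate()

    val test = Seq(
      ("New York", "Jack", "jdhj"),
      ("Chicago", "David", "ff"),
      ("Chicago", "Andrew", "ddd")
    )

    groupedDf(spark, test).show(false)
    spark.stop()
  }
}
```

Only the top-level columns are renamed this way; the fields of the nested structs keep their tuple names `_1` and `_2`.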
    

    【Comments】:

      【Answer 2】:

      Create a case class as follows

      case class TestData (location: String, name: String, value: String)
      

      Dummy data

      val test = Seq(("New York", "Jack", "jdhj"),
          ("Los Angeles", "Tom", "ff"),
          ("Chicago", "David", "ff"),
          ("Houston", "John", "dd"),
          ("Detroit", "Michael", "fff"),
          ("Chicago", "Andrew", "ddd"),
          ("Detroit", "Peter", "dd"),
          ("Detroit", "George", "dkdjkd")
        )
      //change each row to TestData object 
          .map(x => TestData(x._1, x._2, x._3))
          .toDS() // create dataset from above data 
      

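      Getting the required output

      import org.apache.spark.sql.functions.{collect_list, struct} // functions used below
      test.groupBy($"location")
          .agg(collect_list(struct("name", "value")).as("data"))
          .show(false)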
          .show(false)
      

      Output:

      +-----------+--------------------------------------------+
      |location   |data                                        |
      +-----------+--------------------------------------------+
      |Los Angeles|[[Tom,ff]]                                  |
      |Detroit    |[[Michael,fff], [Peter,dd], [George,dkdjkd]]|
      |Chicago    |[[David,ff], [Andrew,ddd]]                  |
      |Houston    |[[John,dd]]                                 |
      |New York   |[[Jack,jdhj]]                               |
      +-----------+--------------------------------------------+
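Note that `groupBy(...).agg(...)` returns an untyped `DataFrame`. If a typed `Dataset` result is wanted, one possible alternative is `groupByKey` followed by `mapGroups`. The sketch below is an assumption-laden illustration rather than part of the original answer: it redefines the `TestData` case class so the block is self-contained, uses a local `SparkSession`, and the object name `TypedGroupingSketch` and method `groupByLocation` are hypothetical.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Defined at top level so Spark can derive an encoder for it.
case class TestData(location: String, name: String, value: String)

object TypedGroupingSketch {
  // Groups rows by location and keeps (name, value) pairs, staying in the typed API.
  def groupByLocation(spark: SparkSession,
                      ds: Dataset[TestData]): Dataset[(String, Seq[(String, String)])] = {
    import spark.implicits._ // encoders for the key and the result tuple
    ds.groupByKey(_.location)
      .mapGroups { (location, rows) =>
        // rows is an Iterator; materialize it into a strict collection
        val pairs: Seq[(String, String)] = rows.map(r => (r.name, r.value)).toList
        (location, pairs)
      }
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("typed-grouping-sketch")
      .getOrCreate()
    import spark.implicits._

    val ds = Seq(
      TestData("Chicago", "David", "ff"),
      TestData("Houston", "John", "dd"),
      TestData("Chicago", "Andrew", "ddd")
    ).toDS()

    groupByLocation(spark, ds).show(false)
    spark.stop()
  }
}
```

The Spark API documentation notes that `mapGroups` does not support partial aggregation, so it is worth comparing both variants on realistic data before choosing one.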
      

      【Comments】:
