【发布时间】:2019-04-11 12:51:46
【问题描述】:
我有一个数据框 joinDf 是通过在 userId 上加入以下四个数据框创建的:
val detailsDf = Seq((123,"first123","xyz"))
.toDF("userId","firstName","address")
val emailDf = Seq((123,"abc@gmail.com"),
(123,"def@gmail.com"))
.toDF("userId","email")
val foodDf = Seq((123,"food2",false,"Italian",2),
(123,"food3",true,"American",3),
(123,"food1",true,"Mediterranean",1))
.toDF("userId","foodName","isFavFood","cuisine","score")
val gameDf = Seq((123,"chess",false,2),
(123,"football",true,1))
.toDF("userId","gameName","isOutdoor","score")
val joinDf = detailsDf
.join(emailDf, Seq("userId"))
.join(foodDf, Seq("userId"))
.join(gameDf, Seq("userId"))
User 的食物和游戏收藏应按分数升序排列。
我正在尝试从此joinDf 创建一个结果,其中 JSON 如下所示:
[
{
"userId": "123",
"firstName": "first123",
"address": "xyz",
"UserFoodFavourites": [
{
"foodName": "food1",
"isFavFood": "true",
"cuisine": "Mediterranean",
},
{
"foodName": "food2",
"isFavFood": "false",
"cuisine": "Italian",
},
{
"foodName": "food3",
"isFavFood": "true",
"cuisine": "American",
}
]
"UserEmail": [
"abc@gmail.com",
"def@gmail.com"
]
"UserGameFavourites": [
{
"gameName": "football",
"isOutdoor": "true"
},
{
"gameName": "chess",
"isOutdoor": "false"
}
]
}
]
我应该使用joinDf.groupBy().agg(collect_set())吗?
任何帮助将不胜感激。
【问题讨论】:
标签: apache-spark dataframe apache-spark-sql apache-spark-dataset