【Posted】: 2019-09-12 22:48:17
【Question】:
I have two DataFrames, shown below.
Df1
+----------------------+---------+
|products              |visitorId|
+----------------------+---------+
|[[i1,0.68], [i2,0.42]]|v1       |
|[[i1,0.78], [i3,0.11]]|v2       |
+----------------------+---------+
Df2
+---+----------+
| id|      name|
+---+----------+
| i1|Nike Shoes|
| i2|  Umbrella|
| i3|     Jeans|
+---+----------+
This is the schema of Df1:
root
|-- products: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: string (nullable = true)
| | |-- interest: double (nullable = true)
|-- visitorId: string (nullable = true)
I want to join the two DataFrames so that the output is:
+------------------------------------------+---------+
|products                                  |visitorId|
+------------------------------------------+---------+
|[[i1,0.68,Nike Shoes], [i2,0.42,Umbrella]]|v1       |
|[[i1,0.78,Nike Shoes], [i3,0.11,Jeans]]   |v2       |
+------------------------------------------+---------+
This is the output schema I expect:
root
|-- products: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: string (nullable = true)
| | |-- interest: double (nullable = true)
| | |-- name: string (nullable = true)
|-- visitorId: string (nullable = true)
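How can I do this in Scala? I am using Spark 2.2.0.
Update
I exploded the products array in Df1 and joined the result with Df2 on id, which gave the following output.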
+---------+---+--------+----------+
|visitorId| id|interest|      name|
+---------+---+--------+----------+
|       v1| i1|    0.68|Nike Shoes|
|       v1| i2|    0.42|  Umbrella|
|       v2| i1|    0.78|Nike Shoes|
|       v2| i3|    0.11|     Jeans|
+---------+---+--------+----------+
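For reference, the explode-and-join step described in the update might look roughly like the sketch below. It is not the asker's actual code: the `Product` case class, the `ExplodeJoinSketch` object, and its helper names are hypothetical and exist only to rebuild the sample data. It uses a left join so that products without a match in Df2 are kept with a null name; an inner join is the other obvious choice. Row order in the result is not guaranteed.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode}

// Hypothetical case class, used only to rebuild the sample data from the question.
case class Product(id: String, interest: Double)

object ExplodeJoinSketch {

  // Rebuilds Df1 and Df2 from the question as (df1, df2).
  def sampleFrames(spark: SparkSession): (DataFrame, DataFrame) = {
    import spark.implicits._
    val df1 = Seq(
      (Seq(Product("i1", 0.68), Product("i2", 0.42)), "v1"),
      (Seq(Product("i1", 0.78), Product("i3", 0.11)), "v2")
    ).toDF("products", "visitorId")
    val df2 = Seq(("i1", "Nike Shoes"), ("i2", "Umbrella"), ("i3", "Jeans"))
      .toDF("id", "name")
    (df1, df2)
  }

  // One row per (visitor, product), then look up the product name by id.
  def explodeAndJoin(df1: DataFrame, df2: DataFrame): DataFrame =
    df1
      .withColumn("product", explode(col("products")))
      .select(
        col("visitorId"),
        col("product.id").as("id"),
        col("product.interest").as("interest"))
      .join(df2, Seq("id"), "left") // assumption: keep products with no name
      .select("visitorId", "id", "interest", "name")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("explode-join-sketch")
      .getOrCreate()
    val (df1, df2) = sampleFrames(spark)
    explodeAndJoin(df1, df2).show()
    spark.stop()
  }
}
```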
Now I just need the DataFrame above in the following JSON format.
{
  "visitorId": "v1",
  "products": [{
    "id": "i1",
    "name": "Nike Shoes",
    "interest": 0.68
  }, {
    "id": "i2",
    "name": "Umbrella",
    "interest": 0.42
  }]
},
{
  "visitorId": "v2",
  "products": [{
    "id": "i1",
    "name": "Nike Shoes",
    "interest": 0.78
  }, {
    "id": "i3",
    "name": "Jeans",
    "interest": 0.11
  }]
}
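One way to get from the flat DataFrame to this shape is to pack each row's product fields into a struct, group by visitorId, and collect the structs into an array. The sketch below assumes the flat DataFrame has exactly the columns visitorId, id, interest, and name; the `RegroupToJsonSketch` object and its `regroup` helper are hypothetical names. The order of elements inside each products array is not guaranteed by `collect_list`, and Spark emits one JSON object per line rather than a comma-separated list.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, collect_list, struct}

object RegroupToJsonSketch {

  // Packs (id, interest, name) into a struct and collects one array per visitor.
  // The result has columns visitorId and products: array<struct<id, interest, name>>.
  def regroup(flat: DataFrame): DataFrame =
    flat
      .groupBy("visitorId")
      .agg(collect_list(struct(col("id"), col("interest"), col("name"))).as("products"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("regroup-to-json-sketch")
      .getOrCreate()
    import spark.implicits._

    // The exploded-and-joined rows from the update, rebuilt by hand.
    val flat = Seq(
      ("v1", "i1", 0.68, "Nike Shoes"),
      ("v1", "i2", 0.42, "Umbrella"),
      ("v2", "i1", 0.78, "Nike Shoes"),
      ("v2", "i3", 0.11, "Jeans")
    ).toDF("visitorId", "id", "interest", "name")

    val nested = regroup(flat)
    nested.printSchema()

    // toJSON yields one JSON string per row (one object per visitor).
    nested.toJSON.collect().foreach(println)

    spark.stop()
  }
}
```

To write files instead of printing, `nested.write.json(path)` produces the same one-object-per-line layout, where `path` is an output directory of your choosing.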
【Comments】:
- Explode, join, collect_list?
Tags: scala apache-spark dataframe hadoop apache-spark-sql