【Posted】: 2020-11-10 17:11:42
【Question】:
I need to define a test sample containing an ArrayType column for Spark to read. The data schema looks like this:
 |-- data: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: integer (nullable = true)
 |    |    |-- stat: float (nullable = true)
 |-- naming: string (nullable = true)
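With my current definition, the data field comes back as null for every row, so how should I structure this data in a CSV file?
This is what my CSV file looks like right now: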
"data1_id","data1_stat","data2_id","data2_stat","data3_id","data3_stat","naming"
"1","0.76","2","0.55","3","0.16","Default1"
"1","0.2","2","0.41","3","0.89","Default2"
"1","0.96","2","0.12","3","0.4","Default3"
"1","0.28","2","0.15","3","0.31","Default4"
"1","0.84","2","0.41","3","0.15","Default5"
When I call show on the input DataFrame, I get this result:
+----+--------+
|data|naming  |
+----+--------+
|null|Default1|
|null|Default2|
|null|Default3|
|null|Default4|
|null|Default5|
+----+--------+
Expected result:
+----------------------------+--------+
|data                        |naming  |
+----------------------------+--------+
|[[1,0.76],[2,0.55],[3,0.16]]|Default1|
|[[1,0.2],[2,0.41],[3,0.89]] |Default2|
|[[1,0.96],[2,0.12],[3,0.4]] |Default3|
|[[1,0.28],[2,0.15],[3,0.31]]|Default4|
|[[1,0.84],[2,0.41],[3,0.15]]|Default5|
+----------------------------+--------+
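One possible direction, as a rough sketch rather than a confirmed fix: as far as I know, Spark's CSV source only handles flat columns, so a header like the one above cannot populate an array-of-struct field directly. The plain-Python snippet below folds the flat data1_id / data1_stat ... columns into a nested record per row and writes it out as JSON Lines, a format that can carry the nested schema. The helper names (`flat_row_to_nested`, `csv_to_json_lines`), the `n_items` parameter, and the inlined two-row sample are assumptions made for illustration.

```python
import csv
import io
import json

# Two rows of the flat CSV from the question, inlined so the sketch is self-contained.
FLAT_CSV = '''"data1_id","data1_stat","data2_id","data2_stat","data3_id","data3_stat","naming"
"1","0.76","2","0.55","3","0.16","Default1"
"1","0.2","2","0.41","3","0.89","Default2"
'''


def flat_row_to_nested(row, n_items=3):
    """Fold the dataN_id / dataN_stat columns into one list of {id, stat} dicts.

    Hypothetical helper: n_items is the number of (id, stat) pairs per row.
    """
    data = [
        {"id": int(row[f"data{i}_id"]), "stat": float(row[f"data{i}_stat"])}
        for i in range(1, n_items + 1)
    ]
    return {"data": data, "naming": row["naming"]}


def csv_to_json_lines(text, n_items=3):
    """Return one JSON document per CSV row, matching the nested schema."""
    reader = csv.DictReader(io.StringIO(text))
    return [json.dumps(flat_row_to_nested(row, n_items)) for row in reader]


json_lines = csv_to_json_lines(FLAT_CSV)
for line in json_lines:
    print(line)
```

If these lines are saved to a file, reading it with `spark.read.schema(schema).json(path)` and the schema from the question should yield a populated data column. Another option that keeps the flat CSV as the test file is to read it with a flat schema and build the array in Spark itself, for example with `array(struct(...))` from `pyspark.sql.functions`.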
【Discussion】:
Tags: csv apache-spark