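【Posted】:2021-04-01 12:42:06
【Question】:
I'm new here and am working with a large dataset of about 20 GB (made up of many small files). I need help transforming the data into the format shown below.
My data is currently in this format: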
+------+--------------------------+---------------+----------+------+
| id   | values                   | creation date | leadTime | span |
+------+--------------------------+---------------+----------+------+
| id_1 | [[v1, 0.368], [v2, 0.5]] | 2020-07-15    | 16       | 15   |
| id_2 | [[v1, 0.368], [v2, 0.4]] | 2020-07-15    | 16       | 15   |
| id_1 | [[v1, 0.468], [v2, 0.3]] | 2020-07-15    | 17       | 18   |
| id_2 | [[v1, 0.368], [v2, 0.3]] | 2020-07-15    | 17       | 18   |
+------+--------------------------+---------------+----------+------+
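I need the data in the following format, with the new columns built from values in the existing columns.
Each new column name is composed from the key in values plus the leadTime and span column values: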
+------+---------------+-------------------+-------------------+-------------------+-------------------+
| id   | creation date | final_v1_16_15_wk | final_v2_16_15_wk | final_v1_17_18_wk | final_v2_17_18_wk |
+------+---------------+-------------------+-------------------+-------------------+-------------------+
| id_1 | 2020-07-15    | 0.368             | 0.5               | 0.468             | 0.3               |
| id_2 | 2020-07-15    | 0.368             | 0.4               | 0.368             | 0.3               |
+------+---------------+-------------------+-------------------+-------------------+-------------------+
Here is a sample DataFrame:
val df = Seq(
("id_1", Map("v1" -> 0.368, "v2" -> 0.5, "v3" -> 0.6), "2020-07-15", 16, 15),
("id_1", Map("v1" -> 0.564, "v2" -> 0.78, "v3" -> 0.65), "2020-07-15", 17, 18),
("id_2", Map("v1" -> 0.468, "v2" -> 0.3, "v3" -> 0.66), "2020-07-15", 16, 15),
("id_2", Map("v1" -> 0.657, "v2" -> 0.65, "v3" -> 0.67), "2020-07-15", 17, 18)).toDF("id", "values", "creation date", "leadTime", "span")
I tried to generate the column names and values with the following logic, but it did not work:
val modDF = finalDF.withColumn("final_" + newFinalDF("values").getItem(0).getItem("_1") + "_" + newFinalDF("leadTime") + "_" + newFinalDF("span") + "_wk", $"values".getItem(0).getItem("_2"));
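One likely reason the attempt above fails: `withColumn` takes a plain `String` as the column name, so concatenating `Column` objects with `+` stringifies the expressions themselves rather than their per-row values. Column names have to be known when the query is planned. A possible way around this, sketched below, is to explode the map into rows, build the target name as an ordinary string column, and then pivot on it. This is only a sketch under some assumptions: `values` is a `Map[String, Double]` as in the sample DataFrame, the rows mirror the tables in the question rather than the sample data, and the names `longDF`, `named`, `colName`, and `wide` are made up for illustration. It is written script-style, for example for pasting into `spark-shell`.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, concat, concat_ws, explode, first, lit}

// Local session for experimentation; in spark-shell this returns the existing session.
val spark = SparkSession.builder().master("local[*]").appName("pivot-sketch").getOrCreate()
import spark.implicits._

// Rows taken from the tables in the question (assumed data, v1 and v2 only).
val df = Seq(
  ("id_1", Map("v1" -> 0.368, "v2" -> 0.5), "2020-07-15", 16, 15),
  ("id_2", Map("v1" -> 0.368, "v2" -> 0.4), "2020-07-15", 16, 15),
  ("id_1", Map("v1" -> 0.468, "v2" -> 0.3), "2020-07-15", 17, 18),
  ("id_2", Map("v1" -> 0.368, "v2" -> 0.3), "2020-07-15", 17, 18)
).toDF("id", "values", "creation date", "leadTime", "span")

// Exploding a map column yields one row per entry, in columns "key" and "value".
val longDF = df.select(
  col("id"), col("creation date"), col("leadTime"), col("span"),
  explode(col("values"))
)

// Build the future column name per row, e.g. "final_v1_16_15_wk".
val named = longDF.withColumn(
  "colName",
  concat(
    lit("final_"),
    concat_ws("_", col("key"), col("leadTime").cast("string"), col("span").cast("string")),
    lit("_wk")
  )
)

// Pivot: each distinct colName becomes a column holding the matching value.
val wide = named
  .groupBy("id", "creation date")
  .pivot("colName")
  .agg(first("value"))

wide.show(false)
```

When `pivot` is called without an explicit list of values, Spark first runs a job to collect the distinct values of the pivot column. On a dataset of around 20 GB it may be worth computing those names once and passing them as the second argument to `pivot`, which also fixes the column order.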
【Discussion】:
Tags: scala apache-spark apache-spark-sql