【发布时间】:2021-10-26 13:33:51
【问题描述】:
我有 2 个列数不同的镶木地板文件,并尝试将它们与以下代码 sn-p 合并
Dataset<Row> dataSetParquet1 = testSparkSession.read().option("mergeSchema",true).parquet("D:\\ABC\\abc.parquet");
Dataset<Row> dataSetParquet2 = testSparkSession.read().option("mergeSchema",true).parquet("D:\\EFG\\efg.parquet");
dataSetParquet1.unionByName(dataSetParquet2);
// dataSetParquet1.union(dataSetParquet2);
对于unionByName() 我得到了错误:
Caused by: org.apache.spark.sql.AnalysisException: Cannot resolve column name
对于union(),我得到了错误:
Caused by: org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the same number of columns, but the first table has 7 columns and the second table has 6 columns;;
如何在 java 中使用 spark 合并这些文件?
更新:示例
数据集 1:
epochMillis | one | two | three| four
--------------------------------------
1630670242000 | 1 | 2 | 3 | 4
1630670244000 | 1 | 2 | 3 | 4
1630670246000 | 1 | 2 | 3 | 4
数据集2:
epochMillis | one | two | three|five
---------------------------------------
1630670242000 | 11 | 22 | 33 | 55
1630670244000 | 11 | 22 | 33 | 55
1630670248000 | 11 | 22 | 33 | 55
合并后的最终数据集:
epochMillis | one | two | three|four |five
--------------------------------------------
1630670242000 | 11 | 22 | 33 |4 |55
1630670244000 | 11 | 22 | 33 |4 |55
1630670246000 | 1 | 2 | 3 |4 |null
1630670248000 | 11 | 22 | 33 |null |55
如何获得合并两个数据集的结果?
【问题讨论】:
标签: java apache-spark apache-spark-sql