【Posted】: 2018-06-07 14:02:33
【Question】:
My requirement is to write only the header record to a CSV file using a Spark Scala DataFrame. Can anyone help me with this?
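val OHead1 = "/xxxxx/xxxx/xxxx/xxx/OHead1/"
// sparkFile is the SparkSession; parallelize is only available on its SparkContext
val sc = sparkFile.sparkContext
// Despite its name, outDF holds the schema (a StructType) of the selected columns
val outDF = csvDF.select("col_01", "col_02", "col_03").schema
// Write the tab-delimited header line as a single part file
sc.parallelize(Seq(outDF.fieldNames.mkString("\t"))).coalesce(1).saveAsTextFile(OHead1)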
The code above works and creates the header in the CSV with a tab delimiter. Since I am using a SparkSession, I obtain the SparkContext from it in the second line. outDF holds the schema of csvDF, my DataFrame, which was created before these statements.
Two things are outstanding; can one of you help me?
1. The working code above does not overwrite existing files, so I have to delete them manually every time. I could not find an overwrite option. Can you help me?
2. Since I call select and then schema, will this be considered an action and start another lineage for this statement? If so, it would degrade performance.
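Both points can be illustrated with a minimal sketch, offered under assumptions and not as the only approach. It swaps the RDD `saveAsTextFile` call for the DataFrame writer, which accepts `SaveMode.Overwrite`. The object name `HeaderOnlyWriter` and the helpers `headerLine` and `writeHeader` are hypothetical names introduced here for illustration. On the second point, `select` is a lazy transformation and `columns` (like `schema`) only reads the analyzed schema, so no job should run on the source DataFrame.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

// Hypothetical helper object, not part of the original post.
object HeaderOnlyWriter {
  // Build the header line from column names; pure string logic, no Spark job.
  def headerLine(columns: Seq[String], sep: String): String =
    columns.mkString(sep)

  // Write a text output that contains only the header line.
  // SaveMode.Overwrite replaces an existing output directory, so no manual delete is needed.
  def writeHeader(spark: SparkSession, df: DataFrame, path: String, sep: String = "\t"): Unit = {
    import spark.implicits._
    // df.columns reads the schema only; it does not scan the data behind df.
    Seq(headerLine(df.columns.toSeq, sep))
      .toDF("value")   // DataFrameWriter.text expects a single string column
      .coalesce(1)     // keep the output in one part file
      .write
      .mode(SaveMode.Overwrite)
      .text(path)
  }
}
```

With the names from the question, a call could look like `HeaderOnlyWriter.writeHeader(sparkFile, csvDF.select("col_01", "col_02", "col_03"), OHead1)`. As with `saveAsTextFile`, the output path is a directory that holds a part file, not a single named file.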
【Comments】:
-
sc.parallelize(Seq(df.columns.mkString(","))).saveAsTextFile -
I am using SparkSession, so I get an error that parallelize is not a member. Also, I am selecting only 3 of the DataFrame's 10 columns as output.
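On the "not a member" error: `parallelize` is defined on `SparkContext`, not on `SparkSession`, so the comment's suggestion needs the context taken from the session first. The sketch below assumes a local session and a made-up four-column DataFrame standing in for the real one. `HeaderViaRdd` and `selectedHeader` are hypothetical names, and `selectedHeader` assumes a non-empty column list.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical example object, not part of the original post.
object HeaderViaRdd {
  // Header for a subset of columns; select is lazy and columns reads only the schema.
  def selectedHeader(df: DataFrame, cols: Seq[String], sep: String): String =
    df.select(cols.head, cols.tail: _*).columns.mkString(sep)

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; a real job would reuse its existing session.
    val spark = SparkSession.builder().master("local[1]").appName("header-only").getOrCreate()
    import spark.implicits._

    // Stand-in for a DataFrame that has more columns than the output needs.
    val df = Seq((1, "a", 2.0, "x")).toDF("col_01", "col_02", "col_03", "col_04")
    val header = selectedHeader(df, Seq("col_01", "col_02", "col_03"), ",")

    // parallelize is reached through the session's SparkContext.
    val headerRdd = spark.sparkContext.parallelize(Seq(header), numSlices = 1)
    headerRdd.collect().foreach(println)

    spark.stop()
  }
}
```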
Tags: scala apache-spark apache-spark-sql