【发布时间】:2020-07-18 16:16:45
【问题描述】:
我想将以下数据框中的所有 n/a 值替换为 unknown。
它可以是scalar 或complex nested column。
如果是StructField column,我可以遍历列并使用WithColumn 替换n\a。
但我希望这可以在generic way 中完成,尽管该列的type
因为我不想明确指定列名,因为在我的情况下有 100 个?
case class Bar(x: Int, y: String, z: String)
case class Foo(id: Int, name: String, status: String, bar: Seq[Bar])
val df = spark.sparkContext.parallelize(
Seq(
Foo(123, "Amy", "Active", Seq(Bar(1, "first", "n/a"))),
Foo(234, "Rick", "n/a", Seq(Bar(2, "second", "fifth"),Bar(22, "second", "n/a"))),
Foo(567, "Tom", "null", Seq(Bar(3, "second", "sixth")))
)).toDF
df.printSchema
df.show(20, false)
结果:
+---+----+------+---------------------------------------+
|id |name|status|bar |
+---+----+------+---------------------------------------+
|123|Amy |Active|[[1, first, n/a]] |
|234|Rick|n/a |[[2, second, fifth], [22, second, n/a]]|
|567|Tom |null |[[3, second, sixth]] |
+---+----+------+---------------------------------------+
预期输出:
+---+----+----------+---------------------------------------------------+
|id |name|status |bar |
+---+----+----------+---------------------------------------------------+
|123|Amy |Active |[[1, first, unknown]] |
|234|Rick|unknown |[[2, second, fifth], [22, second, unknown]] |
|567|Tom |null |[[3, second, sixth]] |
+---+----+----------+---------------------------------------------------+
对此有何建议?
【问题讨论】:
标签: scala apache-spark apache-spark-sql