【发布时间】:2017-08-31 16:58:55
【问题描述】:
我有一个RDD[(Seq[String], Seq[String])],数据中有一些空值。
转成dataframe的RDD是这样的
+----------+----------+
| col1| col2|
+----------+----------+
|[111, aaa]|[xx, null]|
+----------+----------+
以下是示例代码:
val rdd = sc.parallelize(Seq((Seq("111","aaa"),Seq("xx",null))))
val df = rdd.toDF("col1","col2")
val keys = Array("col1","col2")
val values = df.flatMap {
case Row(t1: Seq[String], t2: Seq[String]) => Some((t1 zip t2).toMap)
case Row(_, null) => None
}
val transposed = values.map(someFunc(keys))
val schema = StructType(keys.map(name => StructField(name, DataTypes.StringType, nullable = true)))
val transposedDf = sc.createDataFrame(transposed, schema)
transposed.show()
在我创建 transposedDF 之前它运行良好,但是一旦我点击显示它就会引发以下错误:
scala.MatchError: null
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:295)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:294)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:97)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:401)
at org.apache.spark.sql.SQLContext$$anonfun$6.apply(SQLContext.scala:492)
at org.apache.spark.sql.SQLContext$$anonfun$6.apply(SQLContext.scala:492)
如果 rdd 中没有空值,则代码可以正常工作。我不明白为什么当我有任何空值时它会失败,因为我将 StringType 的模式指定为 nullable 为真。难道我做错了什么?我正在使用 spark 1.6.1 和 scala 2.10
【问题讨论】:
标签: scala apache-spark