我不确定我的建议是否可以帮助你,虽然我有类似的情况,我解决了如下:
1) 所以这个想法是使用 json rapture(或其他一些 json 库)来
动态加载 JSON 模式。例如,您可以阅读第一个
json文件的行以发现架构(类似于我所做的
这里有 jsonSchema)
2) 动态生成模式。首先遍历动态
字段(请注意,我将 key3 的值投影为 Map[String, String])
并为它们中的每一个添加一个 StructField 到架构
3) 将生成的架构应用到您的数据帧中
import rapture.json._
import jsonBackends.jackson._
val jsonSchema = """{"key1":"val1","key2":"source1","key3":{"key3_k1":"key3_v1", "key3_k2":"key3_v2", "key3_k3":"key3_v3"}}"""
val json = Json.parse(jsonSchema)
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.types.StructField
import org.apache.spark.sql.types.{StringType, StructType}
val schema = ArrayBuffer[StructField]()
//we could do this dynamic as well with json rapture
schema.appendAll(List(StructField("key1", StringType), StructField("key2", StringType)))
val items = ArrayBuffer[StructField]()
json.key3.as[Map[String, String]].foreach{
case(k, v) => {
items.append(StructField(k, StringType))
}
}
val complexColumn = new StructType(items.toArray)
schema.append(StructField("key3", complexColumn))
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
val sparkConf = new SparkConf().setAppName("dynamic-json-schema").setMaster("local")
val spark = SparkSession.builder().config(sparkConf).getOrCreate()
val jsonDF = spark.read.schema(StructType(schema.toList)).json("""your_path\data.json""")
jsonDF.select("key1", "key2", "key3.key3_k1", "key3.key3_k2", "key3.key3_k3").show()
我使用下一个数据作为输入:
{"key1":"val1","key2":"source1","key3":{"key3_k1":"key3_v11", "key3_k2":"key3_v21", "key3_k3":"key3_v31"}}
{"key1":"val2","key2":"source2","key3":{"key3_k1":"key3_v12", "key3_k2":"key3_v22", "key3_k3":"key3_v32"}}
{"key1":"val3","key2":"source3","key3":{"key3_k1":"key3_v13", "key3_k2":"key3_v23", "key3_k3":"key3_v33"}}
还有输出:
+----+-------+--------+--------+--------+
|key1| key2| key3_k1| key3_k2| key3_k3|
+----+-------+--------+--------+--------+
|val1|source1|key3_v11|key3_v21|key3_v31|
|val2|source2|key3_v12|key3_v22|key3_v32|
|val2|source3|key3_v13|key3_v23|key3_v33|
+----+-------+--------+--------+--------+
我还没有测试过的一个高级替代方案是从 JSON 模式生成一个名为 JsonRow 的案例类,以便拥有一个强类型数据集,它提供更好的序列化性能,除了让你的代码更多可维护。要完成这项工作,您首先需要创建一个 JsonRow.scala 文件,然后您应该实现一个 sbt 预构建脚本,该脚本将根据您的源文件动态修改 JsonRow.scala 的内容(当然您可能有多个)。要动态生成 JsonRow 类,您可以使用以下代码:
def generateClass(members: Map[String, String], name: String) : Any = {
val classMembers = for (m <- members) yield {
s"${m._1}: String"
}
val classDef = s"""case class ${name}(${classMembers.mkString(",")});scala.reflect.classTag[${name}].runtimeClass"""
classDef
}
generateClass 方法接受一个字符串映射来创建类成员和类名本身。生成的类的成员你可以再次从你的 json 模式中填充它们:
import org.codehaus.jackson.node.{ObjectNode, TextNode}
import collection.JavaConversions._
val mapping = collection.mutable.Map[String, String]()
val fields = json.$root.value.asInstanceOf[ObjectNode].getFields
for (f <- fields) {
(f.getKey, f.getValue) match {
case (k: String, v: TextNode) => mapping(k) = v.asText
case (k: String, v: ObjectNode) => v.getFields.foreach(f => mapping(f.getKey) = f.getValue.asText)
case _ => None
}
}
val dynClass = generateClass(mapping.toMap, "JsonRow")
println(dynClass)
打印出来:
case class JsonRow(key3_k2: String,key3_k1: String,key1: String,key2: String,key3_k3: String);scala.reflect.classTag[JsonRow].runtimeClass
祝你好运