Is it possible to create a dataframe column with JSON data that doesn't have a fixed schema? Answers

【Question title】: Is it possible to create a dataframe column with json data which doesn't have a fixed schema?
【Posted】: 2020-07-23 23:56:44
【Question description】:

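I am trying to create a dataframe column from JSON data that doesn't have a fixed schema. I am trying to write it in its original form as a map/object, but I get various errors.

I don't want to convert it to a string, because I need to write this data to a file in its original form.

The file is later used for JSON processing, so the original structure must not be broken.

Currently, when I write the data to a file, it contains all the escape characters, and the whole JSON is treated as a string rather than a complex type. For example: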

    {"field1":"d1","field2":"app","value":"{\"data\":\"{\\\"app\\\":\\\"am\\\"}\"}"}

【Question discussion】:

Tags: json scala apache-spark apache-spark-sql databricks

【Solution 1】:

You can try building a schema for the JSON file.

I don't know what output you expect.

As a pointer, here is an example and two useful links:

spark-read-json-with-schema

spark-schema-explained-with-examples

    import org.apache.log4j.{Level, Logger}
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types.{StringType, StructType}

    object RareJson {
      val spark = SparkSession
        .builder()
        .appName("RareJson")
        .master("local[*]")
        .config("spark.sql.shuffle.partitions", "4") // Change to a more reasonable default number of partitions for our data
        .config("spark.app.id", "RareJson") // To silence Metrics warning
        .getOrCreate()

      val sc = spark.sparkContext

      val sqlContext = spark.sqlContext

      val input = "/home/cloudera/files/tests/rare.json"

      def main(args: Array[String]): Unit = {

        Logger.getRootLogger.setLevel(Level.ERROR)

        try {
          // Explicit schema: `value` is read as a plain (nullable) string
          val structureSchema = new StructType()
            .add("field1", StringType)
            .add("field2", StringType)
            .add("value", StringType, true)

          val rareJson = sqlContext
            .read
            .option("allowBackslashEscapingAnyCharacter", true)
            .option("allowUnquotedFieldNames", true)
            .option("multiLine", true)
            .option("mode", "DROPMALFORMED")
            .schema(structureSchema)
            .json(input)

          rareJson.show(truncate = false)

          // To have the opportunity to view the web console of Spark: http://localhost:4041/
          println("Type whatever to the console to exit......")
          scala.io.StdIn.readLine()
        } finally {
          sc.stop()
          println("SparkContext stopped")
          spark.stop()
          println("SparkSession stopped")
        }
      }
    }

Output:

    +------+------+---------------------------+
    |field1|field2|value                      |
    +------+------+---------------------------+
    |d1    |app   |{"data":"{\"app\":\"am\"}"}|
    +------+------+---------------------------+

If the value column keeps the same format across all rows, you can also try parsing it.
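As a rough sketch of what that parsing could look like, the snippet below uses `from_json` twice: once for `value` and once for the JSON string nested in `value.data`. It only applies when every row has the same shape. The schemas `valueSchema` and `dataSchema`, the object name `ParseValueColumn` and the helper `parseValue` are assumptions made for illustration, based solely on the single sample row from the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, from_json, struct}
import org.apache.spark.sql.types.{StringType, StructType}

object ParseValueColumn {
  // Assumed shape of `value`: one field `data` holding another JSON string.
  val valueSchema: StructType = new StructType().add("data", StringType)
  // Assumed shape of that inner string: one field `app`.
  val dataSchema: StructType = new StructType().add("app", StringType)

  // Hypothetical helper: turns the string column `value` into a nested struct.
  def parseValue(df: DataFrame): DataFrame =
    df
      // First pass: the outer JSON string becomes a struct with a string field `data`.
      .withColumn("value", from_json(col("value"), valueSchema))
      // Second pass: parse `data` too and rebuild `value` around the parsed struct.
      .withColumn("value", struct(from_json(col("value.data"), dataSchema).as("data")))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ParseValueColumn")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // One row shaped like the output above; triple quotes keep the backslashes literal.
    val df = Seq(
      ("d1", "app", """{"data":"{\"app\":\"am\"}"}""")
    ).toDF("field1", "field2", "value")

    val parsed = parseValue(df)
    parsed.printSchema()

    // Serialized back to JSON, `value` is a nested object instead of an escaped string.
    parsed.toJSON.show(truncate = false)

    spark.stop()
  }
}
```

Rows whose `value` does not match the assumed schemas come back as null fields rather than failing, so this does not cover the fully schema-less case described in the question.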

【Discussion】: