Can we automate Spark SQL query generation from an AVRO schema?

【Question title】: Can we automate Spark SQL query generation from an AVRO schema?
【Posted】: 2020-07-12 16:26:47
【Question】:

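I am working on a project that has to process a large number of AVRO files every day. To extract data from the AVRO files I use Spark SQL. To do that, I first have to call printSchema and then pick the fields I want in order to view the data. I want to automate this process: given any input AVRO, I want to write a script that automatically generates the Spark SQL query (taking into account the structs and arrays in the avsc file). I can write the script in Java or Python.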

-- Sample input AVRO

root
|-- identifier: struct (nullable = true)
|    |-- domain: string (nullable = true)
|    |-- id: string (nullable = true)
|    |-- version: long (nullable = true)
alternativeIdentifiers: array (nullable = true)
|    |    |-- element: struct (containsNull = true)
|    |    |    |-- identifier: struct (nullable = true)
|    |    |    |    |-- domain: string (nullable = true)
|    |    |    |    |-- id: string (nullable = true)

-- Output I am expecting

SELECT identifier.domain, identifier.id, identifier.version

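【Question comments】:

Could you add a sample input and your expected output?
Doesn't a simple select * from table work?
It works, but it is not what I want. Basically, select * from shows the flat fields as columns, and shows structs and arrays in array format, like [col a, col b, col c]. My goal is to generate the query automatically, or to get the field names together with their types and parent fields.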

Tags: apache-spark apache-spark-sql avro spark-avro

【Solution 1】:

You can use something like this to generate the list of columns from the schema:

  import org.apache.spark.sql.types.{StructField, StructType}

  // Recursively collects the dotted path of every leaf field under f.
  // Only structs are expanded; arrays and maps come back as a single column.
  def getStructFieldName(f: StructField, baseName: String = ""): Seq[String] = {
    val bname = if (baseName.isEmpty) "" else baseName + "."
    f.dataType match {
      case StructType(s) =>
        s.flatMap(x => getStructFieldName(x, bname + f.name))
      case _ => Seq(bname + f.name)
    }
  }
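
To see what the helper returns without starting a Spark session, here is a minimal sketch that can be pasted into spark-shell. The schema is built by hand and is only assumed to mirror the printSchema output in the question; `sampleSchema` and `sampleCols` are illustrative names.

```scala
import org.apache.spark.sql.types._

// Same helper as above, repeated so this snippet compiles on its own.
def getStructFieldName(f: StructField, baseName: String = ""): Seq[String] = {
  val bname = if (baseName.isEmpty) "" else baseName + "."
  f.dataType match {
    case StructType(s) => s.toSeq.flatMap(x => getStructFieldName(x, bname + f.name))
    case _             => Seq(bname + f.name)
  }
}

// Hand-built schema (assumption: it matches the question's sample input).
val sampleSchema = StructType(Seq(
  StructField("identifier", StructType(Seq(
    StructField("domain", StringType),
    StructField("id", StringType),
    StructField("version", LongType)
  ))),
  StructField("alternativeIdentifiers", ArrayType(StructType(Seq(
    StructField("identifier", StructType(Seq(
      StructField("domain", StringType),
      StructField("id", StringType)
    )))
  ))))
))

val sampleCols = sampleSchema.flatMap(x => getStructFieldName(x))
sampleCols.foreach(println)
// identifier.domain
// identifier.id
// identifier.version
// alternativeIdentifiers
```

Note that the array column is returned as a single name, because the helper only descends into structs.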

Then you can use it on a real DataFrame, like this:

val data = spark.read.json("some_data.json")
val cols = data.schema.flatMap(x => getStructFieldName(x))

This gives us a sequence of strings that we can use to perform the select:

import org.apache.spark.sql.functions.col
data.select(cols.map(col): _*)

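Or we can generate a comma-separated list to use in spark.sql (the DataFrame has to be registered as a view first; the view name below is only an example):

data.createOrReplaceTempView("avro_table")
spark.sql(s"select ${cols.mkString(", ")} from avro_table")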
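
The question also mentions arrays, which the helper above leaves as single columns. The following is a rough sketch of one possible extension, not part of the original answer: `flattenFieldNames` and its `insideArray` flag are hypothetical names. It relies on the assumption that Spark resolves a dotted path through an array of structs to an array of that field's values, so each such column comes back as an array rather than as exploded rows.

```scala
import org.apache.spark.sql.types._

// Hypothetical variant that also walks array<struct> columns.
// insideArray stops the descent at an array nested inside another array,
// where plain dotted access would no longer resolve.
def flattenFieldNames(f: StructField, baseName: String = "", insideArray: Boolean = false): Seq[String] = {
  val path = if (baseName.isEmpty) f.name else baseName + "." + f.name
  f.dataType match {
    case StructType(fields) =>
      fields.toSeq.flatMap(x => flattenFieldNames(x, path, insideArray))
    case ArrayType(StructType(fields), _) if !insideArray =>
      fields.toSeq.flatMap(x => flattenFieldNames(x, path, insideArray = true))
    case _ => Seq(path)
  }
}

// Small hand-built schema with one array of structs (illustrative only).
val arraySchema = StructType(Seq(
  StructField("identifier", StructType(Seq(
    StructField("domain", StringType),
    StructField("id", StringType)
  ))),
  StructField("alternativeIdentifiers", ArrayType(StructType(Seq(
    StructField("identifier", StructType(Seq(
      StructField("domain", StringType),
      StructField("id", StringType)
    )))
  ))))
))

val arrayCols = arraySchema.flatMap(x => flattenFieldNames(x))
arrayCols.foreach(println)
// identifier.domain
// identifier.id
// alternativeIdentifiers.identifier.domain
// alternativeIdentifiers.identifier.id
```

If one row per array element is needed instead, the array would have to be exploded first, which this sketch does not attempt.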

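【Comments】:

Hi Alex, thanks for the reply. Sorry, this may be a trivial question: I am a bit confused about where in the code I need to pass my avro schema file.
@Teja In the data.schema.flatMap part.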
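
To make that reply concrete, here is a hedged sketch of getting the schema from an avsc definition instead of from a DataFrame. It assumes the spark-avro module (Spark 2.4 or later) is on the classpath; the inline schema, the `avsc` value and the `avro_table` view name are made up for illustration. With real data files, `spark.read.format("avro").load(...)` followed by `.schema` should give the same kind of StructType.

```scala
import org.apache.avro.Schema
import org.apache.spark.sql.avro.SchemaConverters
import org.apache.spark.sql.types.{StructField, StructType}

// Same helper as in the answer, repeated so this snippet compiles on its own.
def getStructFieldName(f: StructField, baseName: String = ""): Seq[String] = {
  val bname = if (baseName.isEmpty) "" else baseName + "."
  f.dataType match {
    case StructType(s) => s.toSeq.flatMap(x => getStructFieldName(x, bname + f.name))
    case _             => Seq(bname + f.name)
  }
}

// Inline stand-in for the contents of a .avsc file (hypothetical schema).
val avsc =
  """{"type":"record","name":"Sample","fields":[
    |  {"name":"identifier","type":{"type":"record","name":"Identifier","fields":[
    |    {"name":"domain","type":"string"},
    |    {"name":"id","type":"string"},
    |    {"name":"version","type":"long"}]}}
    |]}""".stripMargin

val avroSchema = new Schema.Parser().parse(avsc)
// toSqlType returns a SchemaType(dataType, nullable); a top-level record maps to a StructType.
val sparkSchema = SchemaConverters.toSqlType(avroSchema).dataType.asInstanceOf[StructType]

val avscCols = sparkSchema.flatMap(x => getStructFieldName(x))
val query = s"select ${avscCols.mkString(", ")} from avro_table"
println(query)
// select identifier.domain, identifier.id, identifier.version from avro_table
```

To parse an actual file, `new Schema.Parser().parse(new java.io.File(...))` can be used in place of the inline string.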