Parsing hierarchical JSON to a DataFrame in Spark: answers

【Question title】: Parsing hierarchical json to dataFrame in spark
【Posted】: 2016-11-16 04:19:47
【Question description】:

I have a JSON file stored in HDFS, and I am trying to read it in my Spark context. The file has the following format:

  {"Request": {"TrancheList": {"Tranche": [{"Id": "123","OwnedAmt": "26500000",    "Currency": "USD" }, {  "Id": "456", "OwnedAmt": "41000000","Currency": "USD"}]},"FxRatesList": {"FxRatesContract": [{"Currency": "CHF","FxRate": "0.97919983706115"},{"Currency": "AUD", "FxRate": "1.2966804979253"},{ "Currency": "USD","FxRate": "1"},{"Currency": "SEK","FxRate": "8.1561012531034"},{"Currency": "NOK", "FxRate": "8.2454981641398"}]},"isExcludeDeals": "true","baseCurrency": "USD"}}

    val inputdf = spark.read.json("hdfs://localhost/user/xyz/request.json")
    inputdf.printSchema

printSchema shows the following output:

root
 |-- Request: struct (nullable = true)
 |    |-- FxRatesList: struct (nullable = true)
 |    |    |-- FxRatesContract: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- Currency: string (nullable = true)
 |    |    |    |    |-- FxRate: string (nullable = true)
 |    |-- TrancheList: struct (nullable = true)
 |    |    |-- Tranche: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- Currency: string (nullable = true)
 |    |    |    |    |-- OwnedAmt: string (nullable = true)
 |    |    |    |    |-- Id: string (nullable = true)
 |    |-- baseCurrency: string (nullable = true)
 |    |-- isExcludeDeals: string (nullable = true)

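What would be the best way to create a DataFrame/RDD from the TrancheList section of the JSON, so that it gives me a distinct list of Ids with their OwnedAmt and Currency, as in the table below?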

    Id       OwnedAmt       Currency
    123      26500000       USD
    456      41000000       USD

Any help would be great. Thanks.

【Discussion】:

Tags: apache-spark dataframe rdd

【Solution 1】:

Here is another way to get this data.

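import org.apache.spark.sql.functions.explode

// Keep only the nested Tranche array, then explode it into one row per array element.
val inputdf = spark.read.json("hdfs://localhost/user/xyz/request.json").select("Request.TrancheList.Tranche")
val dataDF = inputdf.select(explode(inputdf("Tranche"))).toDF("Tranche").select("Tranche.Id", "Tranche.OwnedAmt", "Tranche.Currency")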
dataDF.show

【Discussion】:

【Solution 2】:

You should be able to access columns in the DataFrame hierarchy using dot notation.

In this case, the query would look something like:

// Spark 2.0 example; use registerTempTable for Spark 1.6
inputdf.createOrReplaceTempView("inputdf")

spark.sql("select Request.TrancheList.Tranche.Id, Request.TrancheList.Tranche.OwnedAmt, Request.TrancheList.Tranche.Currency from inputdf")
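When a dot-notation path crosses an array (here `Tranche`), Spark returns one array per selected field rather than one row per element. The following is a minimal sketch of one way to get a row per tranche from SQL, using `LATERAL VIEW explode`. It assumes Spark 2.2 or later (for `spark.read.json` on a `Dataset[String]`), a local session, and a shortened inline copy of the sample JSON in place of the HDFS file; the aliases `exploded` and `t` are illustrative names, not part of the original answer.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; in spark-shell, `spark` already exists
// and getOrCreate() simply returns it.
val spark = SparkSession.builder().master("local[*]").appName("tranche-rows").getOrCreate()
import spark.implicits._

// Assumption: a shortened inline copy of the sample document stands in for the HDFS file.
val json = """{"Request": {"TrancheList": {"Tranche": [{"Id": "123", "OwnedAmt": "26500000", "Currency": "USD"}, {"Id": "456", "OwnedAmt": "41000000", "Currency": "USD"}]}, "baseCurrency": "USD"}}"""

val inputdf = spark.read.json(Seq(json).toDS())
inputdf.createOrReplaceTempView("inputdf")

// explode() emits one row per element of the Tranche array;
// each element is a struct, so its fields are reachable as t.<field>.
val rows = spark.sql(
  """SELECT t.Id, t.OwnedAmt, t.Currency
    |FROM inputdf
    |LATERAL VIEW explode(Request.TrancheList.Tranche) exploded AS t""".stripMargin)

rows.show()
```

If duplicate tranches are possible, `rows.distinct()` should reduce the result to unique rows.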

【Discussion】:

No problem at all - glad to help! :)
It gives me output like Id|OwnedAmt|Currency| |[123, 456]|[26500000, 41000000]|[USD, USD]. How can I get one row per Id, like a table structure?
Sure, I created stackoverflow.com/questions/40661859/… for my current question.