【问题标题】:Loading JSON to Spark SQL将 JSON 加载到 Spark SQL
【发布时间】:2020-03-06 16:19:10
【问题描述】:

我正在使用 v2.1 中的 Spark SQL 自学 JSON,并使用链接中的数据

https://catalog.data.gov/dataset/air-quality-measures-on-the-national-environmental-health-tracking-network

我的问题是当我使用时:

val lines = spark.read
.option("multiLine", true).option("mode", "PERMISSIVE")
.json("E:/VW/meta_plus_sample_Data.json")`

我让 Spark SQL 将所有数据作为一行返回。

+--------------------+--------------------+
|                data|                meta|
+--------------------+--------------------+
|[[row-8eh8_xxkx-u...|[[[[1439474950, t...|
+--------------------+--------------------+`

当我删除时:

.option("multiLine", true).option("mode", "PERMISSIVE")

我得到一个错误

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Exception in thread "main" org.apache.spark.sql.AnalysisException: Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the
referenced columns only include the internal corrupt record column
(named _corrupt_record by default). For example:
spark.read.schema(schema).json(file).filter($"_corrupt_record".isNotNull).count()
and spark.read.schema(schema).json(file).select("_corrupt_record").show().
Instead, you can cache or save the parsed results and then send the same query.
For example, val df = spark.read.schema(schema).json(file).cache() and then
df.filter($"_corrupt_record".isNotNull).count().;

是否有一个选项可以在 Spark SQL 中将文件中的每条记录作为表中的一行来实现?

【问题讨论】:

    标签: scala apache-spark apache-spark-sql


    【解决方案1】:

    这是预期行为,因为我们只有 one record(在问题中提供的链接中)具有元(对象)和数据(数组)。

    由于one json记录是多行的,所以我们需要包含multiLine选项。

    spark.read.option("multiLine",true).option("mode","PERMISSIVE").json("tmp.json").show()
    
    //sample data
    //+--------------------+--------------------+
    //|                data|                meta|
    //+--------------------+--------------------+
    //|[[row-8eh8_xxkx-u...|[[[[1439474950, t...|
    //+--------------------+--------------------+
    
    //access meta struct columns
    
    df.select("meta.view.*").show()
    //+--------------------+-------------+--------------------+--------------------+----------+--------------------+-----------+-------------+--------------------+--------------------+---------------+----------------+---------+--------------+--------------------+--------------------+----------+----------------+--------+--------------------+----------+------------------------+---------------+----------------+----------------+--------------------+------+--------+-------------+-------------+--------------------+-------+--------------------+---------------+---------+----------------+--------+
    //|           approvals|averageRating|            category|             columns| createdAt|         description|displayType|downloadCount|               flags|              grants|hideFromCatalog|hideFromDataJson|       id|indexUpdatedAt|            metadata|                name|newBackend|numberOfComments|     oid|               owner|provenance|publicationAppendEnabled|publicationDate|publicationGroup|publicationStage|               query|rights|rowClass|rowsUpdatedAt|rowsUpdatedBy|         tableAuthor|tableId|                tags|totalTimesRated|viewCount|viewLastModified|viewType|
    //+--------------------+-------------+--------------------+--------------------+----------+--------------------+-----------+-------------+--------------------+--------------------+---------------+----------------+---------+--------------+--------------------+--------------------+----------+----------------+--------+--------------------+----------+------------------------+---------------+----------------+----------------+--------------------+------+--------+-------------+-------------+--------------------+-------+--------------------+---------------+---------+----------------+--------+
    //|[[1439474950, tru...|            0|Environmental Hea...|[[, meta_data,, :...|1439381433|The Environmental...|      table|        26159|[default, restora...|[[[public], false...|          false|           false|cjae-szjv|    1528204279|[[table, fatrow, ...|Air Quality Measu...|      true|               0|12801487|[Tracking, 94g5-7...|  official|                   false|     1439474950|         3957835|       published|[[[true, [2171820...|[read]|        |   1439402317|    94g5-7as2|[Tracking, 94g5-7...|3960642|[environmental ha...|              0|     3843|      1528203875| tabular|
    //+--------------------+-------------+--------------------+--------------------+----------+--------------------+-----------+-------------+--------------------+--------------------+---------------+----------------+---------+--------------+--------------------+--------------------+----------+----------------+--------+--------------------+----------+------------------------+---------------+----------------+----------------+--------------------+------+--------+-------------+-------------+--------------------+-------+--------------------+---------------+---------+----------------+--------+
    
    //to access data array we need to explode
    df.selectExpr("explode(data)").show()
    //+--------------------+
    //|                 col|
    //+--------------------+
    //|[row-8eh8_xxkx-u3...|
    //|[row-u2v5_78j5-px...|
    //|[row-68zj_7qfn-sx...|
    //|[row-8b4d~zt5j~da...|
    //|[row-5gee.63td_z6...|
    //|[row-tzyx.ssxh_pz...|
    //|[row-3yj2_u42c_mr...|
    //|[row-va7z.p2v8.7p...|
    //|[row-r7kk_e3dm-z2...|
    //|[row-bnrc~w34s-4a...|
    //|[row-ezrk~m5dc_5n...|
    //|[row-nyya.dvnz~c6...|
    //|[row-dq3i_wt6d_c6...|
    //|[row-u6rc-k3mf-cn...|
    //|[row-t9c6-4d4b_r6...|
    //|[row-vq6r~mxzj-e6...|
    //|[row-vxqn-mrpc~5b...|
    //|[row-3akn_5nzm~8v...|
    //|[row-ugxn~bhax.a2...|
    //|[row-ieav.mdz9-m8...|
    //+--------------------+
    

    Load multiple json records:

    //json array with two records
    spark.read.json(Seq(("""
    [{"id":1,"name":"a"},
    {"id":2,"name":"b"}]
    """)).toDS).show()
    
    //as we have 2 json objects and loaded as 2 rows
    //+---+----+
    //| id|name|
    //+---+----+
    //|  1|   a|
    //|  2|   b|
    //+---+----+
    

    【讨论】:

      猜你喜欢
      • 1970-01-01
      • 2016-10-11
      • 1970-01-01
      • 2015-10-02
      • 1970-01-01
      • 2016-12-29
      • 1970-01-01
      • 1970-01-01
      • 2017-09-04
      相关资源
      最近更新 更多