【Question Title】: Issues creating Spark table from externally partitioned data
【Posted】: 2020-06-11 22:09:35
【Question】:

CSV data is stored daily on AWS S3, laid out as follows:

/data/year=2020/month=5/day=5/<data-part-1.csv, data-part-2.csv,...data-part-K.csv>

The query I want to get working:

CREATE EXTERNAL TABLE {table_name} (data1 INT, data2 INT)
    PARTITIONED BY (year INT, month INT, day INT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' 
    STORED AS TEXTFILE LOCATION '{file_location}' 
    TBLPROPERTIES ('skip.header.line.count' = '1')

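Result: the table is empty. What I tried:

  • Specifying the location more precisely as ".../data/year=*/month=*/day=*" instead of ".../data/".

  • Running this command, as suggested elsewhere, without success:
    spark.sql("msck repair table database_name.table_name")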


The version below does load the data, but I need the year/month/day columns; the idea is to filter on them to make queries faster:

CREATE EXTERNAL TABLE {table_name} (data1 INT, data2 INT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' 
    STORED AS TEXTFILE LOCATION '{file_location}' 
    TBLPROPERTIES ('skip.header.line.count' = '1')

Result: the table loads as expected, but queries are slow.


This version also loads a table, but the year, month, and day columns are empty:

CREATE EXTERNAL TABLE {table_name} (data1 INT, data2 INT, year INT, month INT, day INT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' 
    STORED AS TEXTFILE LOCATION '{file_location}' 
    TBLPROPERTIES ('skip.header.line.count' = '1')

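Based on the documentation, I assumed the first query is the correct way to load this data. The resulting schema also looks correct, but I cannot get it to actually load any data.

Does anyone know what I am doing wrong?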

【Comments】:

  • Have you tried ALTER TABLE {table_name} RECOVER PARTITIONS? I would also expect the file location to be specified only up to the root directory of the partitions.
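
The suggestion in the comment above can be sketched as follows. This is a minimal sketch, not a confirmed fix. It assumes a SparkSession with Hive support, a table already created by the first (PARTITIONED BY) query with its LOCATION pointing at the partition root (".../data/"), and the placeholder name database_name.table_name from the question; the object name is hypothetical. When a table is created with PARTITIONED BY over existing data, the partitions are not registered in the metastore automatically, so the table can look empty until they are added.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical object name, used only for this illustration.
object RecoverPartitionsSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: Hive support is enabled, because the question uses
    // Hive-style DDL (CREATE EXTERNAL TABLE ... STORED AS TEXTFILE).
    val spark = SparkSession.builder()
      .appName("recover-partitions-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // Placeholder table name taken from the question.
    val tableName = "database_name.table_name"

    // Scan the table LOCATION for year=/month=/day= directories
    // and register each one as a partition in the metastore.
    spark.sql(s"ALTER TABLE $tableName RECOVER PARTITIONS")

    // Alternative with the same intent:
    // spark.sql(s"MSCK REPAIR TABLE $tableName")

    // Alternative for a single, known partition:
    // spark.sql(s"ALTER TABLE $tableName ADD IF NOT EXISTS PARTITION (year=2020, month=5, day=5)")

    // List what the metastore now knows about; an empty result here
    // would explain an empty table.
    spark.sql(s"SHOW PARTITIONS $tableName").show(false)

    spark.stop()
  }
}
```

If SHOW PARTITIONS still returns nothing after this, the LOCATION is worth double-checking: partition recovery looks for year=.../month=.../day=... directories directly under it, so a location that already contains wildcards or a partition directory would likely not match.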

Tags: apache-spark databricks


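【Solution 1】:

Check whether this helps.

Note that the SparkSession here is created without Hive support.

1. Create a dummy test DataFrame and store it as CSV partitioned by year, month, and day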

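    import org.apache.spark.sql.SaveMode
    import org.apache.spark.sql.functions._
    import spark.implicits._

    // One row per day from 2020-06-09 to 2020-06-20, plus the derived partition columns
    val df = spark.range(1).withColumn("date",
      explode(sequence(to_date(lit("2020-06-09")), to_date(lit("2020-06-20")), expr("interval 1 day")))
    ).withColumn("year", year($"date"))
      .withColumn("month", month($"date"))
      .withColumn("day", dayofmonth($"date"))
    df.show(false)
    df.printSchema()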

    /**
      * +---+----------+----+-----+---+
      * |id |date      |year|month|day|
      * +---+----------+----+-----+---+
      * |0  |2020-06-09|2020|6    |9  |
      * |0  |2020-06-10|2020|6    |10 |
      * |0  |2020-06-11|2020|6    |11 |
      * |0  |2020-06-12|2020|6    |12 |
      * |0  |2020-06-13|2020|6    |13 |
      * |0  |2020-06-14|2020|6    |14 |
      * |0  |2020-06-15|2020|6    |15 |
      * |0  |2020-06-16|2020|6    |16 |
      * |0  |2020-06-17|2020|6    |17 |
      * |0  |2020-06-18|2020|6    |18 |
      * |0  |2020-06-19|2020|6    |19 |
      * |0  |2020-06-20|2020|6    |20 |
      * +---+----------+----+-----+---+
      *
      * root
      * |-- id: long (nullable = false)
      * |-- date: date (nullable = false)
      * |-- year: integer (nullable = false)
      * |-- month: integer (nullable = false)
      * |-- day: integer (nullable = false)
      */
    df.repartition(2).write.partitionBy("year", "month", "day")
      .option("header", true)
      .mode(SaveMode.Overwrite)
      .csv("/Users/sokale/models/hive_table")

File structure

    /**
      * File structure - /Users/sokale/models/hive_table
      * ---------------
      * year=2020
      * year=2020/month=6
      * year=2020/month=6/day=10
      * |- part...csv files (same part files for all the below directories)
      * year=2020/month=6/day=11
      * year=2020/month=6/day=12
      * year=2020/month=6/day=13
      * year=2020/month=6/day=14
      * year=2020/month=6/day=15
      * year=2020/month=6/day=16
      * year=2020/month=6/day=17
      * year=2020/month=6/day=18
      * year=2020/month=6/day=19
      * year=2020/month=6/day=20
      * year=2020/month=6/day=9
      */

Read the partitioned table

val csvDF = spark.read.option("header", true)
      .csv("/Users/sokale/models/hive_table")

    csvDF.show(false)
    csvDF.printSchema()

    /**
      * +---+----------+----+-----+---+
      * |id |date      |year|month|day|
      * +---+----------+----+-----+---+
      * |0  |2020-06-20|2020|6    |20 |
      * |0  |2020-06-19|2020|6    |19 |
      * |0  |2020-06-09|2020|6    |9  |
      * |0  |2020-06-12|2020|6    |12 |
      * |0  |2020-06-10|2020|6    |10 |
      * |0  |2020-06-15|2020|6    |15 |
      * |0  |2020-06-16|2020|6    |16 |
      * |0  |2020-06-17|2020|6    |17 |
      * |0  |2020-06-13|2020|6    |13 |
      * |0  |2020-06-18|2020|6    |18 |
      * |0  |2020-06-14|2020|6    |14 |
      * |0  |2020-06-11|2020|6    |11 |
      * +---+----------+----+-----+---+
      *
      * root
      * |-- id: string (nullable = true)
      * |-- date: string (nullable = true)
      * |-- year: integer (nullable = true)
      * |-- month: integer (nullable = true)
      * |-- day: integer (nullable = true)
      */
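
To connect this back to the question's goal of filtering on year/month/day for speed, here is a minimal sketch of filtering the DataFrame read above on its partition columns. It assumes the same local path used in this answer and a local SparkSession; the object name PartitionFilterSketch and the view name hive_table_view are hypothetical. Because year, month, and day come from the directory names, filters on them generally let Spark skip directories that do not match; explain() can be used to check whether the physical plan lists partition filters.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Hypothetical object name, used only for this illustration.
object PartitionFilterSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session without Hive support, as in the answer.
    val spark = SparkSession.builder()
      .appName("partition-filter-sketch")
      .master("local[*]")
      .getOrCreate()

    // Same read as in the answer: partition columns are discovered
    // from the year=/month=/day= directory names.
    val csvDF = spark.read
      .option("header", true)
      .csv("/Users/sokale/models/hive_table")

    // Filter on the partition columns only.
    val oneDay = csvDF.filter(
      col("year") === 2020 && col("month") === 6 && col("day") === 9
    )

    // Inspect the physical plan for partition filters.
    oneDay.explain()
    oneDay.show(false)

    // The same filter expressed in SQL through a temporary view
    // (hypothetical view name).
    csvDF.createOrReplaceTempView("hive_table_view")
    spark.sql(
      "SELECT id, date FROM hive_table_view WHERE year = 2020 AND month = 6 AND day = 9"
    ).show(false)

    spark.stop()
  }
}
```

Note that this path reads the files directly and does not depend on partitions being registered in a metastore, which is one reason it can return data where the partitioned external table in the question did not.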

【Comments】:
