【Question Title】:Spark HiveContext - reading from external partitioned Hive table delimiter issue
【Posted】:2016-08-20 01:47:58
【Question Description】:

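I have an external partitioned Hive table whose underlying files use ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'. Reading the data directly through Hive works fine, but when the table is read through Spark's DataFrame API, the '|' delimiter is not taken into account.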

Create the external partitioned table:

hive> create external table external_delimited_table(value1 string, value2 string)
partitioned by (year string, month string, day string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
location '/client/edb/poc_database/external_delimited_table';

Create a data file containing a single row and place it in the external partitioned table's location:

shell>echo "one|two" >> table_data.csv
shell>hadoop fs -mkdir -p /client/edb/poc_database/external_delimited_table/year=2016/month=08/day=20
shell>hadoop fs -copyFromLocal table_data.csv /client/edb/poc_database/external_delimited_table/year=2016/month=08/day=20

Register the partition:

hive> alter table external_delimited_table add partition (year='2016',month='08',day='20');

Sanity check:

hive> select * from external_delimited_table;
+----------------------------------+----------------------------------+--------------------------------+---------------------------------+-------------------------------+
| external_delimited_table.value1  | external_delimited_table.value2  | external_delimited_table.year  | external_delimited_table.month  | external_delimited_table.day  |
+----------------------------------+----------------------------------+--------------------------------+---------------------------------+-------------------------------+
| one                              | two                              | 2016                           | 08                              | 20                            |
+----------------------------------+----------------------------------+--------------------------------+---------------------------------+-------------------------------+

Spark code:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.{SparkContext, SparkConf}
object TestHiveContext {

  def main(args: Array[String]): Unit = {

    val conf = new SparkConf().setAppName("Test Hive Context")

    val spark = new SparkContext(conf)
    val hiveContext  = new HiveContext(spark)

    val dataFrame: DataFrame = hiveContext.sql("SELECT * FROM external_delimited_table")
    dataFrame.show()

    spark.stop()
  }
}

Output of dataFrame.show():

+-------+------+----+-----+---+
| value1|value2|year|month|day|
+-------+------+----+-----+---+
|one|two|  null|2016|   08| 20|
+-------+------+----+-----+---+
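
The output above is what you would see if each line were split on some delimiter other than `|`: the whole line lands in `value1` and `value2` is null. The following is a minimal sketch in plain Scala, assuming (the source does not confirm this) that the affected Spark version fell back to Hive's default field delimiter `\u0001` (Ctrl-A). `splitRow` is a hypothetical helper written for illustration, not Spark's actual parsing code.

```scala
object DelimiterSketch {
  // Hypothetical helper: split a raw line into exactly `numCols` fields,
  // padding missing trailing fields with null.
  def splitRow(line: String, delimiter: Char, numCols: Int): Seq[String] = {
    // Quote the delimiter so that '|' is not treated as regex alternation.
    val parts = line.split(java.util.regex.Pattern.quote(delimiter.toString), -1)
    (0 until numCols).map(i => if (i < parts.length) parts(i) else null)
  }

  def main(args: Array[String]): Unit = {
    val line = "one|two"
    // Declared delimiter '|': the line yields two separate fields.
    println(splitRow(line, '|', 2))
    // Assumed fallback delimiter Ctrl-A: the line stays in one field, the second is null.
    println(splitRow(line, '\u0001', 2))
  }
}
```

Comparing the two printed rows with the Hive and Spark outputs shown earlier is a quick way to check whether a delimiter mismatch explains the symptom.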

【Question Discussion】:

    Tags: hive apache-spark-sql hivecontext


    【Solution 1】:

    This turned out to be an issue in Spark 1.5.0. The problem does not occur in version 1.6.0:

    scala> sqlContext.sql("select * from external_delimited_table")
    res2: org.apache.spark.sql.DataFrame = [value1: string, value2: string, year: string, month: string, day: string]
    
    scala> res2.show
    +------+------+----+-----+---+
    |value1|value2|year|month|day|
    +------+------+----+-----+---+
    |   one|   two|2016|   08| 20|
    +------+------+----+-----+---+
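
If upgrading from 1.5.0 is not an option, one possible workaround is to bypass the Hive table definition and parse the partition files directly. This sketch is not part of the original answer. It assumes the partition path from the question, hard-codes the partition values as literal columns, and uses two hypothetical names, `DelimitedRow` and `parseLine`.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.lit

// Hypothetical row type mirroring the two data columns of the table.
case class DelimitedRow(value1: String, value2: String)

object ReadDelimitedDirectly {
  // Hypothetical helper: split on a literal '|' and keep only well-formed two-field lines.
  def parseLine(line: String): Option[DelimitedRow] =
    line.split("\\|", -1) match {
      case Array(v1, v2) => Some(DelimitedRow(v1, v2))
      case _             => None
    }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("Read Delimited Directly"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Print the running version to confirm whether the affected release is in use.
    println(s"Running on Spark ${sc.version}")

    // Assumption: the single partition directory created in the question.
    val partitionPath =
      "/client/edb/poc_database/external_delimited_table/year=2016/month=08/day=20"

    val dataFrame = sc.textFile(partitionPath)
      .flatMap(line => parseLine(line).toList)
      .toDF()
      // Partition values are not stored in the files, so add them as literals.
      .withColumn("year", lit("2016"))
      .withColumn("month", lit("08"))
      .withColumn("day", lit("20"))

    dataFrame.show()
    sc.stop()
  }
}
```

The trade-off is that the metastore is no longer consulted, so the delimiter, the column names, and the partition values all have to be kept in sync with the table definition by hand.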
    

    【Discussion】:
