Spark SQL not able to read HDFS subfolders recursively of a Hive table (Spark - 2.4.6): answers

【Question title】: Spark SQL not able to read HDFS subfolders recursively of a hive table (Spark - 2.4.6)
【Posted】: 2022-01-23 12:11:56
【Question】:

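We are trying to read a Hive table with Spark SQL, but it shows no records (the output contains 0 records). On checking, we found that the table's HDFS files are stored in multiple subdirectories: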

hive> [hadoop@ip-10-37-195-106 CDPJobs]$ hdfs dfs -ls /its/cdp/refn/cot_tbl_cnt_hive/     
Found 18 items     
drwxrwxr-x+ - hadoop hadoop 0 2021-12-19 20:17 /its/cdp/refn/cot_tbl_cnt_hive/1     
drwxrwxr-x+ - hadoop hadoop 0 2021-12-19 20:17 /its/cdp/refn/cot_tbl_cnt_hive/10     
drwxrwxr-x+ - hadoop hadoop 0 2021-12-19 20:17 /its/cdp/refn/cot_tbl_cnt_hive/11     
drwxrwxr-x+ - hadoop hadoop 0 2021-12-19 20:17 /its/cdp/refn/cot_tbl_cnt_hive/12     
drwxrwxr-x+ - hadoop hadoop 0 2021-12-19 20:17 /its/cdp/refn/cot_tbl_cnt_hive/13     
drwxrwxr-x+ - hadoop hadoop 0 2021-12-19 20:17 /its/cdp/refn/cot_tbl_cnt_hive/14     
drwxrwxr-x+ - hadoop hadoop 0 2021-12-19 20:17 /its/cdp/refn/cot_tbl_cnt_hive/15

We tried setting the following properties in the spark-defaults.conf file, but the issue persists.

set spark.hadoop.hive.supports.subdirectories = true;    
set spark.hadoop.hive.mapred.supports.subdirectories = true;     
set spark.hadoop.hive.input.dir.recursive=true;     
set mapreduce.input.fileinputformat.input.dir.recursive=true;          
set recursiveFileLookup=true;            
set spark.hive.mapred.supports.subdirectories=true;         
set spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive=true;
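
The lines above use the `set key=value;` form of a SQL session, whereas the documented format of spark-defaults.conf is one key and one value per line, separated by whitespace. The post does not say whether the lines were placed in the file exactly as shown, so the following is only a minimal sketch of the conversion. The helper name `to_spark_defaults` is hypothetical, and dropping keys that do not start with `spark.` rests on the assumption that Spark ignores such keys in that file.

```python
import re

# Matches SQL-session style statements such as "set key = value;"
_SET_STATEMENT = re.compile(r"^\s*set\s+([^\s=]+)\s*=\s*(.*?)\s*;?\s*$", re.IGNORECASE)


def to_spark_defaults(lines):
    """Convert 'set key=value;' statements into 'key value' lines.

    Hypothetical helper: keys without the 'spark.' prefix are dropped,
    on the assumption that spark-defaults.conf does not accept them.
    """
    converted = []
    for line in lines:
        match = _SET_STATEMENT.match(line)
        if match is None:
            continue  # not a "set key = value;" statement
        key, value = match.groups()
        if not key.startswith("spark."):
            continue  # e.g. "recursiveFileLookup" or bare "mapreduce.*" keys
        converted.append(f"{key} {value}")
    return converted


statements = [
    "set spark.hadoop.hive.supports.subdirectories = true;    ",
    "set recursiveFileLookup=true;            ",
    "set spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive=true;",
]
for entry in to_spark_defaults(statements):
    print(entry)
```

Under these assumptions, only the `spark.`-prefixed properties survive, each written as a key followed by its value.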

Does anyone know a solution to this issue? We are using Spark 2.4.6.

Update (solution found):

I changed the property below to false, and Spark can now read the data from the subdirectories.

set spark.sql.hive.convertMetastoreOrc=false;
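
The property name suggests the table is stored as ORC. A plausible explanation, offered as an assumption rather than something the post confirms: with `spark.sql.hive.convertMetastoreOrc` left at `true`, Spark reads the table with its built-in ORC reader and does its own file listing, so the Hive and MapReduce recursion properties are not consulted. Setting it to `false` sends the read through the Hive SerDe path, where they are. A minimal PySpark sketch of applying the fix at session creation follows. The helper names `hive_serde_read_options` and `build_session` are hypothetical, and the recursion property is the one already listed in the question.

```python
def hive_serde_read_options(recursive=True):
    """Return session options that keep ORC Hive tables on the Hive SerDe read path."""
    options = {
        # The setting reported in the update as the fix.
        "spark.sql.hive.convertMetastoreOrc": "false",
    }
    if recursive:
        # Property taken from the question; assumed to still be needed alongside the fix.
        options["spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive"] = "true"
    return options


def build_session(app_name, options):
    """Create a Hive-enabled SparkSession with the given options applied."""
    # Imported lazily so the option helper above can be used without PySpark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name).enableHiveSupport()
    for key, value in options.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


for key, value in hive_serde_read_options().items():
    print(key, value)
```

With PySpark and a reachable Hive metastore, `build_session("subdir-read-check", hive_serde_read_options())` would return a session on which a row count of the affected table can be checked. The application name here is a placeholder.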

【Comments】:

Tags: apache-spark hadoop hive apache-spark-sql hdfs

【Solution 1】:

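from pyspark import SparkConf
from pyspark.sql import SparkSession

sparkSession = (SparkSession
                    .builder
                    .appName('USS - Unified Scheme of Sells')
                    # Pass the SparkConf on its own: config() ignores key/value when conf= is also given.
                    .config(conf=SparkConf())
                    .config("hive.metastore.uris", "thrift://probighhwm001:9083")
                    # Ask the Hive/MapReduce input formats to descend into subdirectories.
                    .config("hive.input.dir.recursive", "true")
                    .config("hive.mapred.supports.subdirectories", "true")
                    .config("hive.supports.subdirectories", "true")
                    .config("mapred.input.dir.recursive", "true")
                    .enableHiveSupport()
                    .getOrCreate()
                    )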

【Discussion】:

I have already tried these properties in Spark, but it does not work...