starqiu
# Background

Our data mining platform has a pressing need for data statistics. Spark itself already does some of this work, so this post surveys the statistics features Spark already supports, as a basis for later extension.

# Preparing the Data

Download the [Iris dataset](https://archive.ics.uci.edu/ml/machine-learning-databases/iris/) (reference 6). The file comes as `iris.data`; **rename the `.data` suffix to `.csv` first (this does not affect its use; it just keeps the later steps unchanged)**.

The data looks like this:

| SepalLength | SepalWidth | PetalLength | PetalWidth | Name |
| ----------- | ---------- | ----------- | ---------- | ----------- |
| 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 4.9 | 3 | 1.4 | 0.2 | Iris-setosa |
| 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 5 | 3.6 | 1.4 | 0.2 | Iris-setosa |
| 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa |
| 4.6 | 3.4 | 1.4 | 0.3 | Iris-setosa |

See `Iris Dataset Description` in the appendix for details.

First, load the data into the Spark SQL warehouse:

```sql
CREATE TABLE IF NOT EXISTS iris (
  SepalLength FLOAT,
  SepalWidth  FLOAT,
  PetalLength FLOAT,
  PetalWidth  FLOAT,
  Species     VARCHAR(100)
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/mnt/disk1/starqiu/iris';
```

# Table Analysis and Statistics

The ANALYZE TABLE syntax is:

```sql
ANALYZE TABLE [db_name.]table_name COMPUTE STATISTICS [analyze_option]
```

> Collect statistics about the table that can be used by the query optimizer to find a better plan.

As the documentation notes, analyzing a table lets **Spark SQL optimize queries** for better performance. Spark SQL's cost-based optimizer (CBO, enabled via `spark.sql.cbo.enabled`) consumes these statistics, which is especially useful in multi-table join queries.

The `analyze_option` parameter falls into two categories: table statistics and column statistics.

## Table Statistics

Basic table statistics generally include the total row count and the storage size.

Table statistics usage:

```sql
ANALYZE TABLE [db_name.]table_name COMPUTE STATISTICS [NOSCAN]
```

> Collect only basic statistics for the table (number of rows, size in bytes).
>
> **NOSCAN**
> Collect only statistics that do not require scanning the whole table (that is, size in bytes).

Running `ANALYZE TABLE iris COMPUTE STATISTICS;` collects the table's row count and size in bytes. If a full table scan is undesirable, add the `NOSCAN` keyword: no scan is performed, but only the size in bytes is collected.

The syntax for describing table statistics is:

```sql
DESCRIBE [EXTENDED] [db_name.]table_name
```

> Return the metadata of an existing table (column names, data types, and comments). If the table does not exist, an exception is thrown.
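As an aside, the cost difference behind `NOSCAN` can be mimicked in plain Python. This is only a toy sketch over a local CSV with hypothetical sample rows, not Spark's actual implementation: the size in bytes is available from file metadata without touching the data, while the row count requires reading every record.

```python
import os
import tempfile

# A tiny iris-like CSV stands in for the table's data files (hypothetical rows).
rows = ["5.1,3.5,1.4,0.2,Iris-setosa", "4.9,3.0,1.4,0.2,Iris-setosa"]
path = os.path.join(tempfile.mkdtemp(), "iris.csv")
with open(path, "w") as f:
    f.write("\n".join(rows) + "\n")

# "NOSCAN" analogue: size in bytes comes from file metadata alone.
size_in_bytes = os.path.getsize(path)

# Full-statistics analogue: counting rows means scanning the whole file.
with open(path) as f:
    num_rows = sum(1 for _ in f)

print(size_in_bytes, num_rows)
```

This is why `NOSCAN` is cheap on large tables: the filesystem (or HDFS namenode) already knows the byte size, whereas a row count is proportional to the data volume.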
>
> **EXTENDED**
> Display detailed information about the table, including parent database, table type, storage information, and properties.

Running `DESCRIBE EXTENDED iris;` gives:

```shell
spark-sql> DESCRIBE EXTENDED iris;
SepalLength  float   NULL
SepalWidth   float   NULL
PetalLength  float   NULL
PetalWidth   float   NULL
Species      string  NULL

# Detailed Table Information
CatalogTable(
  Table: `default`.`iris`
  Owner: root
  Created: Sat Feb 16 17:24:32 CST 2019
  Last Access: Thu Jan 01 08:00:00 CST 1970
  Type: EXTERNAL
  Schema: [StructField(SepalLength,FloatType,true), StructField(SepalWidth,FloatType,true), StructField(PetalLength,FloatType,true), StructField(PetalWidth,FloatType,true), StructField(Species,StringType,true)]
  Provider: hive
  Properties: [rawDataSize=-1, numFiles=0, transient_lastDdlTime=1550311815, totalSize=0, COLUMN_STATS_ACCURATE=false, numRows=-1]
  Statistics: sizeInBytes=3808, rowCount=150, isBroadcastable=false
  Storage(Location: hdfs://data126:8020/mnt/disk1/starqiu/iris, InputFormat: org.apache.hadoop.mapred.TextInputFormat, OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, Serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Properties: [field.delim=,, serialization.format=,])
  Partition Provider: Catalog)
Time taken: 0.112 seconds, Fetched 7 row(s)
```

The `Statistics:` line shows that the table has 150 rows and occupies 3808 bytes, about 4 KB.

## Column Statistics

Column statistics usage:

```sql
ANALYZE TABLE [db_name.]table_name COMPUTE STATISTICS FOR COLUMNS col1 [, col2, ...]
```

> Collect column statistics for the specified columns in addition to table statistics.
>
> Tip
>
> Use this command whenever possible because it collects more statistics so the optimizer can find better plans. Make sure to collect statistics for all columns used by the query.

The syntax for describing column statistics is:

```sql
DESCRIBE [EXTENDED] [db_name.]table_name column_name
```

> **EXTENDED**
> Display detailed information about the specified columns, including the column statistics collected by the command ANALYZE TABLE table_name COMPUTE STATISTICS FOR COLUMNS column_name [column_name, ...].

Note that this is a runtime-3.3 feature, and runtime-3.3 wraps Spark 2.2; see `Mapping between Databricks Runtime versions and Spark versions` in the appendix at the end of this post.

Run `ANALYZE TABLE iris COMPUTE STATISTICS FOR COLUMNS SepalLength, SepalWidth, PetalLength, PetalWidth, Species;` to compute statistics for the listed columns, then `DESCRIBE EXTENDED iris PetalWidth;` to fetch the statistics of a single column. The result:

```shell
spark-sql> ANALYZE TABLE iris COMPUTE STATISTICS FOR COLUMNS SepalLength, SepalWidth, PetalLength, PetalWidth, Species;
Time taken: 4.45 seconds
spark-sql> DESCRIBE EXTENDED iris PetalWidth;
col_name        PetalWidth
data_type       float
comment         NULL
min             0.10000000149011612
max             2.5
num_nulls       0
distinct_count  21
avg_col_len     4
max_col_len     4
histogram       NULL
Time taken: 0.104 seconds, Fetched 10 row(s)
```

In our tests, Spark 2.2.2 does not support this statement, but Spark 2.4.0 does. Where it is unsupported, the same information can be read from the Hive metastore database with the following SQL:

```sql
SELECT param_key, param_value
FROM TABLE_PARAMS tp, TBLS t
WHERE tp.tbl_id = t.tbl_id
  AND tbl_name = 'iris'
  AND param_key LIKE 'spark.sql.stat%';
```

Below are the statistics for the PetalWidth column; they include the **distinct count, null count, minimum, maximum, average length, and maximum length**:

| param_key | param_value |
| ------------------------------------------------------ | ------------------- |
| spark.sql.statistics.colStats.PetalWidth.avgLen | 4 |
| spark.sql.statistics.colStats.PetalWidth.distinctCount | 21 |
| spark.sql.statistics.colStats.PetalWidth.max | 2.5 |
| spark.sql.statistics.colStats.PetalWidth.maxLen | 4 |
| spark.sql.statistics.colStats.PetalWidth.min | 0.10000000149011612 |
| spark.sql.statistics.colStats.PetalWidth.nullCount | 0 |
| spark.sql.statistics.colStats.PetalWidth.version | 1 |

# Summary

These statistics are useful not only for understanding data quality; Spark SQL queries can also be optimized with them, further improving speed. A follow-up post will cover how the CBO uses this information.

It is not yet clear whether the Spark features in Databricks Runtime differ from the open-source version, but open-source Spark 2.4 supports the analysis and statistics operations above, so we recommend upgrading later platform projects to Spark 2.4.

# Appendix

## Iris Dataset Description

The Iris dataset is a classic classification benchmark, collected and compiled by Fisher
in 1936. Also known as the iris flower dataset, it is a multivariate dataset commonly used in classification tasks. It contains 150 records in 3 classes, 50 per class, each record with 4 attributes, drawn from measurements of iris flowers: 50 samples from each of 3 different species. One of the species is linearly separable from the other two; the remaining two are not linearly separable from each other.

**Four attributes:**

- Sepal.Length (sepal length), in cm;
- Sepal.Width (sepal width), in cm;
- Petal.Length (petal length), in cm;
- Petal.Width (petal width), in cm.

**Three species:**

- Iris Setosa;
- Iris Versicolour;
- Iris Virginica.

## Databricks Runtime

![DatabricksUnifiedAnalyticsPlatform](https://img2018.cnblogs.com/blog/1226548/202001/1226548-20200102134042808-1709154063.png)

Runtime is part of the Databricks Unified Analytics Platform, which the official site describes as follows:

> Accelerate innovation by unifying data science, engineering and business, with the Databricks Unified Analytics Platform, from the original creators of Apache Spark™.

Runtime itself is described as:

> Simplify operations and get up to 50x better performance with cloud-optimized Apache Spark™.

In short, it is cloud-optimized Spark intended to simplify operations and deliver up to 50x better performance.

## [Mapping between Databricks Runtime versions and Spark versions](https://docs.databricks.com/release-notes/runtime/databricks-runtime-ver.html#id6)

**Current Releases**

| Version | Spark Version | Release Date | Deprecation Announcement | Deprecation Date |
| ------- | ------------- | ------------ | ------------------------ | ---------------- |
| 5.2 | Spark 2.4 | Jan 24, 2019 | May 27, 2019 | Sep 30, 2019 |
| 5.1 | Spark 2.4 | Dec 18, 2018 | Apr 18, 2019 | Aug 19, 2019 |
| 5.0 | Spark 2.4 | Nov 08, 2018 | Mar 08, 2019 | Jul 08, 2019 |
| 4.3 | Spark 2.3 | Aug 10, 2018 | Dec 09, 2018 | Apr 09, 2019 |
| 4.2 | Spark 2.3 | Jul 09, 2018 | Nov 05, 2018 | Mar 05, 2019 |
| 3.5-LTS | Spark 2.2 | Dec 21, 2017 | Jan 02, 2019 | Jan 02, 2020 |

**Marked for Deprecation**

| Version | Spark Version | Release Date | Deprecation Announcement | Deprecation Date |
| ------- | ------------- | ------------ | ------------------------ | ---------------- |
| 4.3 | Spark 2.3 | Aug 10, 2018 | Dec 09, 2018 | Apr 09, 2019 |
| 4.2 | Spark 2.3 | Jul 09, 2018 | Nov 05, 2018 | Mar 05, 2019 |
| 3.5-LTS | Spark 2.2 | Dec 21, 2017 | Jan 02, 2019 | Jan 02, 2020 |

**Deprecated Releases**
Version | Spark Version | Release Date | Deprecation Announcement | Deprecation Date |
| ------------------------- | ------------- | ------------ | ------------------------ | ---------------- |
| 4.1 | Spark 2.3 | May 17, 2018 | Sep 17, 2018 | Jan 17, 2019 |
| 4.0 | Spark 2.3 | Mar 01, 2018 | Jul 01, 2018 | Nov 01, 2018 |
| 3.4 | Spark 2.2 | Nov 20, 2017 | Mar 31, 2018 | Jul 30, 2018 |
| 3.3 | Spark 2.2 | Oct 04, 2017 | Mar 31, 2018 | Jul 30, 2018 |
| 3.2 | Spark 2.2 | Sep 05, 2017 | Jan 30, 2018 | Apr 30, 2018 |
| 3.1 | Spark 2.2 | Aug 04, 2017 | – | Oct 30, 2017 |
| 3.0 | Spark 2.2 | Jul 11, 2017 | – | Sep 05, 2017 |
| Spark 2.1 (Auto Updating) | Spark 2.1 | Dec 22, 2016 | Mar 31, 2018 | Jul 30, 2018 |
| Spark 2.1.1-db6 | Spark 2.1 | Aug 03, 2017 | Mar 31, 2018 | Jul 30, 2018 |
| Spark 2.1.1-db5 | Spark 2.1 | May 31, 2017 | – | Aug 03, 2017 |
| Spark 2.1.1-db4 | Spark 2.1 | Apr 25, 2017 | Mar 31, 2018 | Jul 30, 2018 |
| Spark 2.0 (Auto Updating) | Spark 2.0 | Jul 26, 2016 | Jan 30, 2018 | Apr 30, 2018 |
| Spark 2.0.2-db4 | Spark 2.0 | Mar 24, 2017 | Jan 30, 2018 | Apr 30, 2018 |
| Spark 1.6.3-db2 | Spark 1.6 | Mar 24, 2017 | Jan 30, 2018 | Jun 30, 2018 |

# References

1. https://docs.databricks.com/spark/latest/spark-sql/language-manual/analyze-table.html
2. https://docs.databricks.com/spark/latest/spark-sql/language-manual/describe-table.html
3. https://docs.databricks.com/spark/latest/spark-sql/cbo.html
4. https://docs.databricks.com/release-notes/runtime/databricks-runtime-ver.html#versioning
5. https://blog.csdn.net/Albert201605/article/details/82313139
6. https://archive.ics.uci.edu/ml/datasets/Iris
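As a closing illustration of the column statistics discussed above (min, max, null count, distinct count, average and maximum column length), here is a plain-Python sketch of their definitions. The sample values are hypothetical, and the length computation is a simplified stand-in, not Spark's implementation.

```python
# Recompute the statistics that ANALYZE ... FOR COLUMNS reports, over a few
# hypothetical PetalWidth-like values; None stands in for a SQL NULL.
values = [0.2, 0.2, 0.3, 0.4, 2.5, 0.1, None]

non_null = [v for v in values if v is not None]
stats = {
    "min": min(non_null),
    "max": max(non_null),
    "num_nulls": sum(1 for v in values if v is None),
    "distinct_count": len(set(non_null)),
    # Spark reports avg/max length in bytes (4 for a float column); the
    # string length of each value is used here as a simple stand-in.
    "avg_col_len": sum(len(str(v)) for v in non_null) / len(non_null),
    "max_col_len": max(len(str(v)) for v in non_null),
}
print(stats)
```

These are exactly the quantities the CBO feeds into its cardinality and selectivity estimates, which is why collecting them for all columns used by a query pays off.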
