Author: 明翼 (XGogo)
Source: http://www.cnblogs.com/seaspring/
The copyright of this article belongs to the author and cnblogs. Reposting is welcome, but this notice must be retained and a clearly visible link to the original must be given on the reposted page; otherwise the author reserves the right to pursue legal liability.
Commercial use is not permitted; for commercial use, please contact:
-------------
QQ: 107463366
WeChat: shinelife
-------------
*************************************************************************************************************************************************
Starting the Spark History Server with the default configuration fails with:
[root@hadoop03 sbin]# ./start-history-server.sh
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.spark.deploy.history.HistoryServer$.main(HistoryServer.scala:258)
at org.apache.spark.deploy.history.HistoryServer.main(HistoryServer.scala)
Caused by: java.lang.IllegalArgumentException: Log directory specified does not exist: file:/tmp/spark-events. Did you configure the correct one through spark.history.fs.logDirectory?
at org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$startPolling(FsHistoryProvider.scala:201)
at org.apache.spark.deploy.history.FsHistoryProvider.initialize(FsHistoryProvider.scala:153)
at org.apache.spark.deploy.history.FsHistoryProvider.<init>(FsHistoryProvider.scala:149)
at org.apache.spark.deploy.history.FsHistoryProvider.<init>(FsHistoryProvider.scala:75)
... 6 more
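The stack trace says it plainly: the default event-log directory does not exist. If you just want the server to come up with local-filesystem logs (before switching to HDFS as below), a quick workaround is to create the default directory first:

```shell
# Create the default event-log directory the History Server expects.
# file:/tmp/spark-events is the built-in default for spark.history.fs.logDirectory.
mkdir -p /tmp/spark-events
```

With the directory in place, start-history-server.sh no longer throws the IllegalArgumentException.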
The basic configuration is as follows:
spark-defaults.conf
spark.eventLog.enabled true
spark.eventLog.dir hdfs://ns/directory
# hdfs://ns/directory must already exist on HDFS or startup will fail; ns is the HDFS nameservice of your cluster
spark.eventLog.compress true
spark-env.sh
export SPARK_HISTORY_OPTS="-Dspark.history.ui.port=18080 \
-Dspark.history.retainedApplications=3 -Dspark.history.fs.logDirectory=hdfs://ns/directory -Dspark.history.fs.cleaner.enabled=true -Dspark.history.fs.cleaner.interval=1d -Dspark.history.fs.cleaner.maxAge=2d"
# Each property needs its own -D prefix, and cleaner.interval/maxAge only take effect when spark.history.fs.cleaner.enabled is true
Start start-history-server.sh again.
Visit the History Server web UI at http://hadoop03:18080.
The list is empty right after startup; after running Spark SQL and SparkPi, their runs show up as history jobs.
(Screenshot: details page of one of the App IDs.)
(Screenshot: the event-log files under the HDFS directory /directory.)
Descriptions of the History-Server-related configuration parameters:
1) spark.history.updateInterval
Default: 10
Interval, in seconds, at which the log information is refreshed.
2) spark.history.retainedApplications
Default: 50
Number of application histories kept in memory. When this cap is exceeded, the oldest application's information is dropped; accessing a dropped application again forces its page to be rebuilt.
3) spark.history.ui.port
Default: 18080
Web port of the History Server.
4) spark.history.kerberos.enabled
Default: false
Whether Kerberos is used to log in to and access the History Server. This is useful when the persistence layer lives on HDFS in a secured cluster. If set to true, the following two properties must also be configured.
5) spark.history.kerberos.principal
Default: (none)
Kerberos principal name used by the History Server.
6) spark.history.kerberos.keytab
Default: (none)
Location of the Kerberos keytab file used by the History Server.
7) spark.history.ui.acls.enable
Default: false
Whether ACLs are checked when authorizing users to view application information. If enabled, only the application owner and the users listed in spark.ui.view.acls may view the application; otherwise no check is performed.
8) spark.eventLog.enabled
Default: false
Whether Spark events are logged, which allows the web UI of a finished application to be reconstructed.
9) spark.eventLog.dir
Default: file:///tmp/spark-events
Path under which the event logs are stored. It can be an HDFS path starting with hdfs:// or a local path starting with file://; either way it must be created in advance.
10) spark.eventLog.compress
Default: false
Whether logged Spark events are compressed (requires spark.eventLog.enabled to be true); snappy is used by default.
Settings whose names start with spark.history must be configured via SPARK_HISTORY_OPTS in spark-env.sh; settings starting with spark.eventLog go into spark-defaults.conf.
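That last rule — spark.history.* settings belong in SPARK_HISTORY_OPTS in spark-env.sh, spark.eventLog.* settings belong in spark-defaults.conf — can be sketched as a tiny helper (hypothetical, for illustration only):

```python
def route_setting(name: str) -> str:
    """Return which file a Spark history-related setting belongs in.

    spark.history.* settings are read by the History Server process
    (SPARK_HISTORY_OPTS in spark-env.sh); spark.eventLog.* settings are
    read by the applications themselves (spark-defaults.conf).
    """
    if name.startswith("spark.history."):
        return "spark-env.sh (SPARK_HISTORY_OPTS)"
    if name.startswith("spark.eventLog."):
        return "spark-defaults.conf"
    return "other"

print(route_setting("spark.history.ui.port"))  # spark-env.sh (SPARK_HISTORY_OPTS)
print(route_setting("spark.eventLog.dir"))     # spark-defaults.conf
```

The split matters because the two groups of settings are read by different processes: the applications write the logs, the History Server reads them.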
Security options for the Spark History Server are covered in more detail in the Security page.
| Property Name | Default | Meaning |
|---|---|---|
| spark.history.provider | org.apache.spark.deploy.history.FsHistoryProvider | Name of the class implementing the application history backend. Currently there is only one implementation, provided by Spark, which looks for application logs stored in the file system. |
| spark.history.fs.logDirectory | file:/tmp/spark-events | For the filesystem history provider, the URL to the directory containing application event logs to load. This can be a local file:// path, an HDFS path hdfs://namenode/shared/spark-logs or that of an alternative filesystem supported by the Hadoop APIs. |
| spark.history.fs.update.interval | 10s | The period at which the filesystem history provider checks for new or updated logs in the log directory. A shorter interval detects new applications faster, at the expense of more server load re-reading updated applications. As soon as an update has completed, listings of the completed and incomplete applications will reflect the changes. |
| spark.history.retainedApplications | 50 | The number of applications to retain UI data for in the cache. If this cap is exceeded, then the oldest applications will be removed from the cache. If an application is not in the cache, it will have to be loaded from disk if it is accessed from the UI. |
| spark.history.ui.maxApplications | Int.MaxValue | The number of applications to display on the history summary page. Application UIs are still available by accessing their URLs directly even if they are not displayed on the history summary page. |
| spark.history.ui.port | 18080 | The port to which the web interface of the history server binds. |
| spark.history.kerberos.enabled | false | Indicates whether the history server should use kerberos to login. This is required if the history server is accessing HDFS files on a secure Hadoop cluster. If this is true, it uses the configs spark.history.kerberos.principal and spark.history.kerberos.keytab. |
| spark.history.kerberos.principal | (none) | Kerberos principal name for the History Server. |
| spark.history.kerberos.keytab | (none) | Location of the kerberos keytab file for the History Server. |
| spark.history.fs.cleaner.enabled | false | Specifies whether the History Server should periodically clean up event logs from storage. |
| spark.history.fs.cleaner.interval | 1d | How often the filesystem job history cleaner checks for files to delete. Files are only deleted if they are older than spark.history.fs.cleaner.maxAge. |
| spark.history.fs.cleaner.maxAge | 7d | Job history files older than this will be deleted when the filesystem history cleaner runs. |
| spark.history.fs.endEventReparseChunkSize | 1m | How many bytes to parse at the end of log files looking for the end event. This is used to speed up generation of application listings by skipping unnecessary parts of event log files. It can be disabled by setting this config to 0. |
| spark.history.fs.inProgressOptimization.enabled | true | Enable optimized handling of in-progress logs. This option may leave finished applications that fail to rename their event logs listed as in-progress. |
| spark.history.fs.numReplayThreads | 25% of available cores | Number of threads that will be used by history server to process event logs. |
| spark.history.store.maxDiskUsage | 10g | Maximum disk usage for the local directory where the cache application history information are stored. |
| spark.history.store.path | (none) | Local directory where to cache application history data. If set, the history server will store application data on disk instead of keeping it in memory. The data written to disk will be re-used in the event of a history server restart. |
From <http://spark.apache.org/docs/latest/monitoring.html>
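The cleaner settings (spark.history.fs.cleaner.interval, spark.history.fs.cleaner.maxAge) implement a simple policy: at every interval, delete event-log files whose modification time exceeds maxAge. A minimal sketch of that policy — an illustration only, not the actual FsHistoryProvider code:

```python
import os
import tempfile
import time

def clean_old_logs(log_dir: str, max_age_seconds: float, now: float) -> list:
    """Delete files in log_dir older than max_age_seconds; return deleted names."""
    deleted = []
    for name in os.listdir(log_dir):
        path = os.path.join(log_dir, name)
        if now - os.path.getmtime(path) > max_age_seconds:
            os.remove(path)
            deleted.append(name)
    return deleted

# Demo with a 7-day maxAge (the documented default): one log is 10 days
# old, one is 1 day old. Only the 10-day-old log should be removed.
log_dir = tempfile.mkdtemp()
now = time.time()
for name, age_days in [("app-old", 10), ("app-new", 1)]:
    path = os.path.join(log_dir, name)
    open(path, "w").close()
    os.utime(path, (now - age_days * 86400, now - age_days * 86400))

print(clean_old_logs(log_dir, 7 * 86400, now))  # ['app-old']
```

Note that cleaning only runs at all when spark.history.fs.cleaner.enabled is set to true.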
Question 1: what is the difference between the directories specified by spark.history.fs.logDirectory and spark.eventLog.dir?
Testing shows:
spark.eventLog.dir: all information produced while an application runs is written under the path this property specifies;
spark.history.fs.logDirectory: the Spark History Server page only displays the information found under this path.
For example, suppose spark.eventLog.dir initially points to hdfs://hadoop000:8020/directory and is later changed to hdfs://hadoop000:8020/directory2.
If spark.history.fs.logDirectory points to hdfs://hadoop000:8020/directory, it can only show the run logs of the applications under that directory, and vice versa.
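In other words, the History Server simply lists whatever event logs sit under its own logDirectory, regardless of where applications are currently writing. A toy illustration (local temp directories stand in for the two HDFS paths above):

```python
import os
import tempfile

def list_applications(log_directory: str) -> list:
    """Mimic the History Server: list only the event logs under its own directory."""
    return sorted(os.listdir(log_directory))

directory1 = tempfile.mkdtemp()  # stands in for hdfs://hadoop000:8020/directory
directory2 = tempfile.mkdtemp()  # stands in for hdfs://hadoop000:8020/directory2
open(os.path.join(directory1, "app-001"), "w").close()  # logged before the switch
open(os.path.join(directory2, "app-002"), "w").close()  # logged after the switch

# A History Server configured with directory1 sees only the old application:
print(list_applications(directory1))  # ['app-001']
```

To see all history in one place, keep both settings pointing at the same directory.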
Question 2: spark.history.retainedApplications=3 does not seem to take effect?
The History Server will list all applications. It will just retain a max number of them in memory. That option does not control how many applications are shown; it controls how much memory the HS will need.
Note: this parameter is not the number of application records shown on the page but the number held in memory; information already in memory can be rendered directly when its page is accessed.
For example, if the parameter is set to 10, then at most 10 applications' log information is kept in memory. When the 11th is added, the 1st is evicted; visiting the 1st application's page again then requires re-reading the log at the configured path to render the page.
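This caching behavior is essentially an LRU cache bounded by spark.history.retainedApplications. A minimal sketch (illustrative, not the actual History Server code):

```python
from collections import OrderedDict

class AppUICache:
    """LRU cache of rendered application UIs, bounded like
    spark.history.retainedApplications. An evicted app is not gone --
    it is just rebuilt from its event log on the next access."""

    def __init__(self, retained_applications: int):
        self.capacity = retained_applications
        self._cache = OrderedDict()

    def get(self, app_id: str) -> str:
        if app_id in self._cache:
            self._cache.move_to_end(app_id)       # cache hit: mark most recent
            return self._cache[app_id]
        ui = f"UI rebuilt from event log of {app_id}"  # miss: reload from disk
        self._cache[app_id] = ui
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)       # evict the oldest entry
        return ui

cache = AppUICache(retained_applications=10)
for i in range(11):
    cache.get(f"app-{i:03d}")        # the 11th insert evicts app-000
print("app-000" in cache._cache)     # False -> rebuilt on next access
```

This is why setting the value to 3 looks like a no-op on the summary page: the listing is unaffected, only the cache size changes.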
For details, see the official documentation: http://spark.apache.org/docs/latest/monitoring.html