Author: 明翼 (XGogo)
Source: http://www.cnblogs.com/seaspring/
Copyright of this article is shared by the author and cnblogs (博客园). Reposting is welcome, but without the author's consent this statement must be retained and a link to the original must be given in a prominent place on the article page; otherwise the author reserves the right to pursue legal liability.
Not for commercial use; for commercial use please contact:
-------------
QQ: 107463366
WeChat: shinelife
-------------

*************************************************************************************************************************************************

Starting the Spark History Server with the default configuration fails with the following error:

[[email protected] sbin]# ./start-history-server.sh
Exception in thread "main" java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at org.apache.spark.deploy.history.HistoryServer$.main(HistoryServer.scala:258)
        at org.apache.spark.deploy.history.HistoryServer.main(HistoryServer.scala)
Caused by: java.lang.IllegalArgumentException: Log directory specified does not exist: file:/tmp/spark-events. Did you configure the correct one through spark.history.fs.logDirectory?
        at org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$startPolling(FsHistoryProvider.scala:201)
        at org.apache.spark.deploy.history.FsHistoryProvider.initialize(FsHistoryProvider.scala:153)
        at org.apache.spark.deploy.history.FsHistoryProvider.<init>(FsHistoryProvider.scala:149)
        at org.apache.spark.deploy.history.FsHistoryProvider.<init>(FsHistoryProvider.scala:75)
        ... 6 more

 

The basic configuration is as follows:

spark-defaults.conf

spark.eventLog.enabled  true
spark.eventLog.dir      hdfs://ns/directory
# hdfs://ns/directory must already exist on HDFS, otherwise an error is raised; ns is your own cluster's nameservice name
spark.eventLog.compress true

spark-env.sh

# every property inside SPARK_HISTORY_OPTS must be passed as a -D JVM system property;
# the cleaner settings only take effect when spark.history.fs.cleaner.enabled=true
export SPARK_HISTORY_OPTS="-Dspark.history.ui.port=18080 -Dspark.history.retainedApplications=3 -Dspark.history.fs.logDirectory=hdfs://ns/directory -Dspark.history.fs.cleaner.enabled=true -Dspark.history.fs.cleaner.interval=1d -Dspark.history.fs.cleaner.maxAge=2d"
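Before the server is started, the event-log directory has to exist. A prerequisite step like the following would create it (the hdfs://ns/directory path is the example from above; the `command -v` and `$SPARK_HOME` guards are assumptions for environments where Hadoop or Spark may not be on the PATH):

```shell
# Create the HDFS event-log directory referenced by spark.eventLog.dir and
# spark.history.fs.logDirectory ("ns" is the cluster nameservice used above).
command -v hdfs >/dev/null && hdfs dfs -mkdir -p hdfs://ns/directory

# With the default configuration instead, creating the local directory
# file:/tmp/spark-events is enough to make the startup error go away.
mkdir -p /tmp/spark-events

# Then (re)start the History Server.
[ -n "$SPARK_HOME" ] && "$SPARK_HOME/sbin/start-history-server.sh"
echo "setup done"
```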

 

Start start-history-server.sh again, then visit the Spark History Server web UI at http://hadoop03:18080.

(screenshot: Spark History Server summary page)

The page is empty right after startup; the screenshot above shows the historical jobs that appeared after running Spark SQL and SparkPi.

Open one of the App IDs:

(screenshot: application detail page)

Check the HDFS directory /directory:

(screenshot: event-log files under /directory)

Description of the History Server configuration parameters:

1) spark.history.updateInterval
  Default: 10
  Interval, in seconds, at which log information is refreshed.

2) spark.history.retainedApplications
  Default: 50
  Number of application histories kept in memory. When this limit is exceeded, the oldest application's information is dropped; accessing a dropped application again requires rebuilding its page.

3) spark.history.ui.port
  Default: 18080
  Web port of the History Server.

4) spark.history.kerberos.enabled
  Default: false
  Whether to log in to the History Server via Kerberos; useful when the persistence layer sits on HDFS in a secured cluster. If set to true, the following two properties must also be configured.

5) spark.history.kerberos.principal
  Default: (none)
  Kerberos principal name used by the History Server.

6) spark.history.kerberos.keytab
  Default: (none)
  Location of the Kerberos keytab file used by the History Server.

7) spark.history.ui.acls.enable
  Default: false
  Whether to check ACLs when authorizing users to view application information. If enabled, only the application owner and the users listed in spark.ui.view.acls can view the application; otherwise no check is performed.

8) spark.eventLog.enabled
  Default: false
  Whether to record Spark events, used to reconstruct the web UI after an application finishes.

9) spark.eventLog.dir
  Default: file:///tmp/spark-events
  Path where event-log information is stored; it can be an HDFS path starting with hdfs:// or a local path starting with file://, and in either case must be created in advance.

10) spark.eventLog.compress
  Default: false
  Whether to compress recorded Spark events; only meaningful when spark.eventLog.enabled is true. Snappy is used by default.

Properties starting with spark.history must be configured in SPARK_HISTORY_OPTS in spark-env.sh; properties starting with spark.eventLog go in spark-defaults.conf.
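The rule above can be sketched as a tiny helper (hypothetical, not part of Spark) that routes a property name to the file it belongs in:

```python
# Hypothetical helper, not part of Spark: decide which configuration file a
# History-Server-related property belongs in, per the rule described above.
def config_file_for(key: str) -> str:
    if key.startswith("spark.history."):
        # Passed to the History Server JVM as -Dkey=value system properties.
        return "spark-env.sh (SPARK_HISTORY_OPTS)"
    if key.startswith("spark.eventLog."):
        # Application-side settings, read when a Spark job starts.
        return "spark-defaults.conf"
    raise ValueError(f"not a History-Server-related property: {key}")

print(config_file_for("spark.history.ui.port"))  # spark-env.sh (SPARK_HISTORY_OPTS)
print(config_file_for("spark.eventLog.dir"))     # spark-defaults.conf
```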

 

Security options for the Spark History Server are covered in more detail on the Security page.

spark.history.provider
  Default: org.apache.spark.deploy.history.FsHistoryProvider
  Name of the class implementing the application history backend. Currently there is only one implementation, provided by Spark, which looks for application logs stored in the file system.

spark.history.fs.logDirectory
  Default: file:/tmp/spark-events
  For the filesystem history provider, the URL to the directory containing application event logs to load. This can be a local file:// path, an HDFS path hdfs://namenode/shared/spark-logs, or that of an alternative filesystem supported by the Hadoop APIs.

spark.history.fs.update.interval
  Default: 10s
  The period at which the filesystem history provider checks for new or updated logs in the log directory. A shorter interval detects new applications faster, at the expense of more server load re-reading updated applications. As soon as an update has completed, listings of the completed and incomplete applications will reflect the changes.

spark.history.retainedApplications
  Default: 50
  The number of applications to retain UI data for in the cache. If this cap is exceeded, then the oldest applications will be removed from the cache. If an application is not in the cache, it will have to be loaded from disk if it is accessed from the UI.

spark.history.ui.maxApplications
  Default: Int.MaxValue
  The number of applications to display on the history summary page. Application UIs are still available by accessing their URLs directly even if they are not displayed on the history summary page.

spark.history.ui.port
  Default: 18080
  The port to which the web interface of the history server binds.

spark.history.kerberos.enabled
  Default: false
  Indicates whether the history server should use kerberos to login. This is required if the history server is accessing HDFS files on a secure Hadoop cluster. If this is true, it uses the configs spark.history.kerberos.principal and spark.history.kerberos.keytab.

spark.history.kerberos.principal
  Default: (none)
  Kerberos principal name for the History Server.

spark.history.kerberos.keytab
  Default: (none)
  Location of the kerberos keytab file for the History Server.

spark.history.fs.cleaner.enabled
  Default: false
  Specifies whether the History Server should periodically clean up event logs from storage.

spark.history.fs.cleaner.interval
  Default: 1d
  How often the filesystem job history cleaner checks for files to delete. Files are only deleted if they are older than spark.history.fs.cleaner.maxAge.

spark.history.fs.cleaner.maxAge
  Default: 7d
  Job history files older than this will be deleted when the filesystem history cleaner runs.

spark.history.fs.endEventReparseChunkSize
  Default: 1m
  How many bytes to parse at the end of log files looking for the end event. This is used to speed up generation of application listings by skipping unnecessary parts of event log files. It can be disabled by setting this config to 0.

spark.history.fs.inProgressOptimization.enabled
  Default: true
  Enable optimized handling of in-progress logs. This option may leave finished applications that fail to rename their event logs listed as in-progress.

spark.history.fs.numReplayThreads
  Default: 25% of available cores
  Number of threads that will be used by history server to process event logs.

spark.history.store.maxDiskUsage
  Default: 10g
  Maximum disk usage for the local directory where the cached application history information is stored.

spark.history.store.path
  Default: (none)
  Local directory where to cache application history data. If set, the history server will store application data on disk instead of keeping it in memory. The data written to disk will be re-used in the event of a history server restart.

Source: <http://spark.apache.org/docs/latest/monitoring.html>

Question 1: what is the difference between the directories specified by spark.history.fs.logDirectory and spark.eventLog.dir?

Testing shows:

spark.eventLog.dir: all information produced while an application runs is recorded under the path this property specifies;

spark.history.fs.logDirectory: the Spark History Server page only displays the information found under this path.

For example, if spark.eventLog.dir initially pointed to hdfs://hadoop000:8020/directory and was later changed to hdfs://hadoop000:8020/directory2, then a spark.history.fs.logDirectory pointing to hdfs://hadoop000:8020/directory can only display the application logs under that directory, and vice versa.
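In practice, then, the two properties are usually pointed at the same directory so that everything applications log is also what the History Server displays. A minimal pairing, using the example paths above:

```
# spark-defaults.conf — where running applications write their event logs
spark.eventLog.enabled  true
spark.eventLog.dir      hdfs://hadoop000:8020/directory

# spark-env.sh — where the History Server reads event logs from
export SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=hdfs://hadoop000:8020/directory"
```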

 

Question 2: spark.history.retainedApplications=3 does not seem to take effect?

The History Server will list all applications. It will just retain a max number of them in memory. That option does not control how many applications are shown; it controls how much memory the HS will need.

Note: this parameter is not the number of application records shown on the page, but the number held in memory; information held in memory can be read and rendered directly when the page is accessed.

For example, if the parameter is set to 10, at most 10 applications' log information is kept in memory. When the 11th is added, the first is evicted; accessing the first application's page again then requires re-reading the event logs from the configured path to render the page.
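The behavior described above is essentially an LRU cache of rendered application UIs. A minimal Python sketch (hypothetical, not Spark's actual implementation; class and method names are made up for illustration):

```python
from collections import OrderedDict

# Hypothetical sketch of spark.history.retainedApplications: an LRU cache
# of rendered application UIs, capped at `retained` entries.
class AppUICache:
    def __init__(self, retained: int):
        self.retained = retained
        self._cache = OrderedDict()          # app_id -> rendered UI data

    def get(self, app_id: str) -> str:
        if app_id in self._cache:
            self._cache.move_to_end(app_id)  # recently accessed stays cached
            return self._cache[app_id]
        # Cache miss: re-read the event log from spark.history.fs.logDirectory
        # and rebuild the page (the slow path described above).
        ui = f"rebuilt UI for {app_id}"
        self._cache[app_id] = ui
        if len(self._cache) > self.retained:
            self._cache.popitem(last=False)  # evict the least recently used app
        return ui

cache = AppUICache(retained=3)
for app in ["app-1", "app-2", "app-3", "app-4"]:
    cache.get(app)
print(list(cache._cache))  # ['app-2', 'app-3', 'app-4'] — app-1 was evicted
```

The summary page still lists all applications; only repeated access to an evicted one pays the cost of replaying its event log.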

See the official documentation for details: http://spark.apache.org/docs/latest/monitoring.html

 

 

 
