Dr-Elephant架构原理

文章目录

一、项目介绍
二、架构
三、模块原理

1.数据采集
2.诊断规则
3.优化建议

一、项目介绍

Dr.Elephant 由 LinkedIn 于 2016 年 4 月份开源，是一个 Hadoop 和 Spark 的性能监控和调优工具。Dr.Elephant 能自动化收集所有计算任务指标，进行数据分析，并以简单易用的方式进行呈现。Dr.Elephant 的目标是提高开发人员的开发效率和增加集群任务调试的高效性。

二、架构

Dr.Elephant的架构如下图：
Dr-Elephant架构原理

三、模块原理

1.数据采集

Job Generator: 任务采集

<property>
  <name>drelephant.analysis.thread.count</name>
  <value>3</value>
  <description>Number of threads to analyze the completed jobs 采集线程数</description>
</property>
<property>
  <name>drelephant.analysis.fetch.interval</name>
  <value>60000</value>
  <description>Interval between fetches in milliseconds 采集周期</description>
</property>

调用Yarn的restApi获取周期内完成的任务，和失败的任务

// Fetch all succeeded apps
URL succeededAppsURL = new URL(new URL("http://" + _resourceManagerAddress), String.format(
        "/ws/v1/cluster/apps?finalStatus=SUCCEEDED&finishedTimeBegin=%s&finishedTimeEnd=%s",
        String.valueOf(_lastTime + 1), String.valueOf(_currentTime)));
logger.info("The succeeded apps URL is " + succeededAppsURL);
List<AnalyticJob> succeededApps = readApps(succeededAppsURL);
appList.addAll(succeededApps);

// Fetch all failed apps
// state: Application Master State
// finalStatus: Status of the Application as reported by the Application Master
URL failedAppsURL = new URL(new URL("http://" + _resourceManagerAddress), String.format(
    "/ws/v1/cluster/apps?finalStatus=FAILED&state=FINISHED&finishedTimeBegin=%s&finishedTimeEnd=%s",
    String.valueOf(_lastTime + 1), String.valueOf(_currentTime)));
List<AnalyticJob> failedApps = readApps(failedAppsURL);
logger.info("The failed apps URL is " + failedAppsURL);
appList.addAll(failedApps);

MapReduceFetcher:

<fetcher>
  <applicationtype>mapreduce</applicationtype>
  <classname>com.linkedin.drelephant.mapreduce.fetchers.MapReduceFetcherHadoop2</classname>
</fetcher>

获取方式调用 restApi：

_jsonFactory = new JSONFactory();
_jhistoryWebAddr = "http://" + jhistoryAddr + "/jobhistory/job/";

SparkFetcher:解析hdfs路径tmp/spark-logs/logs

<fetcher>
    <applicationtype>spark</applicationtype>
    <classname>com.linkedin.drelephant.spark.fetchers.FSFetcher</classname>
    <params>
        <event_log_size_limit_in_mb>500</event_log_size_limit_in_mb>
        <event_log_location_uri>hdfs://test45cluster/tmp/spark-logs/logs</event_log_location_uri>
    </params>
</fetcher>

2.诊断规则

Severity(严重程度):
CRITICAL(非常严重) > SEVERE(严重) > MODERATE(中等) > LOW(低) > NONE（无）

MapReduce Rules分为6类：
Mapper Skew（Mapper倾斜）
原理：讲所有Mapper分为两部分，比较这两部分的task数据量，执行时间等。计算一个比率。根据比率确定一个严重级别。

Mapper GC（Mapper GC）
原理：GC时间占CPU时间的比例，根据比例给出一个严重级别

Mapper Time（Mapper耗时）
原理：通过mapper运行的时间分析mapper数量是否合适

Mapper Speed（Mapper效率）
原理：分析mapper运行效率

Mapper Spill（Mapper溢写）
原理：Avg spilled records per task/Avg output records per task比例确定一个严重程度

Mapper Memory（Mapper内存）
原理：task 消耗内存 / container 内存比例确定一个严重程度

Reduce Skew（Reduce倾斜）
Reduce GC（Reduce GC）
Reduce Time（Reduce耗时）
Reduce Memory（Reduce内存）
Shuffle&Sort（Shuffle排序）
原理：计算shuffle和sort过程占整个reduce执行时间的占比，根据对应比例估算一个严重级别。

Exception（异常）
Distributed Cache Limit（分布缓存限制）

Spark Rules:
分为
Spark Configuration
原理：spark配置参数

Spark Executor Metrics
原理：spark input/shuffle/storage 内存速率

Spark Job Metrics
原理：失败job/总job=比率

Spark Stage Metrics
原理：失败stage个数/总个数=比率

3.优化建议

每一个分析的指标对应一个通用的建议，不会对每一个任务生产定制化的优化建议
Dr-Elephant架构原理