Spark configuration on an EMR cluster for the MLlib KMeans algorithm (answers)

【Question title】: Spark configuration on EMR cluster for mllib kmeans algorithm
【Posted】: 2016-10-04 00:47:24
【Question】:

I am trying to run KMeans with Spark MLlib on a huge matrix of about 3,000,000 rows and 2048 columns. The matrix is about 76 GB in size. However, it is stored on S3 as file chunks.
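
As a rough sanity check on the sizes involved, here is a minimal sketch that computes the raw size of the matrix if it were held as dense 8-byte doubles. The element type is an assumption (the post does not state it), and the helper name `dense_size_bytes` is hypothetical. The result ignores JVM object overhead, so the footprint of a cached RDD would likely be larger.

```python
# Back-of-the-envelope size of the matrix as dense 8-byte doubles.
# Assumption: values are float64; the original post does not say.
ROWS = 3000000
COLS = 2048
BYTES_PER_DOUBLE = 8


def dense_size_bytes(rows, cols, bytes_per_value=BYTES_PER_DOUBLE):
    """Raw payload size of a dense rows x cols matrix, ignoring object overhead."""
    return rows * cols * bytes_per_value


size = dense_size_bytes(ROWS, COLS)
# Divide by a float so the result is fractional on Python 2.7 as well.
print("%.1f GiB" % (size / 2.0 ** 30))  # prints "45.8 GiB"
```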

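I am trying to set up Spark on EC2 instances through Amazon EMR. I have tried to configure it appropriately, but I get memory and disk errors when running KMeans on the Amazon cluster. Below is the Python script I use to create and configure the Amazon cluster.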

import boto3

def lambda_handler():
  client = boto3.client('emr', region_name='us-west-1')
  client.run_job_flow(
      Name='kmeans',
      ReleaseLabel='emr-4.6.0',
      Instances={
          'MasterInstanceType': 'c3.8xlarge',
          'SlaveInstanceType': 'c3.8xlarge',
          'InstanceCount': 10,
          'Ec2KeyName': 'spark',
          'KeepJobFlowAliveWhenNoSteps': True,
          'TerminationProtected': True
      },
      Steps=[
          {
              'Name': 'kmeans',
              'ActionOnFailure': 'CANCEL_AND_WAIT',
              'HadoopJarStep': {
                  'Jar': 'command-runner.jar',
                  'Args': [
                      'spark-submit',
                      '--driver-memory','55G',
                      '--executor-memory','18G',
                      '--executor-cores','1',
                      '--num-executors','30',
                      '/home/hadoop/process_data.py'
                  ]
              }
          },
      ],
      BootstrapActions=[
          {
              'Name': 'cluster_setup',
              'ScriptBootstrapAction': {
                  'Path': 's3://../setup.sh',
                  'Args': []
              }
          }
      ],
      Applications=[
          {
              'Name': 'Spark'
          },
      ],
      Configurations=[
          {
              "Classification": "spark-env",
              "Properties": {

              },
              "Configurations": [
                  {
                      "Classification": "export",
                      "Properties": {
                          "PYSPARK_PYTHON": "/usr/bin/python2.7",
                          "PYSPARK_DRIVER_PYTHON": "/usr/bin/python2.7"
                      },
                      "Configurations": [

                      ]
                  }
              ]
          },
          {
              "Classification": "spark-defaults",
              "Properties": {
                  "spark.akka.frameSize": "2047",
                  "spark.driver.maxResultSize": "0"
              }
          }
      ],
      VisibleToAllUsers=True,
      JobFlowRole='EMR_EC2_DefaultRole',
      ServiceRole='EMR_DefaultRole'
  )

if __name__=='__main__':
    lambda_handler()

I would appreciate a hint on how to set the following parameters for KMeans clustering on data of the size mentioned above:

'MasterInstanceType'
'SlaveInstanceType'
'InstanceCount'
--driver-memory
--executor-memory
--executor-cores
--num-executors
spark.akka.frameSize
spark.driver.maxResultSize

【Comments】:

Tags: apache-spark configuration k-means apache-spark-mllib amazon-emr

【Solution 1】:

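@Quantad Were you able to solve your configuration problem? I recently found myself working with K-means in Spark.ML. I had more than 20 GB of data, over 6 million rows and 200 columns. This is the configuration I used (I got it working in a Zeppelin environment, but something similar can be done through spark-submit).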

Master: m4.10xlarge
Worker: c3.8xlarge

(spark.executor.cores,4) #this could be increased to make use of all the cores
(spark.driver.memory,60g)
(spark.executor.extraJavaOptions,-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError='kill -9 %p')
(spark.shuffle.service.enabled,true)
(spark.master,yarn-client)
(spark.hadoop.yarn.timeline-service.enabled,false)
(spark.scheduler.mode,FAIR)
(spark.executor.memory,20g)
(spark.dynamicAllocation.enabled,true) #Spark 2.0.0, this takes care of the number of executors
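
To illustrate the "through spark-submit" route, here is a minimal sketch that turns a subset of the settings above into `--conf key=value` arguments, in the same list form as the `Args` of the EMR step in the question. The helper name `to_spark_submit_args` is hypothetical. The JVM options, `spark.master`, and `spark.driver.memory` are left out on purpose, the last one because the question already passes it as `--driver-memory`. The script path is the one from the question.

```python
# A subset of the settings from the answer, as (key, value) pairs.
SETTINGS = [
    ("spark.executor.cores", "4"),
    ("spark.executor.memory", "20g"),
    ("spark.shuffle.service.enabled", "true"),
    ("spark.scheduler.mode", "FAIR"),
    ("spark.dynamicAllocation.enabled", "true"),
]


def to_spark_submit_args(settings, script):
    """Build a spark-submit argument list with one --conf pair per setting."""
    args = ["spark-submit"]
    for key, value in settings:
        args += ["--conf", "%s=%s" % (key, value)]
    args.append(script)
    return args


args = to_spark_submit_args(SETTINGS, "/home/hadoop/process_data.py")
print(" ".join(args))
```

The resulting list could be used as the `Args` value of a `command-runner.jar` step. Whether these particular values suit a given data set still has to be checked against the cluster's memory.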

If you run into GC limit or Java heap memory problems, you may also want to tune

spark.yarn.executor.memoryOverhead (typically 10% of the executor memory)
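
The effect of the overhead on executor packing can be sketched as follows. This is a rough estimate under stated assumptions, not a description of what YARN will actually schedule: the overhead is taken as 10% of the executor memory with a 384 MB floor, the memory YARN can allocate per node (`NODE_MEMORY_MB`) is a hypothetical value, and YARN's rounding to its allocation increment is ignored. The function names are illustrative.

```python
def container_memory_mb(executor_memory_mb, overhead_fraction=0.10,
                        min_overhead_mb=384):
    """Executor heap plus off-heap overhead requested from YARN, in MB."""
    overhead = max(min_overhead_mb, int(executor_memory_mb * overhead_fraction))
    return executor_memory_mb + overhead


def executors_per_node(node_memory_mb, executor_memory_mb):
    """How many executor containers fit into one node's YARN memory."""
    return node_memory_mb // container_memory_mb(executor_memory_mb)


# Hypothetical: memory YARN may hand out on one worker node.
# The real value is lower than the instance's physical memory.
NODE_MEMORY_MB = 60 * 1024

for executor_gb in (18, 20):
    mb = executor_gb * 1024
    print("%dg executor -> %d MB container, %d per node"
          % (executor_gb, container_memory_mb(mb),
             executors_per_node(NODE_MEMORY_MB, mb)))
```

Under these assumptions an 18g executor needs a container of about 19.8 GB, so at most three fit on a node, and a 20g executor leaves room for only two. Multiplying by the number of worker nodes gives an upper bound to compare against `--num-executors`.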

Other references I have found useful in the past: http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-1/ and http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/

【Comments】: