【Question title】: Zeppelin Pyspark on HDP 2.3 giving error
【Posted】: 2015-10-27 03:34:34
【Question description】:

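I am trying to configure Zeppelin to work with HDP 2.3 (Spark 1.3). I installed Zeppelin successfully through Ambari, and the Zeppelin service is running.

However, when I try to run any %pyspark command, I get the error below.

I have read a few blog posts, and the problem seems to involve jars compiled on Java 6 versus Java 7 that are shared between Python and Spark.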

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 7, sandbox.hortonworks.com): org.apache.spark.SparkException: 
Error from python worker:
  /usr/bin/python: No module named pyspark
PYTHONPATH was:
  /opt/incubator-zeppelin/interpreter/spark/zeppelin-spark-0.6.0-incubating-SNAPSHOT.jar
java.io.EOFException
    at java.io.DataInputStream.readInt(DataInputStream.java:392)
    at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:163)
    at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:86)
    at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
    at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:105)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:64)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)

(<class 'py4j.protocol.Py4JJavaError'>, Py4JJavaError(u'An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.\n', JavaObject id=o68), <traceback object at 0x2618bd8>)
Took 0 seconds

【Comments on the question】:

    Tags: apache-spark pyspark hortonworks-data-platform apache-zeppelin


    【Solution 1】:

    Could you check whether your zeppelin-env.sh contains the following line?

    export PYTHONPATH=${SPARK_HOME}/python
    

    If it is missing, you can add it through Ambari under Zeppelin > Configs > Advanced zeppelin-env > zeppelin-env template.

    That said, if you installed Zeppelin with the latest version of the Ambari service for Zeppelin, it should already do this for you: https://github.com/hortonworks-gallery/ambari-zeppelin-service/blob/master/configuration/zeppelin-env.xml#L63
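
As a rough illustration of the setting discussed above, here is a minimal Python sketch that assembles a PYTHONPATH value from SPARK_HOME and prints it as an export line for zeppelin-env.sh. The helper name `build_pythonpath` and the fallback SPARK_HOME path are assumptions made for this example. The `lib` entries it looks for (pyspark.zip and a py4j source zip) mirror what a working setup in this thread reported, and they may be absent or named differently in other Spark builds.

```python
import glob
import os


def build_pythonpath(spark_home):
    """Return a PYTHONPATH string for PySpark under spark_home.

    Hypothetical helper for illustration; not part of Spark or Zeppelin.
    """
    python_dir = os.path.join(spark_home, "python")
    lib_dir = os.path.join(python_dir, "lib")
    entries = [python_dir]

    # pyspark.zip is only shipped by some Spark builds; add it if present.
    pyspark_zip = os.path.join(lib_dir, "pyspark.zip")
    if os.path.isfile(pyspark_zip):
        entries.append(pyspark_zip)

    # The py4j version differs between Spark releases, so match it by pattern.
    entries.extend(sorted(glob.glob(os.path.join(lib_dir, "py4j-*-src.zip"))))
    return os.pathsep.join(entries)


if __name__ == "__main__":
    # Assumed fallback location; adjust to your own installation.
    spark_home = os.environ.get("SPARK_HOME", "/usr/hdp/current/spark-client")
    print("export PYTHONPATH=" + build_pythonpath(spark_home))
```

Run it on the Zeppelin host and compare the printed line with what zeppelin-env.sh currently exports.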

    【Comments】:

    • Yes, I have this variable set. As you said, the installation takes care of it automatically. Thanks.
    【Solution 2】:

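    I just set up a fresh HDP 2.3 installation (2.3.0.0-2557) on CentOS 6.5 using Ambari 2.1, and installed Zeppelin with the Ambari Zeppelin service (default configuration). PySpark seems to work fine for me.

    Based on your error, it sounds like PYTHONPATH is not set to the correct value: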

    PYTHONPATH was:
      /opt/incubator-zeppelin/interpreter/spark/zeppelin-spark-0.6.0-incubating-SNAPSHOT.jar
    

    In Zeppelin, could you enter the following in a cell, run it, and post the output?

    System.getenv().get("MASTER")
    System.getenv().get("SPARK_YARN_JAR")
    System.getenv().get("HADOOP_CONF_DIR")
    System.getenv().get("JAVA_HOME")
    System.getenv().get("SPARK_HOME")
    System.getenv().get("PYSPARK_PYTHON")
    System.getenv().get("PYTHONPATH")
    System.getenv().get("ZEPPELIN_JAVA_OPTS")
    

    Here is the output from my setup:

    res41: String = yarn-client
    res42: String = hdfs:///apps/zeppelin/zeppelin-spark-0.6.0-SNAPSHOT.jar
    res43: String = /etc/hadoop/conf
    res44: String = /usr/java/default
    res45: String = /usr/hdp/current/spark-client/
    res46: String = null
    res47: String = /usr/hdp/current/spark-client//python:/usr/hdp/current/spark-client//python/lib/pyspark.zip:/usr/hdp/current/spark-client//python/lib/py4j-0.8.2.1-src.zip
    res48: String = -Dhdp.version=2.3.0.0-2557 -Dspark.executor.memory=512m -Dspark.yarn.queue=default
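
To make the comparison between the failing and the working PYTHONPATH concrete, here is a small Python sketch that reports which PySpark-related entries a given PYTHONPATH string lacks. The function name `missing_pyspark_entries` and the three entries it checks for are assumptions derived from the working output above, not an official list of requirements.

```python
import fnmatch
import os

# Entries observed in the working PYTHONPATH from this thread (an assumption,
# not an authoritative requirement list): label and basename pattern.
EXPECTED = [
    ("python source dir", "python"),
    ("pyspark.zip", "pyspark.zip"),
    ("py4j source zip", "py4j-*-src.zip"),
]


def missing_pyspark_entries(pythonpath):
    """Return labels of expected PySpark entries absent from pythonpath."""
    # Split on ':' and drop trailing slashes so basename() sees the last part.
    entries = [e.rstrip("/") for e in pythonpath.split(":") if e]
    names = [os.path.basename(e) for e in entries]
    return [
        label
        for label, pattern in EXPECTED
        if not any(fnmatch.fnmatchcase(name, pattern) for name in names)
    ]


if __name__ == "__main__":
    failing = ("/opt/incubator-zeppelin/interpreter/spark/"
               "zeppelin-spark-0.6.0-incubating-SNAPSHOT.jar")
    print(missing_pyspark_entries(failing))
```

For the PYTHONPATH shown in the question's error message, every expected entry is reported as missing, which is consistent with the `No module named pyspark` failure on the Python worker.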
    

    【Comments】:
