【发布时间】:2018-05-25 17:55:01
【问题描述】:
我是大数据技术的新手。我必须在 EMR 上以集群模式运行 spark 作业。这项工作是用 python 编写的,它依赖于几个库和一些其他工具。我已经编写了脚本并在本地客户端模式下运行它。但是当我尝试使用 yarn 运行它时会出现一些依赖问题。如何管理这些依赖关系?
日志:
"/mnt/yarn/usercache/hadoop/appcache/application_1511680510570_0144/container_1511680510570_0144_01_000002/pyspark.zip/pyspark/cloudpickle.py", line 711, in subimport
__import__(name)
ImportError: ('No module named boto3', <function subimport at 0x7f8c3c4f9c80>, ('boto3',))
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
【问题讨论】:
-
您能在此处提供错误消息吗?
-
我已经添加了错误信息
标签: pyspark hadoop-yarn emr