【Question title】: Why does PySpark not find spark-submit when creating a SparkSession?
【Posted】: 2021-09-21 04:02:39
【Question】:

I am trying to initialize a PySpark cluster from a Jupyter Notebook on my local machine, which runs Linux Mint. I am following this tutorial. When I try to create a SparkSession, I get an error saying that spark-submit does not exist. Strangely, I get the same error when I try to print the spark-shell version without sudo.

from pyspark.sql import SparkSession
spark1 = SparkSession.builder.appName('Test').getOrCreate()

FileNotFoundError: [Errno 2] No such file or directory: '~/Spark/spark-3.1.2-bin-hadoop3.2/./bin/spark-submit'

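The correct path to spark-submit is

'~/Spark/spark-3.1.2-bin-hadoop3.2/bin/spark-submit' (without the extra ./, but the path in the error should still resolve to the same file, right?)

I don't know where Spark gets this path from, so I don't know where to correct it.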
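
A minimal sketch of how a path like the one in the error could arise, assuming the launcher joins the value of the SPARK_HOME environment variable with the relative script path ./bin/spark-submit. That joining step is an assumption about PySpark's internals and is not confirmed in this thread; `build_spark_submit_path` is a hypothetical helper used only for illustration.

```python
import os


def build_spark_submit_path(spark_home: str) -> str:
    """Hypothetical helper: join SPARK_HOME with a relative script path."""
    return os.path.join(spark_home, "./bin/spark-submit")


# Literal value taken from the error message above (note the unexpanded "~").
spark_home = "~/Spark/spark-3.1.2-bin-hadoop3.2"

joined = build_spark_submit_path(spark_home)
print(joined)

# normpath collapses the "./" segment, so both spellings name the same file.
print(os.path.normpath(joined))
```

If this is what happens, the extra ./ is harmless on its own, and the place to look is whatever sets SPARK_HOME.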

As mentioned, without sudo I cannot even print the spark-shell version:

~/Spark/spark-3.1.2-bin-hadoop3.2/bin$ ./spark-shell --version
./spark-shell: line 60: ~/Spark/spark-3.1.2-bin-hadoop3.2/bin/spark-submit: No such file or directory

~/Spark/spark-3.1.2-bin-hadoop3.2/bin$ ls | grep spark-submit
spark-submit
spark-submit2.cmd
spark-submit.cmd

~/Spark/spark-3.1.2-bin-hadoop3.2/bin$ sudo ./spark-shell --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.1.2
      /_/
                        
Using Scala version 2.12.10, OpenJDK 64-Bit Server VM, 11.0.11
Branch HEAD
Compiled by user centos on 2021-05-24T04:27:48Z
Revision de351e30a90dd988b133b3d00fa6218bfcaba8b8
Url https://github.com/apache/spark
Type --help for more information.

I tried granting read, write, and execute permissions on all files in ~/Spark, but it had no effect. Could this be related to Java permissions?

My .bashrc looks like this:

export SPARK_HOME='~/Spark/spark-3.1.2-bin-hadoop3.2'
export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
export PYSPARK_PYTHON=python3
export PATH=$SPARK_HOME:$PATH:~/.local/bin:$JAVA_HOME/bin:$JAVA_HOME/jre/bin
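
One thing that may be worth checking, offered as an assumption rather than a confirmed diagnosis: the shell does not expand ~ inside quotes, so SPARK_HOME as exported above would hold a literal ~ character, which matches the path printed in the error. The sketch below uses only the Python standard library; `resolve_spark_home` and `has_literal_tilde` are hypothetical helper names.

```python
import os

# The value as written in the .bashrc above; single quotes keep "~" literal.
raw_spark_home = "~/Spark/spark-3.1.2-bin-hadoop3.2"


def has_literal_tilde(value: str) -> bool:
    """Return True when the value still starts with an unexpanded "~"."""
    return value.startswith("~")


def resolve_spark_home(raw: str) -> str:
    """Expand a leading "~" and return an absolute path."""
    return os.path.abspath(os.path.expanduser(raw))


print("raw value has literal tilde:", has_literal_tilde(raw_spark_home))

resolved = resolve_spark_home(raw_spark_home)
print("resolved SPARK_HOME:", resolved)

# Check whether spark-submit exists under the resolved directory.
submit = os.path.join(resolved, "bin", "spark-submit")
print("spark-submit found:", os.path.isfile(submit))
```

If the raw value does carry a literal tilde, writing the export as `export SPARK_HOME="$HOME/Spark/spark-3.1.2-bin-hadoop3.2"` lets the shell substitute the home directory before the value reaches Spark.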

I am using Python 3.8 and Apache Spark 3.1.2 pre-built for Hadoop 3.2. My Java version is OpenJDK 11.

Edit: after reinstalling (without modifying permissions), the files in ~/Spark/spark-3.1.2-bin-hadoop3.2/bin/ are:

$ ls -al ~/Spark/spark-3.1.2-bin-hadoop3.2/bin
total 124
drwxr-xr-x  2 squid squid  4096 May 23 21:45 .
drwxr-xr-x 13 squid squid  4096 May 23 21:45 ..
-rwxr-xr-x  1 squid squid  1089 May 23 21:45 beeline
-rw-r--r--  1 squid squid  1064 May 23 21:45 beeline.cmd
-rwxr-xr-x  1 squid squid 10965 May 23 21:45 docker-image-tool.sh
-rwxr-xr-x  1 squid squid  1935 May 23 21:45 find-spark-home
-rw-r--r--  1 squid squid  2685 May 23 21:45 find-spark-home.cmd
-rw-r--r--  1 squid squid  2337 May 23 21:45 load-spark-env.cmd
-rw-r--r--  1 squid squid  2435 May 23 21:45 load-spark-env.sh
-rwxr-xr-x  1 squid squid  2634 May 23 21:45 pyspark
-rw-r--r--  1 squid squid  1540 May 23 21:45 pyspark2.cmd
-rw-r--r--  1 squid squid  1170 May 23 21:45 pyspark.cmd
-rwxr-xr-x  1 squid squid  1030 May 23 21:45 run-example
-rw-r--r--  1 squid squid  1223 May 23 21:45 run-example.cmd
-rwxr-xr-x  1 squid squid  3539 May 23 21:45 spark-class
-rwxr-xr-x  1 squid squid  2812 May 23 21:45 spark-class2.cmd
-rw-r--r--  1 squid squid  1180 May 23 21:45 spark-class.cmd
-rwxr-xr-x  1 squid squid  1039 May 23 21:45 sparkR
-rw-r--r--  1 squid squid  1097 May 23 21:45 sparkR2.cmd
-rw-r--r--  1 squid squid  1168 May 23 21:45 sparkR.cmd
-rwxr-xr-x  1 squid squid  3122 May 23 21:45 spark-shell
-rw-r--r--  1 squid squid  1818 May 23 21:45 spark-shell2.cmd
-rw-r--r--  1 squid squid  1178 May 23 21:45 spark-shell.cmd
-rwxr-xr-x  1 squid squid  1065 May 23 21:45 spark-sql
-rw-r--r--  1 squid squid  1118 May 23 21:45 spark-sql2.cmd
-rw-r--r--  1 squid squid  1173 May 23 21:45 spark-sql.cmd
-rwxr-xr-x  1 squid squid  1040 May 23 21:45 spark-submit
-rw-r--r--  1 squid squid  1155 May 23 21:45 spark-submit2.cmd
-rw-r--r--  1 squid squid  1180 May 23 21:45 spark-submit.cmd

【Comments】:

  • Can you show the contents of ~/Spark/spark-3.1.2-bin-hadoop3.2/bin?
  • I mean: ls -al ~/Spark/spark-3.1.2-bin-hadoop3.2/bin

Tags: python apache-spark pyspark jupyter-notebook


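【Answer 1】:

Why does "squid" own all of these files? Could you set the user/group ownership to the user that runs these submissions? That is also the user whose .bashrc needs to define all of these environment variables.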
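
A small sketch for checking the ownership and permissions this answer asks about, using only the Python standard library. `describe` is a hypothetical helper, and the spark-submit path is the one quoted in the question; adjust it if your install lives elsewhere.

```python
import os
import pwd
import stat


def describe(path: str) -> dict:
    """Report the owner, mode string, and execute access for a path."""
    st = os.stat(path)
    try:
        owner = pwd.getpwuid(st.st_uid).pw_name
    except KeyError:
        # The uid has no passwd entry; fall back to the numeric id.
        owner = str(st.st_uid)
    return {
        "owner": owner,
        "mode": stat.filemode(st.st_mode),
        "executable_by_me": os.access(path, os.X_OK),
    }


if __name__ == "__main__":
    # Path taken from the question, with "~" expanded explicitly.
    target = os.path.expanduser(
        "~/Spark/spark-3.1.2-bin-hadoop3.2/bin/spark-submit"
    )
    if os.path.exists(target):
        print(describe(target))
    else:
        print("not found:", target)
```

If the reported owner is already the user who launches Jupyter and the file is executable, ownership is unlikely to be the cause.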

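【Comments】:

  • I don't understand. Why wouldn't I own them? I simply downloaded spark-3.1.2-bin-hadoop3.2.tgz from spark.apache.org/downloads.html and extracted it into this folder, so I am the owner.
  • Please change the ownership as described above and try running spark-submit.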