【Question title】: Notebook based on jupyter/all-spark-notebook docker image not picking up custom python version
【Posted】: 2021-07-13 02:34:38
【Question】:

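Summary

I am trying to execute this simple python code snippet in an all-spark-notebook, which is supposed to run against a local Spark cluster that I set up in this docker-compose file. However, I get the error ModuleNotFoundError: No module named 'pyspark', which does not make sense to me, because in this Dockerfile (which I took from the docker repos documentation) I explicitly install pyspark with pip.

Steps to reproduce the error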

# Clone the repository and checkout a specific commit
kevinsuedmersen@LAPTOP-XXX:~/dev/hadoop-sandbox$ git clone https://github.com/kevinsuedmersen/hadoop-sandbox.git
kevinsuedmersen@LAPTOP-XXX:~/dev/hadoop-sandbox$ git checkout e0a061dd3a60842aa0e93893892c7e0844c2278a

# Install and start all services
kevinsuedmersen@LAPTOP-XXX:~/dev/hadoop-sandbox$ docker-compose up -d

# Entering the container running the notebooks
kevinsuedmersen@LAPTOP-XXX:~/dev/hadoop-sandbox$ docker exec -it jupyter-spark bash

# Activating the custom python environment installed in the above referenced Dockerfile
(base) jovyan@XXX:~$ conda activate python37

# Start a jupyter notebook server
(python37) jovyan@XXX:~$ jupyter notebook

# After some logging, the following output shows
To access the notebook, open this file in a browser:
file:///home/jovyan/.local/share/jupyter/runtime/nbserver-27913-open.html
Or copy and paste one of these URLs:
http://b8ef36545270:8889/?token=some_token
or http://127.0.0.1:8889/?token=some_token

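Then I click the URL http://127.0.0.1:8889/?token=some_token to open the Jupyter GUI in the browser, execute the simple python code snippet, and get the error mentioned above.

What I tried

To check whether pyspark is actually installed, I simply tried executing the simple python code snippet in a shell inside the jupyter-spark container and, surprisingly, it worked. Specifically, I ran the following commands in a new shell: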

# Entering into the jupyter-spark container and activating the custom python environment
kevinsuedmersen@LAPTOP-XXX:~/dev/hadoop-sandbox$ docker exec -it jupyter-spark bash
(base) jovyan@XXX:~$ conda activate python37

# Opening a python shell
(python37) jovyan@XXX:~$ python

# Copy pasting the same commands from the notebook into the shell
>>> import pyspark
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.master('spark://spark-master:7077').getOrCreate()
>>> sc = spark.sparkContext
>>> rdd = sc.parallelize(range(100 + 1))
>>> rdd.sum()
5050

In addition, I noticed that executing the following in a notebook

! python --version

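prints Python 3.8.8, which is not the version of the python37 environment.

So my question is: how can I make the notebook use the custom python environment?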
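The symptoms above (the import works in the activated shell but fails in the notebook, and the notebook reports Python 3.8.8) suggest that the notebook kernel runs a different interpreter than the python37 environment. A minimal sketch for checking this from a notebook cell, using only the standard library; the helper name describe_kernel is hypothetical and not part of the original post:

```python
import importlib.util
import sys


def describe_kernel(module_name="pyspark"):
    """Report which interpreter runs this code and whether a module is importable from it.

    describe_kernel is a hypothetical helper for illustration only.
    """
    return {
        # Path of the Python binary that executes this cell
        "executable": sys.executable,
        # Version of that interpreter, e.g. "3.8.8"
        "version": "{}.{}.{}".format(*sys.version_info[:3]),
        # True if the module can be located on this interpreter's import path
        "module_found": importlib.util.find_spec(module_name) is not None,
    }


print(describe_kernel())
```

If the reported executable is not the one inside the python37 environment, the kernel is not using that environment, which would explain the ModuleNotFoundError.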

【Comments】:

    Tags: docker apache-spark hadoop pyspark jupyter-notebook


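    【Solution 1】:

    So, apparently, the following workaround works. Instead of creating a separate conda environment, it installs the desired Python version directly into the base environment, so a notebook started from that environment runs on it.

    1. Change the Dockerfile of the jupyter-spark service to something as simple as the following: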
    FROM jupyter/all-spark-notebook:584f43f06586
    
    ARG SPARK_VERSION
    ARG HADOOP_VERSION
    ARG SPARK_CHECKSUM
    ARG OPENJDK_VERSION
    ARG PYTHON_VERSION
    
    # Install a different version of python inside the base environment
    RUN conda install -y python=$PYTHON_VERSION
    
    # Install required pip packages, e.g. pyspark
    COPY requirements.txt /docker_build/requirements.txt
    RUN pip install -r /docker_build/requirements.txt
    
    2. The service definition in the docker-compose.yml file becomes:
    # Spark notebooks
      jupyter-spark:
        # To see all running servers in this container, execute 
        # `docker exec jupyter-spark jupyter notebook list`
        container_name: jupyter-spark
        build:
          context: jupyter-spark
          args: 
            - SPARK_VERSION=3.1.1
            - HADOOP_VERSION=3.2
            - SPARK_CHECKSUM=E90B31E58F6D95A42900BA4D288261D71F6C19FA39C1CB71862B792D1B5564941A320227F6AB0E09D946F16B8C1969ED2DEA2A369EC8F9D2D7099189234DE1BE
            - OPENJDK_VERSION=11
            # Make sure the python version in the driver (the notebooks) is the same as in spark-master,
            # spark-worker-1, and spark-worker-2
            - PYTHON_VERSION=3.7.10
        ports: 
          - 8888:8888
          - 8889:8889
          - 4040:4040
          - 4041:4041
        volumes:
          - ./jupyter-spark/work:/home/jovyan/work
        pid: host
        environment: 
          - TINI_SUBREAPER=true
        env_file: 
          - ./hadoop.env
        networks:
          - hadoop
    

    The current working state of the repository with the above changes can be found here
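The comment in the service definition asks for the same Python version in the driver (the notebooks) as in spark-master and the workers. As far as I know, PySpark only compares the major.minor part, so 3.7.10 and another 3.7.x release would still be accepted. A minimal sketch of such a check that could be run in a notebook cell after rebuilding the image; the helper names minor_version and driver_matches are hypothetical, and the expected version is simply the PYTHON_VERSION build argument from above:

```python
import sys


def minor_version(version):
    """Return (major, minor) from a version string such as '3.7.10'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def driver_matches(expected_version, driver_info=None):
    """Check whether the driver's major.minor equals the expected one.

    driver_info defaults to the interpreter running this code; it can be
    passed explicitly (e.g. (3, 8, 8)) to test the comparison itself.
    """
    if driver_info is None:
        driver_info = sys.version_info
    return tuple(driver_info[:2]) == minor_version(expected_version)


# Assumption: 3.7.10 is the PYTHON_VERSION build argument used for the cluster images
print(driver_matches("3.7.10"))
```

Run inside the rebuilt jupyter-spark container, this should print True if the base environment was switched to Python 3.7; the 3.8.8 interpreter from the question would give False.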
