【问题标题】:Unable to access environment variable in PySpark job submitted through Airflow on Google dataproc cluster无法访问通过 Google dataproc 集群上的 Airflow 提交的 PySpark 作业中的环境变量
【发布时间】:2018-12-30 22:48:12
【问题描述】:

我正在通过 Google Dataproc 集群上的 Airflow 运行 PySpark 作业。

此作业从 AWS S3 下载数据,并在处理后将其存储在 Google Cloud Storage 中。因此,为了让执行程序从 Google Dataproc 访问 S3 存储桶,我将 AWS 凭证存储在环境变量中(附加到 /etc/environment),同时通过初始化操作创建 dataproc 集群。

我使用 Boto3 获取凭据,然后设置 Spark 配置。

boto3_session = boto3.Session()
aws_credentials = boto3_session.get_credentials()
aws_credentials = aws_credentials.get_frozen_credentials()
aws_access_key = aws_credentials.access_key
aws_secret_key = aws_credentials.secret_key

spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", aws_secret_key)
spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", aws_access_key)

初始化动作文件:

#!/usr/bin/env bash

#This script installs required packages and configures the environment

wget https://bootstrap.pypa.io/get-pip.py
sudo python get-pip.py
sudo pip install boto3
sudo pip install google-cloud-storage


echo "AWS_ACCESS_KEY_ID=XXXXXXXXXXXXX" | sudo tee --append /etc/environment
echo "AWS_SECRET_ACCESS_KEY=xXXxXXXXXX" | sudo tee --append /etc/environment

source /etc/environment

但我收到以下错误:这意味着我的 Spark 进程无法从环境变量中获取配置。

18/07/19 22:02:16 INFO org.spark_project.jetty.util.log: Logging initialized @2351ms
18/07/19 22:02:16 INFO org.spark_project.jetty.server.Server: jetty-9.3.z-SNAPSHOT
18/07/19 22:02:16 INFO org.spark_project.jetty.server.Server: Started @2454ms
18/07/19 22:02:16 INFO org.spark_project.jetty.server.AbstractConnector: Started ServerConnector@75b67e54{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
18/07/19 22:02:16 INFO com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase: GHFS version: 1.6.7-hadoop2
18/07/19 22:02:17 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at cluster-1-m/10.164.0.2:8032
18/07/19 22:02:19 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl: Submitted application application_1532036330220_0004
ivysettings.xml file not found in HIVE_HOME or HIVE_CONF_DIR,/etc/hive/conf.dist/ivysettings.xml will be used
18/07/19 22:02:23 INFO DependencyResolver: ivysettings.xml file not found in HIVE_HOME or HIVE_CONF_DIR,/etc/hive/conf.dist/ivysettings.xml will be used
18/07/19 22:02:23 INFO hive.metastore: Trying to connect to metastore with URI thrift://cluster-1-m:9083
18/07/19 22:02:23 INFO hive.metastore: Connected to metastore.
18/07/19 22:02:24 INFO org.apache.hadoop.hive.ql.session.SessionState: Created local directory: /tmp/952f73b3-a59c-4a23-a04a-f05dc4e67d89_resources
18/07/19 22:02:24 INFO org.apache.hadoop.hive.ql.session.SessionState: Created HDFS directory: /tmp/hive/root/952f73b3-a59c-4a23-a04a-f05dc4e67d89
18/07/19 22:02:24 INFO org.apache.hadoop.hive.ql.session.SessionState: Created local directory: /tmp/root/952f73b3-a59c-4a23-a04a-f05dc4e67d89
18/07/19 22:02:24 INFO org.apache.hadoop.hive.ql.session.SessionState: Created HDFS directory: /tmp/hive/root/952f73b3-a59c-4a23-a04a-f05dc4e67d89/_tmp_space.db
18/07/19 22:02:24 INFO DependencyResolver: ivysettings.xml file not found in HIVE_HOME or HIVE_CONF_DIR,/etc/hive/conf.dist/ivysettings.xml will be used
18/07/19 22:02:24 INFO org.apache.hadoop.hive.ql.session.SessionState: Created local directory: /tmp/59c3fef5-6c9e-49c9-bf31-69634430e4e6_resources
18/07/19 22:02:24 INFO org.apache.hadoop.hive.ql.session.SessionState: Created HDFS directory: /tmp/hive/root/59c3fef5-6c9e-49c9-bf31-69634430e4e6
18/07/19 22:02:24 INFO org.apache.hadoop.hive.ql.session.SessionState: Created local directory: /tmp/root/59c3fef5-6c9e-49c9-bf31-69634430e4e6
18/07/19 22:02:24 INFO org.apache.hadoop.hive.ql.session.SessionState: Created HDFS directory: /tmp/hive/root/59c3fef5-6c9e-49c9-bf31-69634430e4e6/_tmp_space.db
Traceback (most recent call last):
  File "/tmp/get_search_query_logs_620dea04/download_elk_event_logs.py", line 237, in <module>
    aws_credentials = aws_credentials.get_frozen_credentials()
AttributeError: 'NoneType' object has no attribute 'get_frozen_credentials'
18/07/19 22:02:24 INFO org.spark_project.jetty.server.AbstractConnector: Stopped Spark@75b67e54{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}

当我在登录到 dataproc 节点后尝试手动提交作业时,Spark 作业正在验证凭据并且运行良好。

有人可以帮我解决这个问题吗?

【问题讨论】:

  • 如何将凭据添加到/etc/environment 以及何时添加?
  • 看起来仅仅为 Systemd 服务添加变量到/etc/environment 是不够的,您实际上需要获取这个文件(即明确地从中提取变量),这里有几个选项如何做到这一点:@ 987654321@
  • @IgorDvorzhak 我在初始化操作的集群创建阶段将环境变量添加到/etc/environment
  • @IgorDvorzhak:感谢您的帮助 :)

标签: apache-spark pyspark google-cloud-platform airflow google-cloud-dataproc


【解决方案1】:

在 linux 环境中玩了一段时间后,通过 boto3 会话从环境变量中获取 AWS 凭证对我来说没有任何效果。所以,按照boto3文档修改初始化动作脚本如下:

echo "[default]" | sudo tee --append /root/.aws/config
echo "aws_access_key_id = XXXXXXXX" | sudo tee --append /root/.aws/config
echo "aws_secret_access_key = xxxxxxxx" | sudo tee --append /root/.aws/config

通过boto3_session 访问AWS 凭证的方法有多种,其中一种是通过~/.aws/config

【讨论】:

    猜你喜欢
    • 2016-08-15
    • 1970-01-01
    • 2019-07-06
    • 2020-08-23
    • 1970-01-01
    • 2016-01-26
    • 1970-01-01
    • 1970-01-01
    • 1970-01-01
    相关资源
    最近更新 更多