Process Data in Google Storage on an AWS EMR Cluster in Spark: Answers

【Question Title】: Process Data in Google Storage on an AWS EMR Cluster in Spark
【Posted】: 2021-01-16 08:09:08
【Question Description】:

How can I use Spark on an AWS EMR cluster to process data stored in Google Storage?

Suppose I have some data stored at gs://my-bucket/my-parquet-data. How can I read it from my EMR cluster without first copying the data to S3 or downloading it to local storage?

【Question Comments】:

Tags: apache-spark hadoop amazon-s3 google-cloud-storage amazon-emr

【Solution 1】:

First, obtain Google HMAC credentials that grant access to the GS bucket/objects you want to process.

Then use the S3A file system (already bundled with the AWS Hadoop distribution) with the following Hadoop configuration values:

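// Hadoop configuration of the running Spark session
val conf = spark.sparkContext.hadoopConfiguration
// The HMAC key and secret take the place of the AWS access key and secret key
conf.set("fs.s3a.access.key", "<hmac key>")
conf.set("fs.s3a.secret.key", "<hmac secret>")
// Address buckets by path (endpoint/bucket) rather than as a subdomain
conf.setBoolean("fs.s3a.path.style.access", true)
// Send S3A requests to Google Storage instead of AWS S3
conf.set("fs.s3a.endpoint", "storage.googleapis.com")
// Use version 1 of the S3 list API
conf.setInt("fs.s3a.list.version", 1)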
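The same settings can probably also be supplied before the session starts, since Spark copies any property prefixed with `spark.hadoop.` into the Hadoop configuration. The following is only a sketch of that idea: `GcsS3aSettings`, `hadoopOptions`, and `sparkOptions` are hypothetical names chosen for illustration, not part of Spark or Hadoop.

```scala
// Sketch only: the object and method names are hypothetical helpers,
// not part of Spark or Hadoop.
object GcsS3aSettings {

  // The same five S3A settings as in the answer, as plain key/value pairs.
  def hadoopOptions(hmacKey: String, hmacSecret: String): Map[String, String] = Map(
    "fs.s3a.access.key"        -> hmacKey,
    "fs.s3a.secret.key"        -> hmacSecret,
    "fs.s3a.path.style.access" -> "true",
    "fs.s3a.endpoint"          -> "storage.googleapis.com",
    "fs.s3a.list.version"      -> "1"
  )

  // Prefix each key with "spark.hadoop." so the pairs can be passed
  // as ordinary Spark properties.
  def sparkOptions(hmacKey: String, hmacSecret: String): Map[String, String] =
    hadoopOptions(hmacKey, hmacSecret).map { case (k, v) => s"spark.hadoop.$k" -> v }
}
```

With Spark on the classpath, each pair could be applied through `SparkSession.builder().config(key, value)` or passed to `spark-submit` as a `--conf key=value` flag. Note that these settings apply to every s3a:// path in the job, so real S3 buckets would need to be read through a different scheme or configuration.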

You can then access Google Storage through s3a paths, as follows:

spark.read.parquet("s3a://<google storage bucket name>/<path>")
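If existing job definitions refer to data by gs:// URI, as the question does, the scheme has to be swapped before the path is handed to the S3A file system. A minimal sketch of that rewrite, assuming the bucket name and object path carry over unchanged; `GsPaths` and `toS3a` are hypothetical names, not an existing API.

```scala
import java.net.URI

// Sketch only: GsPaths and toS3a are hypothetical helpers for illustration.
object GsPaths {

  // Rewrite gs://bucket/path to s3a://bucket/path, rejecting any other scheme.
  def toS3a(path: String): String = {
    val uri = new URI(path)
    require(uri.getScheme == "gs", s"expected a gs:// path, got: $path")
    // getAuthority is used instead of getHost because getHost returns null
    // for bucket names that are not valid hostnames (e.g. with underscores).
    val bucket = uri.getAuthority
    val objectPath = Option(uri.getRawPath).getOrElse("")
    s"s3a://$bucket$objectPath"
  }
}
```

For the path in the question, `GsPaths.toS3a("gs://my-bucket/my-parquet-data")` yields a string that can be passed to `spark.read.parquet`.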

【Discussion】:

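Never knew you could do that. Nice. I would have just suggested "drop the GCS JAR onto the cluster and use gs:// links instead".
As far as I can tell, the official Hadoop connector for GCS does not support HMAC credentials; it seems you need service account credentials. It looks like they expect you to use only the S3-compatible API when working with HMAC credentials. But maybe you can make more sense of this documentation than I can: github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/…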