[Posted]: 2018-08-21 14:03:21
[Question]:
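I would like to write Scala in my local IDE and then deploy it to AWS Glue as part of a build process. However, I cannot find the libraries required to build the GlueApp skeleton that AWS generates.
aws-java-sdk-glue does not contain the imported classes, and I cannot find these libraries anywhere else. They must exist somewhere, though perhaps they are just a Java/Scala port of this library: aws-glue-libs
The template Scala code from AWS: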
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.MappingSpec
import com.amazonaws.services.glue.errors.CallSite
import com.amazonaws.services.glue.util.GlueArgParser
import com.amazonaws.services.glue.util.Job
import com.amazonaws.services.glue.util.JsonOptions
import org.apache.spark.SparkContext
import scala.collection.JavaConverters._
object GlueApp {
  def main(sysArgs: Array[String]) {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    // @params: [JOB_NAME]
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
    Job.init(args("JOB_NAME"), glueContext, args.asJava)
    // @type: DataSource
    // @args: [database = "raw-tickers-oregon", table_name = "spark_delivery_2_1", transformation_ctx = "datasource0"]
    // @return: datasource0
    // @inputs: []
    val datasource0 = glueContext.getCatalogSource(database = "raw-tickers-oregon", tableName = "spark_delivery_2_1", redshiftTmpDir = "", transformationContext = "datasource0").getDynamicFrame()
    // @type: ApplyMapping
    // @args: [mapping = [("exchangeid", "int", "exchangeid", "int"), ("data", "struct", "data", "struct")], transformation_ctx = "applymapping1"]
    // @return: applymapping1
    // @inputs: [frame = datasource0]
    val applymapping1 = datasource0.applyMapping(mappings = Seq(("exchangeid", "int", "exchangeid", "int"), ("data", "struct", "data", "struct")), caseSensitive = false, transformationContext = "applymapping1")
    // @type: DataSink
    // @args: [connection_type = "s3", connection_options = {"path": "s3://spark-ticker-oregon/target", "compression": "gzip"}, format = "json", transformation_ctx = "datasink2"]
    // @return: datasink2
    // @inputs: [frame = applymapping1]
    val datasink2 = glueContext.getSinkWithFormat(connectionType = "s3", options = JsonOptions("""{"path": "s3://spark-ticker-oregon/target", "compression": "gzip"}"""), transformationContext = "datasink2", format = "json").writeDynamicFrame(applymapping1)
    Job.commit()
  }
}
And the build.sbt I have started putting together for a local build:
name := "aws-glue-scala"
version := "0.1"
scalaVersion := "2.11.12"
updateOptions := updateOptions.value.withCachedResolution(true)
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.2.1"
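The documentation for the AWS Glue Scala API seems to outline functionality similar to what is available in the AWS Glue Python library. So perhaps all that is needed is to download and build the PySpark AWS Glue library and add it to the classpath? That might be possible, since the Glue Python library uses Py4J.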
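To make the "add it to the classpath" idea concrete, here is a rough sketch of how the build.sbt might look if a JAR containing the com.amazonaws.services.glue classes could be obtained. That JAR is an assumption (I have not found one yet), as is the choice of a lib/ directory to hold it. The "provided" scope reflects my understanding that the Glue runtime supplies Spark itself.

```scala
// build.sbt (sketch only): assumes a JAR with the Glue Scala classes
// has somehow been obtained and copied into ./lib (hypothetical).
name := "aws-glue-scala"
version := "0.1"
scalaVersion := "2.11.12"

// Compile against Spark without bundling it into the deployed artifact.
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.2.1" % "provided"

// sbt puts JARs found under lib/ on the classpath by default;
// this setting only makes that location explicit.
unmanagedBase := baseDirectory.value / "lib"
```

With something like this in place, the imports in the template would resolve only if the JAR in lib/ really contains those packages, which is the part I cannot confirm.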
[Comments]:
- There is an open issue for this here: github.com/awslabs/aws-glue-libs/issues/15
Tags: scala pyspark sbt aws-glue