[Question title]: spark-cassandra-connector on EMR Serverless (PySpark)
[Posted]: 2022-08-09 22:26:52
[Question]:

I'm trying to get an application running on EMR Serverless, but I'm having trouble pulling in the spark-cassandra-connector. It works fine locally, but every attempt to use the library on EMR Serverless has failed.

When I include the library with --jars s3://XXX/XXXX/spark-cassandra-connector-driver_2.12-3.2.0.jar, I get an error on the following lines:

d = spark \
    .read \
    .format("org.apache.spark.sql.cassandra") \
    .options(table="YYYY", keyspace="YYY") \
    .load()

The error is:

py4j.protocol.Py4JJavaError: An error occurred while calling o121.load.
: java.lang.ClassNotFoundException: 
Failed to find data source: org.apache.spark.sql.cassandra. Please find packages at
http://spark.apache.org/third-party-projects.html

When I instead try to add the package with --packages com.datastax.spark:spark-cassandra-connector_2.12:3.2.0, the application times out:

com.datastax.spark#spark-cassandra-connector_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-5ee06249-c545-4b92-804f-ecedd322158a;1.0
    confs: [default]
:: resolution report :: resolve 524554ms :: artifacts dl 0ms
    :: modules in use:
    ---------------------------------------------------------------------
    |                  |            modules            ||   artifacts   |
    |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
    ---------------------------------------------------------------------
    |      default     |   1   |   0   |   0   |   0   ||   0   |   0   |
    ---------------------------------------------------------------------

:: problems summary ::
:::: WARNINGS
        module not found: com.datastax.spark#spark-cassandra-connector_2.12;3.2.0

    ==== local-m2-cache: tried

      file:/home/hadoop/.m2/repository/com/datastax/spark/spark-cassandra-connector_2.12/3.2.0/spark-cassandra-connector_2.12-3.2.0.pom

      -- artifact com.datastax.spark#spark-cassandra-connector_2.12;3.2.0!spark-cassandra-connector_2.12.jar:

      file:/home/hadoop/.m2/repository/com/datastax/spark/spark-cassandra-connector_2.12/3.2.0/spark-cassandra-connector_2.12-3.2.0.jar

    ==== local-ivy-cache: tried

      /home/hadoop/.ivy2/local/com.datastax.spark/spark-cassandra-connector_2.12/3.2.0/ivys/ivy.xml

      -- artifact com.datastax.spark#spark-cassandra-connector_2.12;3.2.0!spark-cassandra-connector_2.12.jar:

      /home/hadoop/.ivy2/local/com.datastax.spark/spark-cassandra-connector_2.12/3.2.0/jars/spark-cassandra-connector_2.12.jar

    ==== central: tried

      https://repo1.maven.org/maven2/com/datastax/spark/spark-cassandra-connector_2.12/3.2.0/spark-cassandra-connector_2.12-3.2.0.pom

      -- artifact com.datastax.spark#spark-cassandra-connector_2.12;3.2.0!spark-cassandra-connector_2.12.jar:

      https://repo1.maven.org/maven2/com/datastax/spark/spark-cassandra-connector_2.12/3.2.0/spark-cassandra-connector_2.12-3.2.0.jar

    ==== spark-packages: tried

      https://repos.spark-packages.org/com/datastax/spark/spark-cassandra-connector_2.12/3.2.0/spark-cassandra-connector_2.12-3.2.0.pom

      -- artifact com.datastax.spark#spark-cassandra-connector_2.12;3.2.0!spark-cassandra-connector_2.12.jar:

      https://repos.spark-packages.org/com/datastax/spark/spark-cassandra-connector_2.12/3.2.0/spark-cassandra-connector_2.12-3.2.0.jar

        ::::::::::::::::::::::::::::::::::::::::::::::

        ::          UNRESOLVED DEPENDENCIES         ::

        ::::::::::::::::::::::::::::::::::::::::::::::

        :: com.datastax.spark#spark-cassandra-connector_2.12;3.2.0: not found

        ::::::::::::::::::::::::::::::::::::::::::::::


:::: ERRORS
    Server access error at url https://repo1.maven.org/maven2/com/datastax/spark/spark-cassandra-connector_2.12/3.2.0/spark-cassandra-connector_2.12-3.2.0.pom (java.net.ConnectException: Connection timed out (Connection timed out))

    Server access error at url https://repo1.maven.org/maven2/com/datastax/spark/spark-cassandra-connector_2.12/3.2.0/spark-cassandra-connector_2.12-3.2.0.jar (java.net.ConnectException: Connection timed out (Connection timed out))

    Server access error at url https://repos.spark-packages.org/com/datastax/spark/spark-cassandra-connector_2.12/3.2.0/spark-cassandra-connector_2.12-3.2.0.pom (java.net.ConnectException: Connection timed out (Connection timed out))

    Server access error at url https://repos.spark-packages.org/com/datastax/spark/spark-cassandra-connector_2.12/3.2.0/spark-cassandra-connector_2.12-3.2.0.jar (java.net.ConnectException: Connection timed out (Connection timed out))

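My guess is that the --packages problem comes from a firewall configuration issue, but I don't see any way to open up access. As for the --jars problem, I'm not sure why bringing in the .jar isn't enough for Spark to recognize the org.apache.spark.sql.cassandra format.

Any help with either issue would be greatly appreciated, thanks!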

    Tags: apache-spark pyspark cassandra amazon-emr spark-cassandra-connector


    [Answer 1]:

    Yes, the problem with --packages is most likely that your egress settings block access to Maven Central.

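    To use --jars, you need to specify every required jar: the driver, the connector, the Java driver, and so on. The easiest way to avoid this is to use the so-called assembly build, which is also available on Maven Central. Just download the referenced jar file.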
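As a rough illustration of the --jars approach above, here is a minimal Python sketch that assembles a spark-submit parameter string pointing at a single assembly jar. The helper `build_spark_submit_params`, the S3 bucket and prefix, the Cassandra host name, and the assembly jar file name are all assumptions made for this example; adjust them to whatever you actually downloaded and uploaded. `spark.cassandra.connection.host` is the connector's setting for the contact point.

```python
from typing import Dict, List, Optional


def build_spark_submit_params(
    jar_uris: List[str], conf: Optional[Dict[str, str]] = None
) -> str:
    """Build a spark-submit parameter string with --jars and --conf entries."""
    if not jar_uris:
        raise ValueError("at least one jar URI is required")
    # --jars takes a single comma-separated list of jar locations.
    parts = ["--jars", ",".join(jar_uris)]
    # Sort the settings so the output is deterministic.
    for key, value in sorted((conf or {}).items()):
        parts += ["--conf", f"{key}={value}"]
    return " ".join(parts)


# Hypothetical S3 location: bucket, prefix, and file name are placeholders.
ASSEMBLY_JAR = (
    "s3://my-bucket/jars/spark-cassandra-connector-assembly_2.12-3.2.0.jar"
)

params = build_spark_submit_params(
    [ASSEMBLY_JAR],
    # Hypothetical host name for the Cassandra contact point.
    {"spark.cassandra.connection.host": "cassandra.example.internal"},
)
print(params)
```

The resulting string is the kind of value you would pass as the job's spark-submit parameters. With one assembly jar there is no need to list the connector, its driver module, and the Java driver separately.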

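    [Comments]:

    • Using the assembly instead of just the driver solved my problem, thank you very much!
    • Note that --packages is only supported on emr-6.7.0 and later, and you need to configure the serverless application with a VPC that has outbound access. If you need to build a custom uber-jar with your dependencies, see how to do it with Docker at github.com/aws-samples/emr-serverless-samples/tree/main/…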
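The version constraint from the comment above can be checked before submitting a job. The sketch below is only an illustration: the helper names `parse_release_label` and `supports_packages` are made up for this example, and the 6.7.0 threshold is taken from the comment, not verified independently. It does not check whether the application's VPC has outbound access.

```python
import re
from typing import Tuple

# Minimum release for --packages, as stated in the comment above (assumption).
MIN_PACKAGES_RELEASE = (6, 7, 0)


def parse_release_label(label: str) -> Tuple[int, ...]:
    """Turn a label such as 'emr-6.7.0' into a tuple of ints, e.g. (6, 7, 0)."""
    match = re.fullmatch(r"emr-(\d+)\.(\d+)\.(\d+)", label)
    if match is None:
        raise ValueError(f"unrecognized release label: {label!r}")
    return tuple(int(part) for part in match.groups())


def supports_packages(label: str) -> bool:
    """Return True if the release is at or above the --packages threshold."""
    # Tuples compare element by element, so 6.10.0 sorts after 6.7.0.
    return parse_release_label(label) >= MIN_PACKAGES_RELEASE


for release in ("emr-6.6.0", "emr-6.7.0", "emr-6.10.0"):
    print(release, supports_packages(release))
```

Comparing parsed integer tuples avoids the pitfall of comparing version strings, where "6.10.0" would sort before "6.7.0".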