【Question title】: Apache Spark 2.3.1 with Hive metastore 3.1.0
【Posted】: 2019-03-31 08:00:12
【Question】:

We upgraded our HDP cluster to 3.1.1.3.0.1.0-187 and found that:

  1. Hive has a new metastore location
  2. Spark cannot see the Hive databases

In fact, we see:

org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database ... not found

Can you help me understand what happened and how to fix this?

Update:

Configuration:

(spark.sql.warehouse.dir,/warehouse/tablespace/external/hive/)
(spark.admin.acls,)
(spark.yarn.dist.files,file:///opt/folder/config.yml,file:///opt/jdk1.8.0_172/jre/lib/security/cacerts)
(spark.history.kerberos.keytab,/etc/security/keytabs/spark.service.keytab)
(spark.io.compression.lz4.blockSize,128kb)
(spark.executor.extraJavaOptions,-Djavax.net.ssl.trustStore=cacerts)
(spark.history.fs.logDirectory,hdfs:///spark2-history/)
(spark.io.encryption.keygen.algorithm,HmacSHA1)
(spark.sql.autoBroadcastJoinThreshold,26214400)
(spark.eventLog.enabled,true)
(spark.shuffle.service.enabled,true)
(spark.driver.extraLibraryPath,/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64)
(spark.ssl.keyStore,/etc/security/serverKeys/server-keystore.jks)
(spark.yarn.queue,default)
(spark.jars,file:/opt/folder/component-assembly-0.1.0-SNAPSHOT.jar)
(spark.ssl.enabled,true)
(spark.sql.orc.filterPushdown,true)
(spark.shuffle.unsafe.file.output.buffer,5m)
(spark.yarn.historyServer.address,master2.env.project:18481)
(spark.ssl.trustStore,/etc/security/clientKeys/all.jks)
(spark.app.name,com.company.env.component.MyClass)
(spark.sql.hive.metastore.jars,/usr/hdp/current/spark2-client/standalone-metastore/*)
(spark.io.encryption.keySizeBits,128)
(spark.driver.memory,2g)
(spark.executor.instances,10)
(spark.history.kerberos.principal,spark/edge.env.project@ENV.PROJECT)
(spark.unsafe.sorter.spill.reader.buffer.size,1m)
(spark.ssl.keyPassword,*********(redacted))
(spark.ssl.keyStorePassword,*********(redacted))
(spark.history.fs.cleaner.enabled,true)
(spark.shuffle.io.serverThreads,128)
(spark.sql.hive.convertMetastoreOrc,true)
(spark.submit.deployMode,client)
(spark.sql.orc.char.enabled,true)
(spark.master,yarn)
(spark.authenticate.enableSaslEncryption,true)
(spark.history.fs.cleaner.interval,7d)
(spark.authenticate,true)
(spark.history.fs.cleaner.maxAge,90d)
(spark.history.ui.acls.enable,true)
(spark.acls.enable,true)
(spark.history.provider,org.apache.spark.deploy.history.FsHistoryProvider)
(spark.executor.extraLibraryPath,/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64)
(spark.executor.memory,2g)
(spark.io.encryption.enabled,true)
(spark.shuffle.file.buffer,1m)
(spark.eventLog.dir,hdfs:///spark2-history/)
(spark.ssl.protocol,TLS)
(spark.dynamicAllocation.enabled,true)
(spark.executor.cores,3)
(spark.history.ui.port,18081)
(spark.sql.statistics.fallBackToHdfs,true)
(spark.repl.local.jars,file:///opt/folder/postgresql-42.2.2.jar,file:///opt/folder/ojdbc6.jar)
(spark.ssl.trustStorePassword,*********(redacted))
(spark.history.ui.admin.acls,)
(spark.history.kerberos.enabled,true)
(spark.shuffle.io.backLog,8192)
(spark.sql.orc.impl,native)
(spark.ssl.enabledAlgorithms,TLS_RSA_WITH_AES_128_CBC_SHA,TLS_RSA_WITH_AES_256_CBC_SHA)
(spark.sql.orc.enabled,true)
(spark.yarn.dist.jars,file:///opt/folder/postgresql-42.2.2.jar,file:///opt/folder/ojdbc6.jar)
(spark.sql.hive.metastore.version,3.0)

From hive-site.xml:

<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/warehouse/tablespace/managed/hive</value>
</property>

The code looks like this:

val spark = SparkSession
  .builder()
  .appName(getClass.getSimpleName)
  .enableHiveSupport()
  .getOrCreate()
...
dataFrame.write
  .format("orc")
  .options(Map("spark.sql.hive.convertMetastoreOrc" -> true.toString))
  .mode(SaveMode.Append)
  .saveAsTable("name")

spark-submit:

    --master yarn \
    --deploy-mode client \
    --driver-memory 2g \
    --driver-cores 4 \
    --executor-memory 2g \
    --num-executors 10 \
    --executor-cores 3 \
    --conf "spark.dynamicAllocation.enabled=true" \
    --conf "spark.shuffle.service.enabled=true" \
    --conf "spark.executor.extraJavaOptions=-Djavax.net.ssl.trustStore=cacerts" \
    --conf "spark.sql.warehouse.dir=/warehouse/tablespace/external/hive/" \
    --jars postgresql-42.2.2.jar,ojdbc6.jar \
    --files config.yml,/opt/jdk1.8.0_172/jre/lib/security/cacerts \
    --verbose \
    component-assembly-0.1.0-SNAPSHOT.jar \

【Comments】:

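  • Can you try passing the location of hive-site.xml to spark-submit with the --files option?
  • Can you check the value of spark.sql.warehouse.dir, and perhaps hive.metastore.warehouse.dir as well? Could you include the Environment tab from the web UI in the question? You can always put hive-site.xml on the CLASSPATH to point to the directory.
  • By the way, I can't seem to find this HDP version at docs.hortonworks.com. The latest seems to be HDP-3.0.1. I'm a bit confused.
  • Thanks for the quick responses, guys. Jacek, it is this version: repo.hortonworks.com/content/repositories/releases/org/apache/…
  • How do you access the Hive tables? Can you show the exact query (e.g. spark.read...)? What is the Hive warehouse directory? Can you check all environment variables related to HADOOP_, YARN_, or HIVE_?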
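
One way to run the check suggested in these comments is to read the same properties from the hive-site.xml that Spark picks up and from the one Hive uses, then compare them. The sketch below is only illustrative: `get_prop` is a hypothetical helper, it assumes each `<name>` is followed by its `<value>` as in the hive-site.xml excerpt in the question, both directories are assumed HDP default locations, and `metastore.catalog.default` is listed only as a property that may be worth comparing, not as a confirmed cause.

```shell
#!/bin/sh
# get_prop FILE NAME: print the <value> that follows <name>NAME</name>.
# Hypothetical helper; it only handles the simple layout shown in the question.
get_prop() {
  awk -v name="$2" '
    index($0, "<name>" name "</name>") { found = 1 }
    found && /<value>/ {
      sub(/.*<value>/, "")
      sub(/<\/value>.*/, "")
      print
      exit
    }
  ' "$1"
}

# Assumed HDP default conf directories; adjust them to your installation.
for dir in /usr/hdp/current/spark2-client/conf /usr/hdp/current/hive-client/conf; do
  file="$dir/hive-site.xml"
  if [ ! -f "$file" ]; then
    echo "$file: not found"
    continue
  fi
  # Property names to compare; the last one is an assumption to verify.
  for prop in hive.metastore.warehouse.dir hive.metastore.uris metastore.catalog.default; do
    echo "$file: $prop=$(get_prop "$file" "$prop")"
  done
done
```

If the two files report different values for the same property, Spark and Hive are not looking at the same place, which would be consistent with the NoSuchDatabaseException above.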

Tags: apache-spark hive apache-spark-sql hive-metastore hdp


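【Solution 1】:

It looks like this is a Spark feature that has not been implemented yet. The only way I have found to use Spark with Hive since 3.0 is the HiveWarehouseConnector from Hortonworks. Hortonworks provides documentation for it, and there is also a good guide from the Hortonworks community. I will leave the question unanswered until the Spark developers have prepared a solution of their own.

【Comments】: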
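
As a rough illustration of what using the connector involves, the sketch below wraps a spark-shell launch that puts the connector jar on the classpath and passes two connection settings. Everything here is an assumption to check against the Hortonworks documentation for your HDP version: `spark_shell_with_hwc` and the `SPARK_SHELL` override are hypothetical names, the property names are the ones I believe the connector reads, and `HWC_JAR`, `HS2_JDBC_URL` and `METASTORE_URI` are placeholders you must fill in from your own cluster.

```shell
#!/bin/sh
# Launch spark-shell with the HiveWarehouseConnector on the classpath.
# Hypothetical wrapper; set SPARK_SHELL=echo to print the arguments only.
spark_shell_with_hwc() {
  # Abort with a message if a required placeholder is unset.
  : "${HWC_JAR:?path to the HiveWarehouseConnector assembly jar}"
  : "${HS2_JDBC_URL:?HiveServer2 JDBC URL}"
  : "${METASTORE_URI:?Hive metastore thrift URI}"
  "${SPARK_SHELL:-spark-shell}" \
    --jars "$HWC_JAR" \
    --conf "spark.sql.hive.hiveserver2.jdbc.url=$HS2_JDBC_URL" \
    --conf "spark.datasource.hive.warehouse.metastoreUri=$METASTORE_URI" \
    "$@"
}
```

Inside such a session, Hive tables are read and written through the connector's own session object rather than through the plain Spark catalog, so existing `saveAsTable` calls would likely need to change.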

【Solution 2】:

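I have a bit of a throwback trick for this one, with a disclaimer: it bypasses the Ranger permissions (don't blame me if you incur the wrath of an admin).

To use it with spark-shell: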

    export HIVE_CONF_DIR=/usr/hdp/current/hive-client/conf
    spark-shell --conf "spark.driver.extraClassPath=/usr/hdp/current/hive-client/conf"
    

To use it with sparklyr:

    Sys.setenv(HIVE_CONF_DIR="/usr/hdp/current/hive-client/conf")
    conf = spark_config()
    conf$'sparklyr.shell.driver-class-path' = '/usr/hdp/current/hive-client/conf'
    

It should also work for the Thrift server, but I have not tested it.

【Comments】:
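
The question launches its job with spark-submit rather than spark-shell, so here is a hedged sketch of the same trick for that case. It is untested and carries the same Ranger caveat. `--driver-class-path` is the spark-submit counterpart of `spark.driver.extraClassPath`; `submit_with_hive_conf` and the `SPARK_SUBMIT` override are hypothetical names introduced only for this example.

```shell
#!/bin/sh
# Point Spark at the Hive client configuration, as in the spark-shell variant.
export HIVE_CONF_DIR=/usr/hdp/current/hive-client/conf

# Hypothetical wrapper: "$@" is the application jar plus any extra options.
# Set SPARK_SUBMIT=echo to print the arguments instead of submitting.
submit_with_hive_conf() {
  "${SPARK_SUBMIT:-spark-submit}" \
    --master yarn \
    --deploy-mode client \
    --driver-class-path "$HIVE_CONF_DIR" \
    "$@"
}

# Example call (not executed here):
# submit_with_hive_conf --jars postgresql-42.2.2.jar,ojdbc6.jar component-assembly-0.1.0-SNAPSHOT.jar
```

The class path is passed on the command line because, in client mode, the driver JVM has already started by the time application code could set it.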
