在 pyspark 中查询 HIVE 表答案

【问题标题】：Query HIVE table in pyspark在 pyspark 中查询 HIVE 表
【发布时间】：2016-07-03 05:49:32
【问题描述】：

我正在使用 CDH5.5

我在 HIVE 默认数据库中创建了一个表，并且能够从 HIVE 命令中查询它。

输出

hive> use default;

OK

Time taken: 0.582 seconds


hive> show tables;

OK

bank
Time taken: 0.341 seconds, Fetched: 1 row(s)

hive> select count(*) from bank;

OK

542

Time taken: 64.961 seconds, Fetched: 1 row(s)

但是，我无法从 pyspark 查询该表，因为它无法识别该表。

from pyspark.context import SparkContext

from pyspark.sql import HiveContext

sqlContext = HiveContext(sc)


sqlContext.sql("use default")

DataFrame[result: string]

sqlContext.sql("show tables").show()

+---------+-----------+

|tableName|isTemporary|

+---------+-----------+

+---------+-----------+


sqlContext.sql("FROM bank SELECT count(*)")

16/03/16 20:12:13 INFO parse.ParseDriver: Parsing command: FROM bank SELECT count(*)
16/03/16 20:12:13 INFO parse.ParseDriver: Parse Completed
Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    File "/usr/lib/spark/python/pyspark/sql/context.py", line 552, in sql
      return DataFrame(self._ssql_ctx.sql(sqlQuery), self)
    File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",   line 538, in __call__
    File "/usr/lib/spark/python/pyspark/sql/utils.py", line 40, in deco
      raise AnalysisException(s.split(': ', 1)[1])
  **pyspark.sql.utils.AnalysisException: no such table bank; line 1 pos 5**

新错误

>>> from pyspark.sql import HiveContext
>>> hive_context = HiveContext(sc)
>>> bank = hive_context.table("default.bank")
16/03/22 18:33:30 INFO DataNucleus.Persistence: Property datanucleus.cache.level2 unknown - will be ignored
16/03/22 18:33:30 INFO DataNucleus.Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
16/03/22 18:33:44 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
16/03/22 18:33:44 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
16/03/22 18:33:48 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
16/03/22 18:33:48 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
16/03/22 18:33:50 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MResourceUri" is tagged as "embedded-only" so does not have its own datastore table.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/spark/python/pyspark/sql/context.py", line 565, in table
    return DataFrame(self._ssql_ctx.table(tableName), self)
  File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 36, in deco
    return f(*a, **kw)
  File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o22.table.
: org.apache.spark.sql.catalyst.analysis.NoSuchTableException
    at org.apache.spark.sql.hive.client.ClientInterface$$anonfun$getTable$1.apply(ClientInterface.scala:123)
    at org.apache.spark.sql.hive.client.ClientInterface$$anonfun$getTable$1.apply(ClientInterface.scala:123)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.sql.hive.client.ClientInterface$class.getTable(ClientInterface.scala:123)
    at org.apache.spark.sql.hive.client.ClientWrapper.getTable(ClientWrapper.scala:60)
    at org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:406)
    at org.apache.spark.sql.hive.HiveContext$$anon$1.org$apache$spark$sql$catalyst$analysis$OverrideCatalog$$super$lookupRelation(HiveContext.scala:422)
    at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:203)
    at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:203)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$class.lookupRelation(Catalog.scala:203)
    at org.apache.spark.sql.hive.HiveContext$$anon$1.lookupRelation(HiveContext.scala:422)
    at org.apache.spark.sql.SQLContext.table(SQLContext.scala:739)
    at org.apache.spark.sql.SQLContext.table(SQLContext.scala:735)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:207)
    at java.lang.Thread.run(Thread.java:745)

谢谢

【问题讨论】：

标签： hive pyspark

【解决方案1】：

我们不能将 Hive 表名直接传递给 Hive 上下文 sql 方法，因为它不理解 Hive 表名。在 pyspark shell 中读取 Hive 表的一种方法是：

from pyspark.sql import HiveContext
hive_context = HiveContext(sc)
bank = hive_context.table("default.bank")
bank.show()

在 hive 表上运行 SQL：首先，我们需要注册我们通过读取 hive 表获得的数据帧。然后我们就可以运行 SQL 查询了。

bank.registerTempTable("bank_temp")
hive_context.sql("select * from bank_temp").show()

【讨论】：

bank = hive_context.table("bank") Traceback（最近一次调用最后）：文件“”，第 1 行，在文件“/usr/lib/spark/python /pyspark/sql/context.py”，第 565 行，在表中返回 DataFrame(self._ssql_ctx.table(tableName), self)File “/usr/lib/spark/python/lib/py4j-0.8.2.1-src. zip/py4j/java_gateway.py”，第 538 行，在 call 文件中“/usr/lib/spark/python/pyspark/sql/utils.py”，第 36 行，在 deco 中返回 f(* a, **kw) 文件“/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py”，第 300 行，在 get_return_value py4j.protocol.Py4JJavaError: 发生错误在调用 o30.table 时。
我已编辑答案以包含数据库名称。它现在应该可以工作了。
嗨 Bijay697，我收到错误 org.apache.spark.sql.catalyst.analysis.NoSuchTableException。我更新了原始帖子中的错误（在新错误下）。访问 HIVE Metastore 是否需要任何特殊配置？
错误消息表示 Hive 中不存在该表。您可以尝试在另一个数据库中创建表而不是在 Hive 中创建默认表。此外，如果您在集群模式下提交作业，您可能需要传递 hive-site.xml。
@Sledge 即 SparkContext，会话中的默认变量

【解决方案2】：

您可以使用sqlCtx.sql。 hive-site.xml 应该被复制到 spark conf 路径。

my_dataframe = sqlCtx.sql("Select * from categories")
my_dataframe.show()

【讨论】：

【解决方案3】：

SparkSQL 附带自己的元存储 (derby)，因此即使系统上未安装 hive，它也可以工作。这是默认模式。

在上述问题中，您在 hive 中创建了一个表。您会收到 table not found 错误，因为 SparkSQL 正在使用其默认元存储，该元存储没有您的配置单元表的元数据。

如果您希望 SparkSQL 使用 hive 元存储并访问 hive 表，则必须在 spark conf 文件夹中添加 hive-site.xml。

【讨论】：

“spark conf 文件夹”是什么意思？如果 pyspark 在 Zeppelin 应用程序中运行，有什么要特别提及的吗？

【解决方案4】：

我的问题的解决方案是，cp hive-site.xml 到你的$SPARK_HOME/conf，cp mysql-connect-java-*.jar 到你的$SPARK_HOME/jars，这个解决方案解决了我的问题.

【讨论】：