【Question title】: How to read Druid data using the JDBC driver with Spark?
【Posted】: 2021-02-04 03:37:01
【Question】:

How can I read data from Druid using Spark and the Avatica JDBC driver? I am following the Avatica JDBC documentation.

Reading data from Druid with Python and the Jaydebeapi module works, using the following code.

$ python
import jaydebeapi

conn = jaydebeapi.connect(
    "org.apache.calcite.avatica.remote.Driver",
    "jdbc:avatica:remote:url=http://0.0.0.0:8082/druid/v2/sql/avatica/",
    {"user": "druid", "password": "druid"},
    "/root/avatica-1.17.0.jar",
)
cur = conn.cursor()
cur.execute("SELECT * FROM INFORMATION_SCHEMA.TABLES")
cur.fetchall()

The output is:

[('druid', 'druid', 'wikipedia', 'TABLE'),
('druid', 'INFORMATION_SCHEMA', 'COLUMNS', 'SYSTEM_TABLE'),
('druid', 'INFORMATION_SCHEMA', 'SCHEMATA', 'SYSTEM_TABLE'),
('druid', 'INFORMATION_SCHEMA', 'TABLES', 'SYSTEM_TABLE'),
('druid', 'sys', 'segments', 'SYSTEM_TABLE'),
('druid', 'sys', 'server_segments', 'SYSTEM_TABLE'),
('druid', 'sys', 'servers', 'SYSTEM_TABLE'),
('druid', 'sys', 'supervisors', 'SYSTEM_TABLE'),
('druid', 'sys', 'tasks', 'SYSTEM_TABLE')]  -> default tables
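
The tuples above carry no column names. As a small sketch (not part of the original post), the DB-API `cursor.description` attribute, which jaydebeapi cursors also expose, can be used to pair each value with its column name. The helper name `rows_as_dicts` is hypothetical, and the stdlib `sqlite3` module stands in for the Druid connection so the example runs without a Druid cluster.

```python
import sqlite3


def rows_as_dicts(cursor):
    """Return the remaining rows of an executed DB-API cursor as dicts."""
    # Per PEP 249, each entry of cursor.description starts with the column name.
    columns = [col[0] for col in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]


# Stand-in for the jaydebeapi connection (assumption: any DB-API cursor
# behaves the same way for this purpose).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE tables (TABLE_SCHEMA TEXT, TABLE_NAME TEXT, TABLE_TYPE TEXT)")
cur.execute("INSERT INTO tables VALUES ('druid', 'wikipedia', 'TABLE')")
cur.execute("SELECT * FROM tables")
rows = rows_as_dicts(cur)
print(rows)
```

With the jaydebeapi cursor from the question, the call would be `rows_as_dicts(cur)` right after `cur.execute(...)`, in place of `cur.fetchall()`.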

But I want to read the data using Spark and JDBC.

I tried, but with Spark the read fails, as shown in the code below.

$ pyspark --jars /root/avatica-1.17.0.jar

df = spark.read.format('jdbc') \
    .option('url', 'jdbc:avatica:remote:url=http://0.0.0.0:8082/druid/v2/sql/avatica/') \
    .option("dbtable", 'INFORMATION_SCHEMA.TABLES') \
    .option('user', 'druid') \
    .option('password', 'druid') \
    .option('driver', 'org.apache.calcite.avatica.remote.Driver') \
    .load()

The output is:

Traceback (most recent call last):
  File "<stdin>", line 8, in <module>
  File "/root/spark-2.4.4-bin-hadoop2.7/python/pyspark/sql/readwriter.py", line 172, in load
    return self._df(self._jreader.load())
  File "/root/spark-2.4.4-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/root/spark-2.4.4-bin-hadoop2.7/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/root/spark-2.4.4-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o2999.load.
: java.sql.SQLException: While closing connection
...
Caused by: java.lang.RuntimeException: com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException: Unrecognized field "rpcMetadata" (class org.apache.calcite.avatica.remote.Service$CloseConnectionResponse), not marked as ignorable (0 known properties: ])
 at [Source: {"response":"closeConnection","rpcMetadata":{"response":"rpcMetadata","serverAddress":"172.18.0.7:8082"}}
; line: 1, column: 46]
...
Caused by: com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException: Unrecognized field "rpcMetadata" (class org.apache.calcite.avatica.remote.Service$CloseConnectionResponse), not marked as ignorable (0 known properties: ])
 at [Source: {"response":"closeConnection","rpcMetadata":{"response":"rpcMetadata","serverAddress":"172.18.0.7:8082"}}
; line: 1, column: 46] 
...
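
Since the jaydebeapi connection shown earlier does work, one possible workaround (a sketch, not something this thread confirms) is to fetch the rows on the driver through jaydebeapi and hand them to `spark.createDataFrame`. The helper name `druid_query_to_df` is hypothetical; `conn` is assumed to be the jaydebeapi connection and `spark` the active SparkSession. Every row passes through driver memory, so this only suits small result sets.

```python
def druid_query_to_df(spark, conn, sql):
    """Run `sql` over a DB-API connection and wrap the rows in a Spark DataFrame.

    Assumptions: `conn` is a DB-API connection (for example, the one returned
    by jaydebeapi.connect) and `spark` is a SparkSession. All rows are
    collected on the driver before Spark sees them.
    """
    cur = conn.cursor()
    try:
        cur.execute(sql)
        # The first item of each description entry is the column name.
        columns = [col[0] for col in cur.description]
        rows = [tuple(row) for row in cur.fetchall()]
    finally:
        cur.close()
    # createDataFrame accepts a list of tuples plus a list of column names;
    # column types are inferred from the data, so an empty result set would
    # need an explicit schema instead.
    return spark.createDataFrame(rows, schema=columns)
```

Usage would look like `df = druid_query_to_df(spark, conn, "SELECT * FROM INFORMATION_SCHEMA.TABLES")`. This trades Spark's parallel JDBC read for a single-threaded fetch, which is acceptable for metadata tables but not for large datasources.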


【Comments】:

    Tags: apache-spark jdbc apache-spark-sql druid apache-calcite


    【Answer 1】:

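    I found another way to solve this problem: I used spark-druid-connector to connect Druid and Spark.

    However, I changed some of its code, such as this, so that it works in my environment.

    This is my environment: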

    • Spark: 2.4.4
    • Scala: 2.11.12
    • Python: 3.6.8
    • Druid:
      • ZooKeeper: 3.5
      • Druid: 0.17.0

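    However, it has one problem.

    • Once you have used spark-druid-connector at least once, every subsequent SQL query, such as spark.sql("select * from temp_view"), is routed through the connector's planner.
    • If you use the DataFrame API instead, such as df.distinct().count(), there is no problem. I have not solved this yet.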

    【Comments】:

      【Answer 2】:

      I tried it with spark-shell:

      ./bin/spark-shell --driver-class-path avatica-1.17.0.jar --jars avatica-1.17.0.jar

      val jdbcDF = spark.read.format("jdbc")
          .option("url", "jdbc:avatica:remote:url=http://0.0.0.0:8082/druid/v2/sql/avatica/")
          .option("dbtable", "INFORMATION_SCHEMA.TABLES")
          .option("user", "druid")
          .option("password", "druid")
          .load()
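
For PySpark users, the same read can be sketched as below. The helper names `avatica_url`, `avatica_jdbc_options`, and `read_druid_table` are hypothetical; the URL shape, driver class, and credentials are the ones from the question. Launching with `pyspark --driver-class-path /root/avatica-1.17.0.jar --jars /root/avatica-1.17.0.jar` mirrors the spark-shell command above; whether that avoids the error in the question is not confirmed in this thread.

```python
def avatica_url(host, port):
    """Build the Avatica remote JDBC URL in the form used in the question."""
    return "jdbc:avatica:remote:url=http://{}:{}/druid/v2/sql/avatica/".format(host, port)


def avatica_jdbc_options(host, port, table, user, password):
    """Collect the Spark JDBC reader options used in this thread into one dict."""
    return {
        "url": avatica_url(host, port),
        "dbtable": table,
        "user": user,
        "password": password,
        # Naming the driver class explicitly, as in the question's PySpark attempt.
        "driver": "org.apache.calcite.avatica.remote.Driver",
    }


def read_druid_table(spark, options):
    """Load a table through Spark's JDBC source; `spark` is an existing SparkSession."""
    return spark.read.format("jdbc").options(**options).load()


opts = avatica_jdbc_options("0.0.0.0", 8082, "INFORMATION_SCHEMA.TABLES", "druid", "druid")
print(opts["url"])
```

Inside a PySpark session, `df = read_druid_table(spark, opts)` would then issue the same read as the Scala snippet, with the driver class set explicitly.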
      

      【Comments】:
