Storing a Parquet file in a PostgreSQL database: answers

【Question title】: Storing parquet file into PostgreSQL Database
【Posted】: 2018-09-30 23:26:38
【Question description】:

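I want to write a Parquet file into PostgreSQL. I am using Spark and writing the data with the Spark DataFrame's write.jdbc function. Everything works for Parquet column types such as long, decimal, or text. The problem is with complex types such as Map. I want to store the Map as json in my PostgreSQL table. Since I know PostgreSQL can convert the text data type to json automatically (using a cast operation), I dump the map to a JSON string.

But the Spark program complains that we are trying to insert a "character varying" value into a column of type "json". This makes it clear that PostgreSQL does not automatically convert "character varying" to JSON.

I then logged in to my database and manually inserted a JSON string into the JSON-typed column of the table, and that worked.

My question is: why does my Spark program complain about the cast operation?

I am using Spark version 1.6.1, PostgreSQL 4.3, and JDBC 42.1.1.

Here is the code snippet: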

url = "jdbc:postgresql://host_name:host_port/db_name"
data_frame.write.jdbc(url, table_name, properties={"user": user, "password": password})
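
For context, the step of dumping the map to a JSON string before this write could be sketched roughly as follows. This is an illustration, not the asker's actual code: the helper name `map_to_json` is hypothetical, and the commented PySpark wiring assumes the map column is called `value` (the name that appears in the error message) and that a MapType value reaches a Python UDF as a dict.

```python
import json


def map_to_json(value):
    """Serialize a dict (assumed shape of a MapType value in a Python UDF) to JSON text."""
    if value is None:
        return None  # keep SQL NULLs as NULLs
    # sort_keys makes the output deterministic, which helps when comparing rows
    return json.dumps(value, sort_keys=True)


# Hypothetical wiring in PySpark (illustrative only, needs a running Spark session):
#
#   from pyspark.sql.functions import udf
#   from pyspark.sql.types import StringType
#   to_json_udf = udf(map_to_json, StringType())
#   data_frame = data_frame.withColumn("value", to_json_udf(data_frame["value"]))

print(map_to_json({"b": 2, "a": 1}))
```

The resulting column is a plain string column in Spark, which is why the JDBC driver binds it as "character varying" when inserting.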

Error stack trace:

Hint: You will need to rewrite or cast the expression.
  Position: 66  Call getNextException to see other errors in the batch.
    at org.postgresql.jdbc.BatchResultHandler.handleError(BatchResultHandler.java:148)
    at org.postgresql.core.ResultHandlerDelegate.handleError(ResultHandlerDelegate.java:50)
    at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2190)
    at org.postgresql.core.v3.QueryExecutorImpl.flushIfDeadlockRisk(QueryExecutorImpl.java:1325)
    at org.postgresql.core.v3.QueryExecutorImpl.sendQuery(QueryExecutorImpl.java:1350)
    at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:458)
    at org.postgresql.jdbc.PgStatement.executeBatch(PgStatement.java:791)
    at org.postgresql.jdbc.PgPreparedStatement.executeBatch(PgPreparedStatement.java:1547)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:215)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:277)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:276)
    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$33.apply(RDD.scala:920)
    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$33.apply(RDD.scala:920)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    ... 1 more
Caused by: org.postgresql.util.PSQLException: ERROR: column "value" is of type json but expression is of type character varying
  Hint: You will need to rewrite or cast the expression.
  Position: 66
    at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2476)
    at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2189)
    ... 18 more

【Question comments】:

The output of data_frame.show() would also greatly help in giving you an answer.

Tags: postgresql apache-spark jdbc pyspark parquet

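【Solution 1】:

Are you using AWS services? If so, use an AWS Glue crawler to crawl your files and create a table. Then create a Glue job that takes this data (the table) from the catalog as input, choose an AWS RDS JDBC connection for the output, and select the required database. Run the job, and your Parquet file data will be loaded into the Postgres table.

【Comments】: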

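【Solution 2】:

It's very late, but here is the answer for any lost soul.

You need to pass the "stringtype" connection parameter to the JDBC driver. It specifies the type to use when binding PreparedStatement parameters set via setString(). By default it is varchar, which forces the parameter to be sent as varchar and prevents any implicit cast (in my case, JSON string to JSON). If we specify stringtype=unspecified, the parameter is sent untyped and it is left to the database to decide its type. In my case, that allowed Postgres to cast the string to JSON without trouble.

Documentation: https://jdbc.postgresql.org/documentation/head/connect.html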
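
Applied to the snippet from the question, this could look like the minimal sketch below. The helper name `build_jdbc_options` and the sample host, port, and credentials are placeholders introduced here for illustration; only the `stringtype` parameter itself comes from the answer. It assumes Spark forwards the `properties` dict to the PostgreSQL driver as connection properties; appending `?stringtype=unspecified` to the URL should be an equivalent alternative.

```python
def build_jdbc_options(host, port, db_name, user, password):
    """Return (url, properties) for DataFrame.write.jdbc with stringtype=unspecified."""
    url = "jdbc:postgresql://{}:{}/{}".format(host, port, db_name)
    properties = {
        "user": user,
        "password": password,
        # PostgreSQL JDBC connection parameter: send setString() values
        # untyped so the server can infer json from the target column.
        "stringtype": "unspecified",
    }
    return url, properties


# Placeholder values, not real connection details.
url, properties = build_jdbc_options("host_name", 5432, "db_name", "user", "secret")
print(url)

# With a live Spark session and database, the write from the question becomes:
#   data_frame.write.jdbc(url, table_name, properties=properties)
```

Note that this setting applies to every string parameter on the connection, so other text columns are also sent untyped and resolved by the server.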

【Comments】: