Pyspark nullable uuid: type uuid but expression is of type character varying (answers)

【Question title】: Pyspark nullable uuid type uuid but expression is of type character varying
【Posted】: 2020-11-03 22:39:30
【Question】:

Given a table design with a non-nullable uuid column and a nullable uuid column, how do you insert a DataFrame using Python 3.7.9 and Pyspark 2.4.3 with the postgresql-42.2.18.jar driver?

table_df = spark.read.format('jdbc') \
                     .option('driver', 'org.postgresql.Driver') \
                     .option('dbtable', 'example_table') \
                     .load()

table_df.printSchema()

root
 |-- id: string (nullable = false)
 |-- created: timestamp (nullable = true)
 |-- modified: timestamp (nullable = true)
 |-- example_uuid: string (nullable = true)


from pyspark.sql.functions import when, lit, col

from pyspark.sql.types import NullType, StringType

def replace(column, value):
    return when(column == value, lit(None).cast(NullType())).otherwise(column.cast(StringType()))

example_df = tasklog_df.withColumn("example_uuid", replace(col("example_uuid"), "NULL"))

example_df.write.mode('append').format('jdbc') \
                .option('driver', 'org.postgresql.Driver')\
                .option('stringtype', 'unspecified') \
                .save()

This results in Pyspark attempting the following insert:

INSERT INTO example_table
 ("id",
 "created",
 "modified",
 "example_uuid") 
VALUES 
 ('b49a90aa-a415-4aeb-a7ed-bfc42e43f5c7',
 '2020-03-29 02:00:11.06534-07',
 '2020-03-29 02:00:11.065361-07',
 NULL)

which fails with the infamous error:

ERROR: column "example_uuid" is of type uuid but expression is of type character
  Hint: You will need to rewrite or cast the expression.

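I have already cast the data. Either Pyspark is not generating the correct INSERT statement, or the postgres driver is treating the word NULL as a character value rather than a keyword. I need to use .option('stringtype', 'unspecified') so that Pyspark does not complain that the id column is a uuid.

lit(None).cast(NullType()) does not appear to do anything. There is no entry for a uuid type in pyspark.sql.types.

Without option('stringtype', 'unspecified'), Pyspark throws this error: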

Caused by: org.postgresql.util.PSQLException: ERROR: column "id" is of type uuid but expression is of type character varying
  Hint: You will need to rewrite or cast the expression.
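For reference, here is a minimal sketch of how the stringtype setting can be handed to the PostgreSQL driver through Spark's JDBC options. As far as I know, stringtype=unspecified is a PostgreSQL JDBC connection parameter that sends string parameters untyped so the server can infer uuid, and Spark forwards options it does not recognise to the driver as connection properties. The connection URL, the database name, and the helper names jdbc_options and append_to_table are hypothetical; the question does not show its URL.

```python
def jdbc_options(url, table, stringtype="unspecified"):
    """Build the options dict for Spark's JDBC data source.

    'stringtype' is not a Spark option; it is passed through to the
    PostgreSQL driver as a connection property.
    """
    return {
        "url": url,
        "driver": "org.postgresql.Driver",
        "dbtable": table,
        "stringtype": stringtype,
    }


def append_to_table(df, options):
    """Append a Spark DataFrame to the table described by `options`."""
    df.write.mode("append").format("jdbc").options(**options).save()


# Hypothetical connection details, for illustration only.
opts = jdbc_options("jdbc:postgresql://localhost:5432/exampledb", "example_table")
```

With a real DataFrame this would be called as append_to_table(example_df, opts).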

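The only approach left seems to be splitting the DataFrame into two: one in which the example_uuid field contains NULL and another in which example_uuid holds a uuid. The example_uuid field is then dropped from the DataFrame with the NULLs so that saving to the table raises no error. That seems like a lot of wasted effort when Pyspark should simply support a uuid type. Any suggestions or advice?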

【Question comments】:

Tags: python postgresql apache-spark pyspark

【Solution 1】:

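Including this functionality in pyspark would require adding a corresponding type to Spark SQL, which could be a significant effort because it requires changes to the optimizer and other components. As a workaround, the conversion should be done either in the JDBC driver (though this may violate the JDBC specification) or in Spark's own JDBC connector, but then it would not be database-agnostic. For example, the Spark Cassandra Connector automatically converts values to a "compatible" type when saving data to Cassandra, which lets it support UUIDs on read and write even though they are represented as strings in Spark.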
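To make the idea of a connector-level conversion concrete, the sketch below builds an INSERT statement that casts selected parameters to uuid explicitly, which is the kind of rewrite the PostgreSQL error hint asks for. This is not what Spark's JDBC writer does, and the helper name build_insert_sql is hypothetical. It assumes DB-API style %s placeholders and trusted identifiers; how the statement is executed is left open.

```python
def build_insert_sql(table, columns, uuid_columns):
    """Build a parameterised INSERT that casts the given columns to uuid.

    Uses DB-API '%s' placeholders; table and column names are assumed
    to come from trusted code, not user input.
    """
    quoted = ", ".join('"{}"'.format(c) for c in columns)
    placeholders = ", ".join(
        "CAST(%s AS uuid)" if c in uuid_columns else "%s" for c in columns
    )
    return "INSERT INTO {} ({}) VALUES ({})".format(table, quoted, placeholders)


sql = build_insert_sql(
    "example_table",
    ["id", "created", "modified", "example_uuid"],
    {"id", "example_uuid"},
)
```

Because the cast is part of the statement, a NULL parameter for example_uuid is typed as uuid by the server instead of as a character value.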

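【Comments】:

【Solution 2】:

I know this is nasty and affects storage, but until this is resolved (i.e. the UUID data type is supported and/or NULL can be written to a UUID field), I have opted to store the zero GUID (00000000-0000-0000-0000-000000000000).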

.withColumn("nullable_uuid_field", coalesce($"nullable_uuid_field", lit("00000000-0000-0000-0000-000000000000")))
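The line above uses Scala syntax ($"..."). Since the question is about PySpark, here is a rough Python equivalent. The helper name fill_null_uuid and the constant ZERO_UUID are made up for illustration; coalesce, col, and lit are the standard pyspark.sql.functions.

```python
import uuid

# The all-zero UUID used as a stand-in for NULL.
ZERO_UUID = str(uuid.UUID(int=0))


def fill_null_uuid(df, column):
    """Replace NULLs in `column` with the zero UUID string."""
    # Imported here so the constant above is usable without Spark installed.
    from pyspark.sql import functions as F

    return df.withColumn(column, F.coalesce(F.col(column), F.lit(ZERO_UUID)))
```

For the question's data this would be example_df = fill_null_uuid(example_df, "example_uuid"). Anything reading the table afterwards has to treat the zero UUID as "no value".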

【Comments】:

【Solution 3】:

Personally, I ended up splitting my writes and relying on the db to set the null.

insert(to_insert.where(F.col("col_name").isNull()).drop("col_name"))
insert(to_insert.where(F.col("col_name").isNotNull()))
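The two lines above can be wrapped into a small reusable helper, sketched below. The names split_by_null and insert_split are hypothetical, and insert stands for whatever function performs the JDBC write, as in the answer. The sketch uses df[column] instead of F.col so it needs no imports.

```python
def split_by_null(df, column):
    """Split `df` into (rows where `column` is NULL, with that column
    dropped) and (rows where `column` is not NULL, unchanged)."""
    null_rows = df.where(df[column].isNull()).drop(column)
    non_null_rows = df.where(df[column].isNotNull())
    return null_rows, non_null_rows


def insert_split(df, column, insert):
    """Write both halves with the caller-supplied `insert` function."""
    null_rows, non_null_rows = split_by_null(df, column)
    # The column is absent from the first write, so the database fills in
    # its default (NULL unless the table defines another default).
    insert(null_rows)
    insert(non_null_rows)
```

This costs two write jobs per nullable uuid column, and the number of splits grows if several such columns can be NULL independently.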

【Comments】: