【Question title】: Spark SQL 2.0: NullPointerException with a valid PostgreSQL query
【Posted】: 2018-08-22 06:04:29
【Question description】:

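I have a valid PostgreSQL query: when I copy and paste it into psql, I get the desired result.
But when I run it through Spark SQL, it throws a NullPointerException.

Here is the snippet of code that causes the error: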

extractDataFrame().show()

private def extractDataFrame(): DataFrame = {
  val query =
    """(
      SELECT events.event_facebook_id, events.name, events.tariffrange,
        eventscounts.attending_count, eventscounts.declined_count, eventscounts.interested_count,
        eventscounts.noreply_count,
        artists.facebookid as artist_facebook_id, artists.likes as artistlikes,
        organizers.organizerid, organizers.likes as organizerlikes,
        places.placeid, places.capacity, places.likes as placelikes
      FROM events
        LEFT JOIN eventscounts on eventscounts.event_facebook_id = events.event_facebook_id
        LEFT JOIN eventsartists on eventsartists.event_id = events.event_facebook_id
          LEFT JOIN artists on eventsartists.artistid = artists.facebookid
        LEFT JOIN eventsorganizers on eventsorganizers.event_id = events.event_facebook_id
          LEFT JOIN organizers on eventsorganizers.organizerurl = organizers.facebookurl
        LEFT JOIN eventsplaces on eventsplaces.event_id = events.event_facebook_id
          LEFT JOIN places on eventsplaces.placefacebookurl = places.facebookurl
      ) df"""

  spark.sqlContext.read.jdbc(databaseURL, query, connectionProperties)
}

The SparkSession is defined as follows:

val databaseURL = "jdbc:postgresql://dbHost:5432/ticketapp" 
val spark = SparkSession
  .builder
  .master("local[*]")
  .appName("tariffPrediction")
  .getOrCreate()

val connectionProperties = new Properties
connectionProperties.put("user", "simon")
connectionProperties.put("password", "root")

Here is the full stack trace:

[SparkException: Job aborted due to stage failure: Task 0 in stage 27.0 failed 1 times, most recent failure: Lost task 0.0 in stage 27.0 (TID 27, localhost): java.lang.NullPointerException
    at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:210)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
    at org.apache.spark.scheduler.Task.run(Task.scala:85)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:]

The most surprising part is that if I remove any one of the LEFT JOIN clauses from the SQL query (no matter which one), I do not get any error...

【Comments】:

    Tags: postgresql scala apache-spark apache-spark-sql


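    【Answer 1】:

    I had a very similar problem, but with a Teradata data source. It came down to the column nullability on the DataFrame not matching the underlying data: the column was marked nullable=false, but some rows had null values in that field. In my case the cause was the Teradata JDBC driver not returning the correct column metadata. I have not found a workaround yet.

    To look at the generated code (where the NPE is thrown):

    • Import org.apache.spark.sql.execution.debug._
    • Call .debugCodegen() on the Dataset/DataFrame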
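
The two steps above can be sketched as follows. This is a minimal illustration, not the asker's code: the object name `NullabilityCheck`, the helper `nonNullableColumns`, and the small in-memory DataFrame are all assumptions made for the example. In the question, `df` would be the result of `extractDataFrame()`. The helper lists the columns Spark believes can never be null, which are the ones worth comparing against the real data.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.execution.debug._

object NullabilityCheck {
  // Hypothetical helper: names of the columns whose schema says nullable = false.
  def nonNullableColumns(df: DataFrame): Seq[String] =
    df.schema.fields.filter(!_.nullable).map(_.name).toSeq

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[*]")
      .appName("nullabilityCheck")
      .getOrCreate()
    import spark.implicits._

    // Stand-in data (assumption); replace with the JDBC DataFrame under investigation.
    val df = Seq((1L, "a"), (2L, "b")).toDF("id", "name")

    // Shows the nullable flag Spark assigned to every column.
    df.printSchema()
    println(s"Columns marked non-nullable: ${nonNullableColumns(df).mkString(", ")}")

    // Prints the Java code produced by whole-stage code generation for this plan.
    df.debugCodegen()

    spark.stop()
  }
}
```

If a column listed as non-nullable can in fact hold nulls in the source (for example because it comes from the right-hand side of a LEFT JOIN), that mismatch is a likely candidate for the NPE.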

    Hope this helps.

    【Comments】:

      【Answer 2】:

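      This issue is related to the Teradata JDBC driver. It is discussed at https://community.teradata.com/t5/Connectivity/Teradata-JDBC-Driver-returns-the-wrong-schema-column-nullability/m-p/76667/highlight/true#M3798.

      The first page of that thread discusses the root cause. The solution is on the third page.

      The people at Teradata say they fixed this in the 16.10.* drivers with the MAYBENULL parameter, but I am still seeing nondeterministic behavior.

      There is a similar discussion at https://issues.apache.org/jira/browse/SPARK-17195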

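      【Comments】:

        【Answer 3】:

        If anyone else is still looking for a solution: you can use NULLIF on the columns that cause the problem. The problem is caused by the JOIN, which produces nulls in columns that were originally declared NOT NULL in the schema.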
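
A rough sketch of what this could look like for the query in the question. Which columns are actually declared NOT NULL is not stated in the thread, so wrapping `artists.likes` and `organizers.likes` is an assumption, as are the object name `NullIfQuery`, the helper `nullIf`, and the sentinel value `-1` (chosen on the assumption that it never occurs as a real value, so NULLIF leaves the data unchanged). The idea, as far as I understand it, is that a computed expression is no longer reported by the driver as a NOT NULL base column.

```scala
object NullIfQuery {
  // Hypothetical helper: wraps a column in NULLIF so it becomes a computed expression.
  // `sentinel` is assumed to be a value that never appears in the column.
  def nullIf(column: String, sentinel: String, alias: String): String =
    s"NULLIF($column, $sentinel) AS $alias"

  // Shortened version of the question's query with two columns wrapped.
  val query: String =
    s"""(
       |  SELECT events.event_facebook_id, events.name,
       |    ${nullIf("artists.likes", "-1", "artistlikes")},
       |    ${nullIf("organizers.likes", "-1", "organizerlikes")}
       |  FROM events
       |    LEFT JOIN eventsartists ON eventsartists.event_id = events.event_facebook_id
       |      LEFT JOIN artists ON eventsartists.artistid = artists.facebookid
       |    LEFT JOIN eventsorganizers ON eventsorganizers.event_id = events.event_facebook_id
       |      LEFT JOIN organizers ON eventsorganizers.organizerurl = organizers.facebookurl
       |) df""".stripMargin

  // The string would then be passed to Spark the same way as in the question:
  //   spark.read.jdbc(databaseURL, NullIfQuery.query, connectionProperties)
}
```

After loading, `printSchema()` on the resulting DataFrame should show whether the wrapped columns are now reported as nullable.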

        Related JIRA: https://issues.apache.org/jira/browse/SPARK-18859

        【Comments】:
