【Question title】: Left anti join does not treat null as a duplicate value in Spark
【Posted】: 2020-06-17 06:51:12
【Question description】:

I have two tables and want to read only the unique records from the source table. Both tables contain null values.

source table:

name | age | degree | dept
aaa  | 20  | ece    | null
bbb  | 20  | it     | null
ccc  | 30  | mech   | null

target table:

name | age | degree | dept
aaa  | 20  | ece    | null
bbb  | 20  | it     | null

soruce_df.join(target_df, Seq("name","age","degree"), "leftanti") -> works

soruce_df.join(target_df, Seq("name","age","degree","dept"), "leftanti") -> does not work

I need to pick only the third record from the source.

If I use name, age, and degree as the join keys, it works as expected.

But when I include dept, it picks all the records from the source table.

Please help me.

【Comments】:

    Tags: apache-spark pyspark apache-spark-sql


    【Solution 1】:

    Use a null-safe equality test (<=>) in the join condition, so that two null keys compare as equal.

        soruce_df.join(target_df, soruce_df("name") <=> target_df("name") && soruce_df("age") <=> target_df("age") &&
          soruce_df("degree") <=> target_df("degree") && soruce_df("dept") <=> target_df("dept")
          ,"leftanti").show(false)
    
        /**
          * +----+---+------+----+
          * |name|age|degree|dept|
          * +----+---+------+----+
          * |ccc |30 |mech  |null|
          * +----+---+------+----+
          */
    

    In Python, replace <=> with a call to the eqNullSafe method, as in the following example:

    df1.join(df2, df1["value"].eqNullSafe(df2["value"]))
    

    【Comments】:

      【Solution 2】:

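      Spark provides a null-safe equality operator to handle this situation. I ran into a similar case where duplicate records were inserted because one column was null. null == null returns null, whereas null <=> null returns true. See the documentation: https://spark.apache.org/docs/3.0.0-preview/sql-ref-null-semantics.html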
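
      To illustrate why the four-column join keeps every source row, here is a minimal plain-Python sketch of the two comparison semantics applied to the tables from the question. It is only a model and does not call Spark: the helper names `sql_eq`, `null_safe_eq`, and `left_anti` are hypothetical, and `None` stands in for SQL null.

```python
from typing import Callable, List, Optional, Tuple

Row = Tuple[object, ...]


def sql_eq(a: object, b: object) -> Optional[bool]:
    """Model of ordinary SQL equality: the result is unknown (None) if either side is null."""
    if a is None or b is None:
        return None
    return a == b


def null_safe_eq(a: object, b: object) -> bool:
    """Model of null-safe equality (<=> / eqNullSafe): always True or False, never unknown."""
    if a is None and b is None:
        return True
    if a is None or b is None:
        return False
    return a == b


def left_anti(source: List[Row], target: List[Row],
              eq: Callable[[object, object], Optional[bool]]) -> List[Row]:
    """Keep each source row for which no target row matches on every column.

    A pair of rows matches only when every column comparison is exactly True;
    an unknown (None) comparison counts as a non-match, as in a join condition.
    """
    return [
        s for s in source
        if not any(all(eq(x, y) is True for x, y in zip(s, t)) for t in target)
    ]


# Data copied from the question; None represents null in the dept column.
source = [("aaa", 20, "ece", None), ("bbb", 20, "it", None), ("ccc", 30, "mech", None)]
target = [("aaa", 20, "ece", None), ("bbb", 20, "it", None)]

plain = left_anti(source, target, sql_eq)        # dept never compares as True
safe = left_anti(source, target, null_safe_eq)   # null dept values compare as equal

print(len(plain))  # 3: every source row survives the anti join
print(safe)        # [('ccc', 30, 'mech', None)]
```

      With ordinary equality the dept comparison is never true, so no source row finds a match and all three are returned. With the null-safe comparison only the ccc row remains, which is the result the question asks for.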

      【Comments】:
