【Question title】: SQL (Impala) - Finding duplicate values, including NULL values
【Posted】: 2021-06-02 10:23:30
【Question】:

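I have a table named baseTable with many columns, but I use three of them, Material_Type, Material_Desc (which contains NULL values), and Material_Number, to find duplicates with row_number() and PARTITION BY. Note: I need to filter duplicates based on three conditions.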

  1. When Material_Type = Material_Type
  2. When Material_Desc = Material_Desc
  3. When Material_Number <> Material_Number

Sample table:

Material_Type  Material_Desc  Material_Number 
 ABC                XYZ              1
 ABC                XYZ              1
 ABC                XYZ              2
 ABC                XYZ              3
 DEF                IMM              1
 LMN                NULL             1
 LMN                NULL             2

I want newTable to contain only the duplicated values; rows whose values are distinct should be dropped.

Expected output:

Material_Type  Material_Desc  Material_Number  new
 ABC                XYZ              1          1
 ABC                XYZ              1          2
 ABC                XYZ              2          3
 ABC                XYZ              3          4
 LMN                NULL             1          1
 LMN                NULL             2          2

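I used the query below, but it does not produce the expected output: it leaves out the rows where Material_Desc is NULL, does not partition on NULL, and also creates unwanted duplicate records.

Query used: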

create table newTable as 
with mycte as
(
select
m.MATERIAL_NUMBER
,m.MATERIAL_TYPE
,m.Material_Desc,
row_number() over(partition BY d.MATERIAL_TYPE,d.Material_Desc order by d.MATERIAL_NUMBER) as new
from baseTable m
inner join
(
select MATERIAL_NUMBER,MATERIAL_TYPE,Material_Desc,count(*) from baseTable group by
MATERIAL_NUMBER,MATERIAL_TYPE,Material_Desc having count(*) > 1
) d on d.MATERIAL_NUMBER <> m.MATERIAL_NUMBER and d.MATERIAL_TYPE=m.MATERIAL_TYPE 
and d.Material_Desc= m.Material_Desc)
select * from mycte 

Any help would be greatly appreciated.

【Question comments】:

    Tags: sql duplicates cloudera impala hue


    【Solution 1】:

    Just use row_number() and count(*) as window functions:

    select Material_Type, Material_Desc, Material_Number,
           row_number() over (partition by Material_Type, Material_Desc  order by Material_Number) as new
    from (select t.*,
                 count(*) over (partition by Material_Type, Material_Desc) as cnt
          from t
         ) t
    where cnt > 1;
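
To try the query above without creating a table, the sample rows from the question can be inlined as a CTE. This is only a sketch: the CTE stands in for the placeholder table `t`, the inner alias `x` is made up for illustration, and the `CAST(NULL AS STRING)` assumes Material_Desc is a STRING column.

```sql
-- Sample rows from the question, inlined so the query needs no real table.
WITH t AS (
  SELECT 'ABC' AS Material_Type, 'XYZ' AS Material_Desc, 1 AS Material_Number
  UNION ALL SELECT 'ABC', 'XYZ', 1
  UNION ALL SELECT 'ABC', 'XYZ', 2
  UNION ALL SELECT 'ABC', 'XYZ', 3
  UNION ALL SELECT 'DEF', 'IMM', 1
  UNION ALL SELECT 'LMN', CAST(NULL AS STRING), 1  -- assumed STRING column
  UNION ALL SELECT 'LMN', CAST(NULL AS STRING), 2
)
SELECT Material_Type, Material_Desc, Material_Number,
       row_number() OVER (PARTITION BY Material_Type, Material_Desc
                          ORDER BY Material_Number) AS new
FROM (SELECT t.*,
             -- PARTITION BY puts all NULL descriptions into one partition
             count(*) OVER (PARTITION BY Material_Type, Material_Desc) AS cnt
      FROM t
     ) x  -- hypothetical alias
WHERE cnt > 1;
```

On this data the ABC/XYZ partition has four rows and the LMN/NULL partition has two, so both are kept and numbered from 1, while the single DEF/IMM row is filtered out. The order in which the result rows are returned is not guaranteed without an outer ORDER BY.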
    

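    The first query works for the data you provided; it simply counts the rows for each type and description. If you really need the material numbers to differ, one approach is to compare min() and max():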

    select Material_Type, Material_Desc, Material_Number,
           row_number() over (partition by Material_Type, Material_Desc order by Material_Number) as new
    from (select t.*,
                 min(Material_Number) over (partition by Material_Type, Material_Desc) as min_Material_Number,
                 max(Material_Number) over (partition by Material_Type, Material_Desc) as max_Material_Number
          from t
         ) t
    where min_Material_Number <> max_Material_Number;
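
For comparison, the join in the question loses the LMN rows because `d.Material_Desc = m.Material_Desc` evaluates to NULL, not true, when both sides are NULL, whereas PARTITION BY and GROUP BY place NULLs in the same group. If a join-based version is preferred, a null-safe comparison is one option. The following is a hedged sketch rather than the answer's method: it assumes an Impala release that supports `IS NOT DISTINCT FROM`, and that only Material_Desc can be NULL, as the question states.

```sql
-- Sketch: join-based variant with a null-safe match on Material_Desc.
CREATE TABLE newTable AS
SELECT m.Material_Type, m.Material_Desc, m.Material_Number,
       row_number() OVER (PARTITION BY m.Material_Type, m.Material_Desc
                          ORDER BY m.Material_Number) AS new
FROM baseTable m
INNER JOIN (
    -- One row per (type, description) group that occurs more than once;
    -- GROUP BY treats NULL descriptions as a single group.
    SELECT Material_Type, Material_Desc
    FROM baseTable
    GROUP BY Material_Type, Material_Desc
    HAVING count(*) > 1
) d
  ON d.Material_Type = m.Material_Type
 AND d.Material_Desc IS NOT DISTINCT FROM m.Material_Desc;  -- true for NULL vs NULL
```

Because the subquery returns at most one row per group, the join does not multiply rows the way the original join on Material_Number did. The window-function queries above avoid the self-join entirely and are usually the simpler choice.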
    

    【Comments】:
