【Question title】: PARTITION BY Name, Id to compare and detect problems
【Posted】: 2015-10-18 13:48:19
【Question】:

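Explanation

Suppose there are 3 companies. We join the tables by Name, because not every employee has provided a PersonalNo. StringId exists only for specialists, so it cannot be used for joining either. The same employee can work in several companies.


Problem

The problem is that different employees can share the same name (same first and last name, or, for example, only a first name was provided).


What do I need?

Return 1 when the data has a problem and 0 when it is correct.


Rules for detecting problems

  1. When there are several rows (2 or more) with the same name, all with the same PersonalNo, and not all of them have a StringId (see Peter), it should return 1 (wrong).
  2. When there are several rows (2 or more) with the same name and one of them has a NULL PersonalNo (see John), but they all have the same StringId, it should return 0 (correct; one of the companies simply did not provide PersonalNo).
  3. When there are several rows (2 or more) with the same name, all PersonalNo values are equal, and all StringId values are equal (see Lisa), it should return 0 (correct).
  4. When there are several rows (2 or more) with the same name and several different PersonalNo values, with StringId provided, it should work like this: here we see 2 different people, Jennifer with PersonalNo 4805250141 and Jennifer with PersonalNo 4920225088. The Jennifer with NULL PersonalNo has the same StringId as the Jennifer with PersonalNo 4920225088, so it should return 0 (correct). The Jennifer with PersonalNo 4805250141 should not be selected at all, because she has no StringId and only 1 row has that PersonalNo.
  5. If there is only 1 row and no StringId is provided, it should not appear in the result at all.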

Sample data

Company     Name        PersonalNo   StringId
Comp1       Peter       3850342515    85426
Comp2       Peter       3850342515    ''    -- Same PersonalNo, but one row has no StringId: 1 (wrong)
Comp1       John        NULL          12345
Comp2       John        3952525252    12345 -- Same StringId and one PersonalNo is NULL: 0 (correct)
Comp1       Lisa        4951212581    52124
Comp3       Lisa        4951212581    52124 -- PersonalNo values are equal and StringId values are equal: 0 (correct)
Comp1       Jennifer    4805250141    ''    -- Different PersonalNo and no StringId: should not appear at all
Comp1       Jennifer    4920225088    55443 -- Two different PersonalNo values plus a NULL PersonalNo; the NULL row
Comp3       Jennifer    NULL          55443 -- shares its StringId with a row that has a PersonalNo: 0 (correct)
Comp1       Ralph       3961212256    ''    -- Should not appear in the result: only 1 row with this PersonalNo and no StringId

Expected output

Peter     1
John      0
Lisa      0
Jennifer  0
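
The rules are easier to check against the sample data when written out as code. The following is only a sketch of one possible reading of them, in Python rather than T-SQL so that it can be run directly. The function name `check` and the decision to treat '' exactly like NULL are assumptions, not something the question states outright.

```python
from collections import defaultdict

# Sample rows from the question: (Company, Name, PersonalNo, StringId).
# None stands for NULL; '' is an empty value, treated here like NULL (assumption).
ROWS = [
    ("Comp1", "Peter", "3850342515", "85426"),
    ("Comp2", "Peter", "3850342515", ""),
    ("Comp1", "John", None, "12345"),
    ("Comp2", "John", "3952525252", "12345"),
    ("Comp1", "Lisa", "4951212581", "52124"),
    ("Comp3", "Lisa", "4951212581", "52124"),
    ("Comp1", "Jennifer", "4805250141", ""),
    ("Comp1", "Jennifer", "4920225088", "55443"),
    ("Comp3", "Jennifer", None, "55443"),
    ("Comp1", "Ralph", "3961212256", ""),
]


def check(rows):
    """Return {name: 0 or 1} under one reading of the rules (hypothetical helper)."""
    # Normalise empty strings to None so '' and NULL behave alike.
    data = [(name, pno or None, sid or None) for _, name, pno, sid in rows]

    # Collect the PersonalNo values known for each (name, StringId) pair.
    known = defaultdict(set)
    for name, pno, sid in data:
        if pno and sid:
            known[(name, sid)].add(pno)

    # Fill a missing PersonalNo when the pair has exactly one known value.
    filled = []
    for name, pno, sid in data:
        candidates = known.get((name, sid), set())
        if pno is None and len(candidates) == 1:
            pno = next(iter(candidates))
        filled.append((name, pno, sid))

    # Group the StringId values (None included) per (name, PersonalNo).
    by_person = defaultdict(list)
    for name, pno, sid in filled:
        by_person[(name, pno)].append(sid)

    result = {}
    for (name, _pno), sids in by_person.items():
        # Rule 5: a lone row without a StringId is not reported at all.
        if len(sids) == 1 and sids[0] is None:
            continue
        # Rule 1: the same PersonalNo with differing or missing StringId.
        bad = 1 if len(set(sids)) > 1 else 0
        result[name] = max(result.get(name, 0), bad)

    # Update 1: one StringId shared by several PersonalNo values is wrong.
    for (name, _sid), pnos in known.items():
        if len(pnos) > 1:
            result[name] = 1
    return result


print(check(ROWS))
```

Under this reading, Ralph and the Jennifer row with PersonalNo 4805250141 are dropped before any comparison happens, which is why Jennifer ends up as 0 rather than 1.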

Query

LEFT JOIN (SELECT Name,                 
                    (
                    SELECT CASE WHEN MIN(PersonalNo) <> MAX(d.PersonalNo) 
                                    and MIN(CASE WHEN StringId IS NULL THEN '0' ELSE StringId END) <> MAX(CASE WHEN d.StringId IS NULL THEN '0' ELSE d.StringId END) -- this is wrong                                                 
                                    and MIN(PersonalNo) <> ''
                                    and MIN(PersonalNo) IS NOT NULL                          
                                    and MAX(rn) > 1 THEN 1
                                 ELSE 0
                            END AS CheckPersonalNo 
                    FROM (                               
                        SELECT Name, PersonalNo, [StringId], ROW_NUMBER() OVER (PARTITION BY Name, PersonalNo ORDER BY Name) rn
                        FROM TableEmp e1 
                        WHERE Condition = 1 and e1.Name = d.Name                                 
                        ) sub2                              
                    GROUP BY Name
                    ) CheckPersonalNo                                                                                   
        FROM [TableEmp] d   
        WHERE Condition = 1
        GROUP BY Name        
        ) f ON f.Name = x.Name

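The problem with the query is that I can only group by Name and cannot add PersonalNo to the GROUP BY clause, so I have to use aggregates in the select list. But right now it only compares the MIN and MAX values, so if more than 2 rows have the same name it does not work as expected.

I need something that compares values within PARTITION BY Fullname, PersonalNo. Right now it compares values across all rows with the same Name (regardless of PersonalNo).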
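
One way to compare within (Name, PersonalNo) without MIN/MAX is to group twice: first per (Name, PersonalNo) to count distinct StringId values, then per Name to roll the flags up. The sketch below runs that idea against an in-memory SQLite database through Python's `sqlite3` module, purely so it is runnable; the table name `emp` is made up for illustration, and the same nested GROUP BY should carry over to SQL Server, though that is not tested here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table holding the sample data from the question.
conn.execute(
    "CREATE TABLE emp (Company TEXT, Name TEXT, PersonalNo TEXT, StringId TEXT)"
)
conn.executemany(
    "INSERT INTO emp VALUES (?, ?, ?, ?)",
    [
        ("Comp1", "Peter", "3850342515", "85426"),
        ("Comp2", "Peter", "3850342515", ""),
        ("Comp1", "John", None, "12345"),
        ("Comp2", "John", "3952525252", "12345"),
        ("Comp1", "Lisa", "4951212581", "52124"),
        ("Comp3", "Lisa", "4951212581", "52124"),
        ("Comp1", "Jennifer", "4805250141", ""),
        ("Comp1", "Jennifer", "4920225088", "55443"),
        ("Comp3", "Jennifer", None, "55443"),
        ("Comp1", "Ralph", "3961212256", ""),
    ],
)

# Inner query: one row per (Name, PersonalNo), flagged when that person
# has more than one distinct StringId (an empty string counts as a value).
# Outer query: a name is flagged when any of its persons is flagged.
QUERY = """
SELECT Name, MAX(bad) AS issue
FROM (
    SELECT Name, PersonalNo,
           CASE WHEN COUNT(DISTINCT StringId) > 1 THEN 1 ELSE 0 END AS bad
    FROM emp
    GROUP BY Name, PersonalNo
) AS per_person
GROUP BY Name
ORDER BY Name
"""
rows = conn.execute(QUERY).fetchall()
print(rows)
```

This works for any number of rows per name, but it is deliberately incomplete: it does not reconcile missing PersonalNo values, it does not drop lone rows without a StringId (so Ralph still shows up with 0), and COUNT(DISTINCT ...) ignores NULL StringId values.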

Any ideas? If you have any questions, ask me and I will try to explain.


Update 1

If there are two entries with different PersonalNo values but equal StringId values, the result should be 1 (wrong).

Company     Name    PersonalNo   StringId 
Comp1       Anna    4805250141    88552    -- different PersonalNo and the same StringId for both should go as 1 (wrong)
Comp1       Anna    4920225088    88552 

It currently returns:

Anna    0
Anna    0

It should be:

Anna    1

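Update 2

After the update, the UNION query returns StringId: 55443 in the Identifier column (for the data below). But in this case, where one entry has a PersonalNo and the other is blank while both have the same StringId, the data is correct (the result should be 0).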

Comp1       Jennifer    4920225088    55443  
Comp3       Jennifer    ''            55443

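【Question comments】:

  • Are there only 3 companies? Are the names fixed?
  • No, there are hundreds of companies and millions of employees. There can be more cases where the data is wrong, but I will handle those myself. I only need an idea of how to compare within a partition by Name, PersonalNo, and how to compare values depending on PersonalNo.
  • Can PersonalNo be different (non-null) while StringId is the same, and vice versa?
  • @Sameer Yes, it can, and in that case it should return 1 (wrong).

Tags: sql sql-server sql-server-2008 tsql group-by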


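【Solution 1】:

I hope I have understood your requirements.

There may be other ways to do this, but if it were me I would probably use a temp table for the intermediate work.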

--select data into a temp table that can be modified
select
    *
    into #cleaned
from 
    table1


--apply personal numbers based on other records with matching string id
--you could take note of the records you are doing this to for data clean up
update c
    set c.personalNo = s.personalNo
from #cleaned as c
    inner join table1 as s
        on c.name = s.name
        and c.stringID = s.stringID
        and c.personalNo is null
        and s.personalNo is not null

--find all records with non matching string ids
select 
    name
    ,PersonalNo
    ,count(*) as numIDs
    into #issues
from(
    select
        name
        ,PersonalNo
        ,stringID
    from 
        #cleaned
    group by
        name
        ,PersonalNo
        ,stringID
    ) as i
group by
    name
    ,PersonalNo
having 
    count(*) > 1

--select data for viewing.
select
    distinct
    s.name
    ,case
        when i.name is not null then 1
        else 0
    end as issue
from
    #cleaned as s
    left outer join #issues as i
        on s.name = i.name
        and s.personalNo = i.personalNo
order by issue desc

SQLFiddle:http://sqlfiddle.com/#!3/f4aab/7

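Sorry if there are mistakes here, but I'm sure you'll get the idea. It's not rocket science, just another approach.

Edit: I just noticed that you are also interested in rows without a string ID. If such a row is the only row, then it is not a problem. I modified the first select (into #cleaned) to take all rows.

Edit, without temp tables: now that you know what it is doing, here is the same thing without any temp tables. Be warned, though, that this updates the source table to assign the missing personalNo values.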

update c
    set c.personalNo = s.personalNo
from table1 as c
    inner join table1 as s
        on c.name = s.name
        and c.stringID = s.stringID
        and c.personalNo is null
        and s.personalNo is not null


select
    distinct
    s.name
    ,case
        when i.name is not null then 1
        else 0
    end as issue
from
    table1 as s
    left outer join (
                select 
                    name
                    ,PersonalNo
                    ,count(*) as numIDs
                from(
                    select
                        name
                        ,PersonalNo
                        ,stringID
                    from 
                        table1
                    group by
                        name
                        ,PersonalNo
                        ,stringID
                    ) as i
                group by
                    name
                    ,PersonalNo
                having 
                    count(*) > 1
        )
        as i
        on s.name = i.name
        and s.personalNo = i.personalNo
order by issue desc

SQLFiddle:http://sqlfiddle.com/#!3/f4aab/8

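PARTITIONING: I don't see how I would use partitioning here, since all you want to know is whether there is more than one row. I use partitioning on more complex tables, or when I want to rank the results of a judgment call about which data to update under more complex rules. But anyway, here is a partitioned version :D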

Select
    name
    ,personalNo
    ,case
        when numstrings > 1 then 1
        else 0 end as issue
from
    (select
        name
        ,personalNo
        ,row_number() over (partition by 
                                    name
                                    ,personalNo 
                                order by 
                                    name
                                    ,personalNo
                                    ,stringID
                                    ) as numstrings
    from
        #cleaned
    group by
        name
        ,personalNo
        ,stringid) as d
order by
    issue desc

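Note: this uses the #cleaned table from above, because I think the key thing that makes this difficult is that the personal number is sometimes missing.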
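
To make the behaviour of the partitioned query concrete, here is a small Python imitation of ROW_NUMBER() over distinct (name, personalNo, stringID) rows. It is only an illustration with made-up variable names, not the T-SQL itself. The point it shows is that `numstrings > 1` flags only the second and later rows of each partition, so a per-name flag still needs a MAX over those rows.

```python
from itertools import groupby

# Distinct (name, personalNo, stringID) rows, as the GROUP BY would produce,
# sorted the way the ORDER BY inside the OVER clause sorts them.
distinct_rows = sorted({
    ("Peter", "3850342515", "85426"),
    ("Peter", "3850342515", ""),
    ("Lisa", "4951212581", "52124"),
})

# Imitate ROW_NUMBER() OVER (PARTITION BY name, personalNo ORDER BY ...).
numbered = []
for _key, group in groupby(distinct_rows, key=lambda r: r[:2]):
    for rn, row in enumerate(group, start=1):
        numbered.append((row[0], row[1], rn))

# Per-row flag, as in the query: only rows numbered 2 or higher are flagged.
per_row = [(name, pno, 1 if rn > 1 else 0) for name, pno, rn in numbered]

# Rolling up to one flag per name takes the maximum of the row flags.
issue_by_name = {}
for name, _pno, flag in per_row:
    issue_by_name[name] = max(issue_by_name.get(name, 0), flag)

print(per_row)
print(issue_by_name)
```

Peter's first row comes out with 0 and his second with 1, so selecting the rows as they are gives two lines for Peter with different flags.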

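No temp tables, no updates

Obviously the approach above can be used without any temp tables and without updating anything. It is just a question of readability/maintainability and of whether it is actually faster. This version handles string IDs that have more than one personalNo assigned more robustly: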

select
    distinct
    s.name
    ,case
        when i.name is not null then 1
        else 0
    end as issue
from
    table1 as s
    left outer join (
                select 
                    name
                    ,PersonalNo
                    ,count(*) as numIDs
                from(
                    select
                        a.name
                        ,coalesce(a.PersonalNo,b.PersonalNo) as PersonalNo
                        ,a.stringID
                    from 
                        table1 as a
                            left outer join table1 as b
                                on a.name = b.name
                                and a.stringid=b.stringid
                                and a.personalNo != b.personalNo
                                and b.personalNo Is Not Null
                    group by
                        a.name
                        ,a.PersonalNo
                        ,a.stringID
                        ,b.PersonalNo
                    ) as i
                group by
                    name
                    ,PersonalNo
                having 
                    count(*) > 1
        )
        as i
        on s.name = i.name
        and s.personalNo = i.personalNo
order by issue desc

SQLFiddle:http://sqlfiddle.com/#!3/f4aab/9

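Edit: also looking for inconsistent personal numbers. This uses a temp table, but you can swap it out as in the previous example. Note that it deviates slightly from the original structure you asked about, because this is closer to how I would approach the task, but there is enough code here for you to rework it any way you like.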

--select data into a temp table that can be modified
select
    *
    into #cleaned
from 
    table1


--apply personal numbers based on other records with matching string id
--you could take note of the records you are doing this to for data clean up
update c
    set c.personalNo = s.personalNo
from #cleaned as c
    inner join table1 as s
        on c.name = s.name
        and c.stringID = s.stringID
        and c.personalNo is null
        and s.personalNo is not null


Select
    IssueType
     ,Name
     ,Identifier
from 
    (
        --find all records with non matching PersonalNos
        select 
            name
            ,cast('StringID: ' + stringID as nvarchar(400)) as Identifier
            ,cast('Inconsistent  PersonalNo' as nvarchar(400)) as issueType
        from(
            select
                name
                ,PersonalNo
                ,stringID
            from 
                #cleaned
            group by
                name
                ,PersonalNo
                ,stringID
            ) as i
        group by
            name
            ,StringId
        having 
            count(*) > 1

    UNION    
        --find all records with non matching string ids

        select 
            name
            ,'PersonalNo: ' + PersonalNo
            ,cast('Inconsistent String ID' as nvarchar(400)) as issueType
        from(
            select
                name
                ,PersonalNo
                ,stringID
            from 
                #cleaned
            group by
                name
                ,PersonalNo
                ,stringID
            ) as i
        group by
            name
            ,PersonalNo
        having 
            count(*) > 1
    ) as a

SQLFiddle:http://sqlfiddle.com/#!3/e9da2/18

Update: also accept empty-string personalNo values. This is another new requirement: treat an empty string in personalNo the same way as NULL.

--select data into a temp table that can be modified
select
    *
    into #cleaned
from 
    table1

--apply personal numbers based on other records with matching string id
--you could take note of the records you are doing this to for data clean up
update c
    set c.personalNo = s.personalNo
from #cleaned as c
    inner join table1 as s
        on c.name = s.name
        and c.stringID = s.stringID
        and  (c.personalNo IS NULL OR c.personalNo ='')
        and s.personalNo is not null
        and s.personalNo != ''


Select
     IssueType
     ,Name
     ,Identifier
from 
    (
        --find all records with non matching PersonalNos
        select 
            name
            ,cast('StringID: ' + stringID as nvarchar(400)) as Identifier
            ,cast('Inconsistent  PersonalNo' as nvarchar(400)) as issueType
        from(
            select
                name
                ,PersonalNo
                ,stringID
            from 
                #cleaned
            group by
                name
                ,PersonalNo
                ,stringID
            ) as i
        group by
            name
            ,StringId
        having 
            count(*) > 1

  UNION    
        --find all records with non matching string ids
        select 
            name
            ,'PersonalNo: ' + PersonalNo
            ,cast('Inconsistent String ID' as nvarchar(400)) as issueType
        from(
            select
                name
                ,PersonalNo
                ,stringID
            from 
                #cleaned
            group by
                name
                ,PersonalNo
                ,stringID
            ) as i
        group by
            name
            ,PersonalNo
        having 
            count(*) > 1
    ) as a

SQLFiddle:http://sqlfiddle.com/#!3/412127/8
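
The "blank counts as NULL" condition can also be written with NULLIF, which turns '' into NULL so that a single IS NULL test covers both cases. Below is a minimal sketch of the fill step followed by the inconsistent-PersonalNo check, run on SQLite through Python's `sqlite3` so it can be executed as is. The table name `cleaned` stands in for #cleaned, and the correlated-subquery form of the update (reading from the same table rather than from table1) is an assumption made for SQLite, not the UPDATE ... FROM syntax used above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Stand-in for the #cleaned temp table (hypothetical name).
conn.execute("CREATE TABLE cleaned (Name TEXT, PersonalNo TEXT, StringId TEXT)")
conn.executemany(
    "INSERT INTO cleaned VALUES (?, ?, ?)",
    [
        ("Jennifer", "4920225088", "55443"),
        ("Jennifer", "", "55443"),
        ("Anna", "4805250141", "88552"),
        ("Anna", "4920225088", "88552"),
    ],
)

# Fill a blank or NULL PersonalNo from a row with the same Name and StringId.
# NULLIF(x, '') yields NULL for '', so one IS NULL test handles both cases.
conn.execute("""
UPDATE cleaned
SET PersonalNo = (
    SELECT MIN(s.PersonalNo)
    FROM cleaned AS s
    WHERE s.Name = cleaned.Name
      AND s.StringId = cleaned.StringId
      AND NULLIF(s.PersonalNo, '') IS NOT NULL
)
WHERE NULLIF(PersonalNo, '') IS NULL
""")

# A StringId that maps to more than one PersonalNo is reported as inconsistent.
issues = conn.execute("""
SELECT Name, StringId
FROM cleaned
GROUP BY Name, StringId
HAVING COUNT(DISTINCT PersonalNo) > 1
ORDER BY Name
""").fetchall()

jennifer = conn.execute(
    "SELECT PersonalNo FROM cleaned WHERE Name = 'Jennifer' ORDER BY rowid"
).fetchall()
print(issues)
print(jennifer)
```

With the blank Jennifer row filled in first, only Anna is left as an inconsistency, which matches what Update 1 and Update 2 in the question ask for.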

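【Discussion】:

  • Thank you for the answer; I was hoping there would be another way to achieve this. Even though this seems to work, the query takes a long time to execute. In my question, the query shown is used as a subquery joined to a table. There are several million records, so inserting them all into a #temp table and then joining against #temp would take a long time, or am I wrong?
  • I would test it. I have used this kind of thing on millions of rows before and it still ran in a few seconds or less without problems. A few questions, though: 1. Is this a data cleanup task? If so, execution time should not be an issue. 2. What makes you think complex partitioning will be fast? 3. Can you think of any other way to make it actually work without first resolving your conflicting personal numbers? I will add a "partition by" solution to the answer, but IMHO it is not useful here; the real root of the problem is aligning those personalNo values first.
  • This is used in Reporting Services to see whether the data is correct for all employees. I will test it and let you know whether it works for me.
  • Added some extra information and a version without temp tables. Maybe some ninja has a better way, but at least you will have something that works :)
  • If it were me, I would seek approval from the business to assign the missing personal numbers based on stringID. That way you get straight to the answer and do not need the intermediate temp table. Alternatively, produce two reports: one on missing personal numbers, and one on misaligned data that also shows the missing personal numbers as a by-product, since that may be easier for the business to understand.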