Window functions to count distinct records

【Question title】: Window functions to count distinct records
【Posted】: 2012-11-08 23:21:36
【Question】:

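The query below is based on a complicated view, and the view works as I want it to (I'm not going to include the view because I don't think it will help with the question at hand). What I can't get right is the drugCountsInFamilies column. I need it to show the number of distinct drugNames for each drug family. You can see from the first screenshot that there are three different H3A rows. The drugCountsInFamilies for H3A should be 3 (there are three different H3A drugs).

You can see from the second screenshot that the drugCountsInFamilies value in the first screenshot is actually picking up the number of rows on which the drug name is listed.

Below is my query, with a comment on the part that is incorrect: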

select distinct
     rx.patid
    ,d2.fillDate
    ,d2.scriptEndDate
    ,rx.drugName
    ,rx.drugClass
    --the line directly below is the one that I can't figure out why it's wrong
    ,COUNT(rx.drugClass) over(partition by rx.patid,rx.drugclass,rx.drugname) as drugCountsInFamilies
from 
(
select 
    ROW_NUMBER() over(partition by d.patid order by d.patid,d.uniquedrugsintimeframe desc) as rn
    ,d.patid
    ,d.fillDate
    ,d.scriptEndDate
    ,d.uniqueDrugsInTimeFrame
    from DrugsPerTimeFrame as d
)d2
inner join rx on rx.patid = d2.patid
inner join DrugTable as dt on dt.drugClass=rx.drugClass
where d2.rn=1 and rx.fillDate between d2.fillDate and d2.scriptEndDate
and dt.drugClass in ('h3a','h6h','h4b','h2f','h2s','j7c','h2e')
order by rx.patid

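If I try to add a distinct to the count(rx.drugClass) clause, SSMS gets mad. Can this be done using window functions?

【Comments】:

@littlebobbytables I can't use count(distinct rx.drugClass) over(partition by...) without getting an error.
@LittleBobbyTables That doesn't work, because I need two different drugs of the same class. If the same drugName is listed twice, it is also the same class, but I need it counted once.
Your request doesn't make sense, because you have rx.drugClass in the partition clause. So count(distinct rx.drugClass) will always return 1.
I didn't think that was right, but I've tried every sensible partition under the sun to get it to give me the right answer. I may end up just using a derived table, but I was hoping I could solve it this way.

Tags: sql sql-server-2008 tsql

【Solution 1】: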

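Doing a count(distinct) as a window function requires a trick. Several levels of tricks, actually.

Because your request as written is trivially simple (the value is always 1, because rx.drugClass is in the partitioning clause), I will make an assumption. Let's say you want to count the number of unique drug classes per patient.

If so, do a row_number() partitioned by patid and drugClass. When this is 1, a new drugClass is starting within a patid. Create a flag that is 1 in this case and 0 in all other cases.

Then you can simply do a sum with a partitioning clause to get the number of distinct values.

The query (after formatting it so I can read it) looks like: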

select t.patid, t.fillDate, t.scriptEndDate, t.drugName, t.drugClass,
       SUM(t.IsFirstRowInGroup) over (partition by t.patid) as NumDrugCount
from (select distinct rx.patid, d2.fillDate, d2.scriptEndDate, rx.drugName, rx.drugClass,
             (case when 1 = ROW_NUMBER() over (partition by rx.drugClass, rx.patid order by (select NULL))
                   then 1 else 0
              end) as IsFirstRowInGroup
      from (select ROW_NUMBER() over(partition by d.patid order by d.patid,d.uniquedrugsintimeframe desc) as rn, 
                   d.patid, d.fillDate, d.scriptEndDate, d.uniqueDrugsInTimeFrame
            from DrugsPerTimeFrame as d
           ) d2 inner join
           rx
           on rx.patid = d2.patid inner join
           DrugTable dt
           on dt.drugClass = rx.drugClass
      where d2.rn=1 and rx.fillDate between d2.fillDate and d2.scriptEndDate and
            dt.drugClass in ('h3a','h6h','h4b','h2f','h2s','j7c','h2e')
     ) t
order by patid
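
The same flag-and-sum idea can be shown on a handful of rows. The following is only a minimal sketch, not the answerer's query: the sample rows and drug names are made up, and it assumes the goal stated in the question (distinct drugName values per patid and drugClass) rather than the per-patient class count assumed above. The flag is therefore partitioned by patid, drugClass and drugName, and the sum by patid and drugClass.

```sql
-- Hypothetical sample rows standing in for the joined rx data.
with rx (patid, drugClass, drugName) as (
    select * from (values
        (1, 'h3a', 'drugA'),
        (1, 'h3a', 'drugA'),   -- duplicate drug name, must be counted once
        (1, 'h3a', 'drugB'),
        (1, 'h3a', 'drugC'),
        (1, 'h6h', 'drugD')
    ) as v (patid, drugClass, drugName)
),
flagged as (
    -- Flag exactly one row per (patid, drugClass, drugName) combination.
    select patid, drugClass, drugName,
           case when 1 = row_number() over (partition by patid, drugClass, drugName
                                            order by (select null))
                then 1 else 0
           end as IsFirstRowInGroup
    from rx
)
-- Summing the flags per (patid, drugClass) gives the number of distinct drug names.
select patid, drugClass, drugName,
       sum(IsFirstRowInGroup) over (partition by patid, drugClass) as drugCountsInFamilies
from flagged
order by patid, drugClass, drugName;
```

With these rows every h3a row carries 3 and the h6h row carries 1, because each distinct name contributes exactly one flagged row to its partition.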

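【Comments】:

Do you know of a good resource on the logic behind formatting SQL? What I think is readable and what SQL Server and other people on SO say are two very different things. Thanks for your answer! :)
@wootscootinboogie . . . My style is rather personal, developed over the years (decades) that I have used SQL. I think I describe it in Chapter 1 of "Data Analysis Using SQL and Excel". I find that when I try to adapt to a radically different format, I tend to miss important elements of the language.
@GordonLinoff Using case when 1 = ROW_NUMBER()... so that it can be summed later in the next inline view is a brilliant masterstroke.
@DougPorter . . . Please do not reformat my code. You can learn about my indentation style in Chapter 1 of "Data Analysis Using SQL and Excel". It is intentional.

【Solution 2】:

Why wouldn't something like this work?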

SELECT 
   IDCol_1
  ,IDCol_2
  ,Count(*) Over(Partition By IDCol_1, IDCol_2 order by IDCol_1) as numDistinct
FROM Table_1

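【Comments】:

What does this add to the answers from 2012? Note: the order by in the window of a count is pointless.
Is this a question or an answer?

【Solution 3】:

I came across this question while trying to solve my own problem with counting distinct values. While searching for an answer, I came across this post. See the last comment. I have tested it and used the SQL. It works really well for me, so I figured I would provide another solution here.

In summary, use DENSE_RANK(), with PARTITION BY on the grouping columns, and ORDER BY both ASC and DESC on the column to be counted: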

DENSE_RANK() OVER (PARTITION BY drugClass ORDER BY drugName ASC) +
DENSE_RANK() OVER (PARTITION BY drugClass ORDER BY drugName DESC) - 1 AS drugCountsInFamilies

I use this as a template for myself:

DENSE_RANK() OVER (PARTITION BY PartitionByFields ORDER BY OrderByFields ASC ) +
DENSE_RANK() OVER (PARTITION BY PartitionByFields ORDER BY OrderByFields DESC) - 1 AS DistinctCount
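
The template works because a row whose value has ascending rank r among k distinct values has descending rank k - r + 1, so the two ranks minus 1 always add up to k. Below is a small sketch on made-up rows; the sample data is hypothetical, and partitioning by both patid and drugClass is an assumption taken from the question rather than from the expression above.

```sql
-- Hypothetical sample rows standing in for the joined rx data.
with rx (patid, drugClass, drugName) as (
    select * from (values
        (1, 'h3a', 'drugA'),
        (1, 'h3a', 'drugA'),   -- duplicate drug name, shares one rank
        (1, 'h3a', 'drugB'),
        (1, 'h3a', 'drugC'),
        (2, 'h3a', 'drugA')    -- another patient, counted separately
    ) as v (patid, drugClass, drugName)
)
select patid, drugClass, drugName,
       -- ascending rank + descending rank - 1 = number of distinct drug names
       dense_rank() over (partition by patid, drugClass order by drugName asc)
     + dense_rank() over (partition by patid, drugClass order by drugName desc)
     - 1 as drugCountsInFamilies
from rx
order by patid, drugClass, drugName;
```

Here patient 1 gets 3 on every h3a row and patient 2 gets 1. One caveat: DENSE_RANK ranks NULL as a value of its own, so if the counted column can be NULL the result is one higher than COUNT(DISTINCT ...) would give.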

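I hope this helps!

【Comments】:

@tibtib I suggest writing out your complete response so that it is clear what you are aiming for.
My previous comment was incorrect, so I deleted it. At a basic level, I have just noticed that the MAX of DENSE_RANK can give the same result in a slightly more intuitive way. You just have to wrap the aggregate MAX in a window function to partition properly. So, something like: SELECT class, name, MAX(MAX(dense_rank)) OVER (PARTITION BY class) FROM (SELECT class, name, DENSE_RANK() OVER (PARTITION BY class ORDER BY name)) AS ex_table GROUP BY 1, 2
-1 Wrong: suppose you have three columns: time, animalId, cameraId. I want to select the number of distinct cameras for each animal. This will give you 1 for all results.
That is, if I PARTITION BY animalId, cameraId. Otherwise, if I partition by a single column, the count is over the whole window rather than per animal.
@wootscootinboogie You should be able to mark this as the accepted answer. It is much easier than the one originally marked.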

【Solution 4】:

I think what you are trying to do, written as a window function, is:

COUNT(DISTINCT rx.drugName) over(partition by rx.patid,rx.drugclass) as drugCountsInFamilies

which SQL Server complains about. But you can do this instead:

SELECT 
rx.patid
, rx.drugName
, rx.drugClass
, (SELECT COUNT(DISTINCT rx2.drugName) FROM rx rx2 WHERE rx2.drugClass = rx.DrugClass AND rx2.patid = rx.patid) As drugCountsInFamilies
FROM rx
...

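If the table is large, it is best to put an index on one of the columns used in the subquery (patid, for example) so that the nested query does not consume a lot of resources.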
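
As a rough sketch of that suggestion, and assuming rx is a base table rather than a view, a composite index on the columns the correlated subquery touches might look like the following. The index name is hypothetical, and whether the optimizer actually uses the index depends on the data and the execution plan.

```sql
-- Hypothetical index name; covers the subquery's filter columns (patid, drugClass)
-- and the counted column (drugName), so the lookup can be served from the index alone.
create index IX_rx_patid_drugClass_drugName
    on rx (patid, drugClass, drugName);
```

Putting patid first matches the answer's suggestion; the other two columns are an addition so the subquery does not have to go back to the table for drugClass and drugName.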

【Comments】: