【Question title】: How to find the most common value in each column
【Posted】: 2021-10-09 10:12:18
【Question】:

I have a table that looks like this:

category    name1    name2    name3    name4    name5

  1         John      Sam     John     Katy      Cat
  1         John      Ivan    Bob      Andrew    Tom
  1         Sam       Ivan    George   Bob       Tom
  2         Jack      Siri    Elsa     Noah      Anna
  2         Jack      Bob     Tomas    Noah      Tom

What I need is the most common value in each column, within each category. That is, I need the following result:

category    name1    name2    name3    name4    name5

  1         John      Ivan    John     Katy      Tom
  2         Jack      Siri    Elsa     Noah      Anna

If several values have the same frequency, any one of them may be chosen.

So far, I have only managed this for a single column, with this script:

SELECT top(1) category, name1, COUNT(name1) AS freq
FROM data
GROUP BY category, name1
ORDER BY freq DESC

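But how can I do this for multiple columns in SQL Server?

【Comments】:

  • "So far, I have only found the code to do that with one column." Why not repeat that code for the other columns?
  • Show the code you wrote for one column.
  • As per the question guide, please show what you have tried and tell us what you found (on this site or elsewhere) and why it did not meet your needs.
  • category 2, name2: Siri and Bob are equally common, so why did you pick Siri?
  • @Adamszsz Added the code to the question.

Tags: sql sql-server tsql window-functions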


【Solution 1】:

If you don't mind getting the results as rows, you can unpivot, which is simpler:

select category, which, name
from (select t.category, v.which, v.name, count(*) as cnt,
             row_number() over (partition by t.category, v.which order by count(*) desc) as seqnum
      from t cross apply
           (values (1, name1), (2, name2), (3, name3), (4, name4), (5, name5)
           ) v(which, name)
      group by t.category, v.which, v.name
      ) cwn
where seqnum = 1;

If you want to pivot the result back into columns:

with cwn as (
      select t.category, v.which, v.name, count(*) as cnt,
             row_number() over (partition by t.category, v.which order by count(*) desc) as seqnum
      from t cross apply
           (values (1, name1), (2, name2), (3, name3), (4, name4), (5, name5)
           ) v(which, name)
      group by t.category, v.which, v.name
     )
select category,
       max(case when which = 1 then name end) as name1,
       max(case when which = 2 then name end) as name2,
       max(case when which = 3 then name end) as name3,
       max(case when which = 4 then name end) as name4,
       max(case when which = 5 then name end) as name5
from cwn
where seqnum = 1
group by category
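
The question accepts any value when frequencies tie, but a comment above asks why Siri was picked over Bob. A minimal sketch of a variant that resolves ties the same way on every run, assuming the table is named t as in this answer and that breaking ties alphabetically by name is acceptable (that rule is an assumption, not something the question asks for):

```sql
-- Sketch only: same unpivot-and-rank idea, with name added as a secondary
-- sort key so that equally frequent values are ranked consistently.
-- The alphabetical tie-break is an illustrative assumption.
with cwn as (
      select t.category, v.which, v.name, count(*) as cnt,
             row_number() over (partition by t.category, v.which
                                order by count(*) desc, v.name asc) as seqnum
      from t cross apply
           (values (1, name1), (2, name2), (3, name3), (4, name4), (5, name5)
           ) v(which, name)
      group by t.category, v.which, v.name
     )
select category, which, name, cnt
from cwn
where seqnum = 1
order by category, which;
```

With the sample data from the question, Siri and Bob each appear once in name2 for category 2, so under a typical collation this variant would return Bob there rather than Siri.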

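【Comments】:

  • Thank you very much for your help!
  • @OlegIvanytskyi . . . You may also find that this is the fastest approach.
  • Fastest? Why? Did you test it? Can you show that it is the fastest?
【Solution 2】:

One option, with a lot of repetition, but adapted to your current structure...

(Though it still has the same non-deterministic/arbitrary behaviour when dealing with equally frequent names)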

WITH
  counted AS
(
  SELECT
    category,
    name1,
    COUNT(*) OVER (PARTITION BY category, name1)  AS name1_freq,
    name2,
    COUNT(*) OVER (PARTITION BY category, name2)  AS name2_freq
  FROM
    yourTable
),
  ranked AS
(
  SELECT
    category,
    name1,
    ROW_NUMBER() OVER (PARTITION BY category ORDER BY name1_freq DESC)  AS name1_rank,
    name2,
    ROW_NUMBER() OVER (PARTITION BY category ORDER BY name2_freq DESC)  AS name2_rank
  FROM
    counted
)
SELECT
  category,
  MAX(CASE WHEN name1_rank = 1 THEN name1 END)  AS name1_most_common,
  MAX(CASE WHEN name2_rank = 1 THEN name2 END)  AS name2_most_common
FROM
  ranked
GROUP BY
  category

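As you can see, that is a lot of repetition, which is why SQL is built around normalised data structures. So much so that normalising and then de-normalising your structure may be a valid option...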

WITH
  normalised(category, col, name) AS
(
            SELECT category, 1, name1 FROM yourTable 
  UNION ALL SELECT category, 2, name2 FROM yourTable
),
  counted AS
(
  SELECT
    category, col, name, COUNT(*) AS freq
  FROM
    normalised
  GROUP BY
    category, col, name
),
  ranked AS
(
  SELECT
    category, col, name,
    ROW_NUMBER() OVER (PARTITION BY category, col ORDER BY freq DESC)  AS rank
  FROM
    counted
)
SELECT
  category,
  MAX(CASE WHEN col = 1 THEN name END)  AS name1_most_common,
  MAX(CASE WHEN col = 2 THEN name END)  AS name2_most_common
FROM
  ranked
WHERE
  rank = 1
GROUP BY
  category

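【Comments】:

    【Solution 3】:

    First create a CTE that uses the COUNT() window function to return how many times each name appears within each category, then use the FIRST_VALUE() window function to pick, for each column, the name with the highest count: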

    WITH cte AS (
      SELECT *,
             COUNT(*) OVER (PARTITION BY category, name1) count1,
             COUNT(*) OVER (PARTITION BY category, name2) count2,
             COUNT(*) OVER (PARTITION BY category, name3) count3,
             COUNT(*) OVER (PARTITION BY category, name4) count4,
             COUNT(*) OVER (PARTITION BY category, name5) count5
      FROM tablename
    )
    SELECT DISTINCT category,
           FIRST_VALUE(name1) OVER (PARTITION BY category ORDER BY count1 DESC) name1,
           FIRST_VALUE(name2) OVER (PARTITION BY category ORDER BY count2 DESC) name2,
           FIRST_VALUE(name3) OVER (PARTITION BY category ORDER BY count3 DESC) name3,
           FIRST_VALUE(name4) OVER (PARTITION BY category ORDER BY count4 DESC) name4,
           FIRST_VALUE(name5) OVER (PARTITION BY category ORDER BY count5 DESC) name5
    FROM cte
    

    See the demo.
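
To try the query above locally, the sample rows from the question can be loaded with a short setup script. This is a sketch: the table name tablename follows this answer, and the column types are assumptions, since the question does not give them.

```sql
-- Hypothetical setup: INT and VARCHAR(50) are assumed column types.
CREATE TABLE tablename (
  category INT,
  name1 VARCHAR(50),
  name2 VARCHAR(50),
  name3 VARCHAR(50),
  name4 VARCHAR(50),
  name5 VARCHAR(50)
);

-- Sample rows copied from the table in the question.
INSERT INTO tablename (category, name1, name2, name3, name4, name5) VALUES
  (1, 'John', 'Sam',  'John',   'Katy',   'Cat'),
  (1, 'John', 'Ivan', 'Bob',    'Andrew', 'Tom'),
  (1, 'Sam',  'Ivan', 'George', 'Bob',    'Tom'),
  (2, 'Jack', 'Siri', 'Elsa',   'Noah',   'Anna'),
  (2, 'Jack', 'Bob',  'Tomas',  'Noah',   'Tom');
```

Only some columns have a clear winner in this data: John, Ivan and Tom for name1, name2 and name5 in category 1, and Jack and Noah for name1 and name4 in category 2. Every other column is a tie, so the value returned there may vary.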

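    【Comments】:

    • Concise and elegant. Thank you very much!
    • @OlegIvanytskyi - I'd suggest (because of your data structure, not because of forpas) that this violates the DRY principle (Don't Repeat Yourself), and so is not elegant. The solution should be to "fix" your data structure, or to pivot it on the fly (as in the other answers). Having to repeat all the real logic for every column is a strong code smell. I'd also be curious to know which is the most performant / lowest cost. (Imagine how much cleaner this answer would be if there were one row per name rather than one column per name...)
    • @MatBailie That would indeed be great, but I can't change this structure because I didn't create this table. I will definitely look at the other answers, but this one is simple and short, which is why I marked it as the answer to my question.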