SQL (or Python): select a value once from similar rows

【Question title】: SQL (or python) select value once from similar rows
【Posted】: 2020-09-14 21:44:54
【Question description】:

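I have data in an Oracle table, and I need to select certain rows from it based on the number of unique values that are repeated because of grouping. My data looks like this.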

| LINE | BUCKET | TERM | COURSE     |
|------|--------|------|------------|
| 1001 | 1      | FA18 | COURSE 101 |
| 1001 | 1      | SP19 | COURSE 102 |
| 1001 | 1      | SP19 | COURSE 103 |
| 1001 | 1      | FA19 | COURSE 104 |
| 1001 | 2      | FA18 | COURSE 101 |
| 1001 | 2      | SP19 | COURSE 102 |
| 1001 | 2      | SP19 | COURSE 103 |
| 1001 | 2      | FA19 | COURSE 104 |
| 2001 | 1      | FA18 | COURSE 201 |
| 2001 | 1      | SP19 | COURSE 202 |
| 2001 | 1      | FA20 | COURSE 203 |
| 2001 | 2      | FA18 | COURSE 201 |
| 2001 | 2      | SP19 | COURSE 202 |
| 2001 | 2      | FA20 | COURSE 203 |
| 2001 | 3      | FA18 | COURSE 201 |
| 2001 | 3      | SP19 | COURSE 202 |
| 2001 | 3      | FA20 | COURSE 203 |

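There are two lines in the data. The first line (1001) has 2 distinct buckets and 4 distinct courses. The second line (2001) has 3 distinct buckets and 3 distinct courses. I need to select only one row for each course within a line, while using as many buckets as possible. The math is simple:

Line 1001: 4 (courses) / 2 (buckets) = 2 courses per bucket
Line 2001: 3 (courses) / 3 (buckets) = 1 course per bucket

How can I select each course once per line, spread across the buckets, so that the result looks like this?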
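
The arithmetic above can be sketched in Python as follows. This is only an illustration: the helper name `courses_per_bucket` is hypothetical, the rows are rebuilt from the sample table (the TERM column is left out), and rounding up with `ceil` is an assumption for the case where the division is not exact, which the sample data never shows.

```python
from collections import defaultdict
from math import ceil

# (line, bucket, course) triples rebuilt from the sample table.
rows = [(1001, b, f"COURSE {c}") for b in (1, 2) for c in (101, 102, 103, 104)]
rows += [(2001, b, f"COURSE {c}") for b in (1, 2, 3) for c in (201, 202, 203)]


def courses_per_bucket(rows):
    """Return {line: number of courses each bucket should receive}."""
    buckets = defaultdict(set)
    courses = defaultdict(set)
    for line, bucket, course in rows:
        buckets[line].add(bucket)
        courses[line].add(course)
    # ceil() is an assumption for uneven splits; the sample divides exactly.
    return {line: ceil(len(courses[line]) / len(buckets[line])) for line in buckets}


quota = courses_per_bucket(rows)
print(quota)  # {1001: 2, 2001: 1}
```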

| LINE | BUCKET | TERM | COURSE     |
|------|--------|------|------------|
| 1001 | 1      | FA18 | COURSE 101 |
| 1001 | 1      | SP19 | COURSE 102 |
| 1001 | 2      | SP19 | COURSE 103 |
| 1001 | 2      | FA19 | COURSE 104 |
| 2001 | 1      | FA18 | COURSE 201 |
| 2001 | 2      | SP19 | COURSE 202 |
| 2001 | 3      | FA20 | COURSE 203 |

The solution can be in SQL or python.

【Question comments】:

Tags: python sql oracle

【Solution 1】:

The basic idea is row_number(). If you only need a random sample of buckets:

select t.*
from (select t.*,
             row_number() over (partition by line, course order by dbms_random.random) as seqnum
      from t
     ) t
where seqnum = 1;
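
For readers who prefer Python, the random pick per (line, course) might be sketched like this. It is not the answer's code: the function name `pick_random_bucket` is hypothetical, the rows are rebuilt from the question's sample table without the TERM column, and a seeded `random.Random` stands in for `dbms_random.random` so the run is repeatable.

```python
import random
from collections import defaultdict

# (line, bucket, course) triples rebuilt from the question's sample table.
rows = [(1001, b, f"COURSE {c}") for b in (1, 2) for c in (101, 102, 103, 104)]
rows += [(2001, b, f"COURSE {c}") for b in (1, 2, 3) for c in (201, 202, 203)]


def pick_random_bucket(rows, rng):
    """Keep one randomly chosen bucket for every (line, course) pair."""
    candidates = defaultdict(list)
    for line, bucket, course in rows:
        candidates[(line, course)].append(bucket)
    # Equivalent in spirit to row_number() ... order by random, seqnum = 1.
    return {key: rng.choice(buckets) for key, buckets in candidates.items()}


picked = pick_random_bucket(rows, random.Random(0))
for (line, course), bucket in sorted(picked.items()):
    print(line, bucket, course)
```

Each course is drawn independently, so nothing in this sketch forces every bucket to be used.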

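If you really want to round-robin across the buckets (to guarantee that the maximum number of buckets is used), random selection alone is not enough: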

select t.*
from (select t.*,
             row_number() over (partition by line, course order by seqnum_bucket, dbms_random.random) as seqnum
      from (select t.*,
                   row_number() over (partition by line, course, bucket order by dbms_random.random) as seqnum_bucket
            from t
           ) t
     ) t
where seqnum = 1;

【Comments】:

@JElwood . . . You can replace random with any column you like. Your question sounds like you want a random sample.

【Solution 2】:

If your buckets always start at 1 and increase by 1, you can use a very simple mod(row_number, max(bucket)):

select
   line, term, course
  ,1+mod(-1+row_number()over(partition by line order by course),max(bucket)) bucket_n
from t 
group by line, term, course
order by line,course;

Full example with the sample data:

with t(LINE ,BUCKET ,TERM ,COURSE) as (
   select 1001, 1, 'FA18', 'COURSE 101' from dual union all
   select 1001, 1, 'SP19', 'COURSE 102' from dual union all
   select 1001, 1, 'SP19', 'COURSE 103' from dual union all
   select 1001, 1, 'FA19', 'COURSE 104' from dual union all
   select 1001, 2, 'FA18', 'COURSE 101' from dual union all
   select 1001, 2, 'SP19', 'COURSE 102' from dual union all
   select 1001, 2, 'SP19', 'COURSE 103' from dual union all
   select 1001, 2, 'FA19', 'COURSE 104' from dual union all
   select 2001, 1, 'FA18', 'COURSE 201' from dual union all
   select 2001, 1, 'SP19', 'COURSE 202' from dual union all
   select 2001, 1, 'FA20', 'COURSE 203' from dual union all
   select 2001, 2, 'FA18', 'COURSE 201' from dual union all
   select 2001, 2, 'SP19', 'COURSE 202' from dual union all
   select 2001, 2, 'FA20', 'COURSE 203' from dual union all
   select 2001, 3, 'FA18', 'COURSE 201' from dual union all
   select 2001, 3, 'SP19', 'COURSE 202' from dual union all
   select 2001, 3, 'FA20', 'COURSE 203' from dual
)
select
   line, term, course
  ,1+mod(-1+row_number()over(partition by line order by course),max(bucket)) bucket_n
from t 
group by line, term, course
order by line,course;

Result:

LINE    TERM    COURSE      BUCKET_N
------- ------- ----------- --------
1001    FA18    COURSE 101  1
1001    SP19    COURSE 102  2
1001    SP19    COURSE 103  1
1001    FA19    COURSE 104  2
2001    FA18    COURSE 201  1
2001    SP19    COURSE 202  2
2001    FA20    COURSE 203  3
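
The same modular assignment can be sketched in Python. This is an illustration under stated assumptions, not a translation of the query: the name `assign_round_robin` is hypothetical, the TERM column is left out, and the highest bucket is taken per line, which matches the query on the sample data because every course appears in every bucket of its line.

```python
from collections import defaultdict

# (line, bucket, course) triples rebuilt from the sample data.
rows = [(1001, b, f"COURSE {c}") for b in (1, 2) for c in (101, 102, 103, 104)]
rows += [(2001, b, f"COURSE {c}") for b in (1, 2, 3) for c in (201, 202, 203)]


def assign_round_robin(rows):
    """Assign bucket 1 + (rn - 1) % max_bucket to the rn-th course of a line."""
    max_bucket = defaultdict(int)
    courses = defaultdict(set)
    for line, bucket, course in rows:
        max_bucket[line] = max(max_bucket[line], bucket)
        courses[line].add(course)
    result = []
    for line in sorted(courses):
        # rn plays the role of row_number() over (partition by line order by course).
        for rn, course in enumerate(sorted(courses[line]), start=1):
            result.append((line, course, 1 + (rn - 1) % max_bucket[line]))
    return result


assignment = assign_round_robin(rows)
for line, course, bucket_n in assignment:
    print(line, course, bucket_n)
```

Like the query, this only works when the buckets of a line are numbered 1, 2, 3 and so on without gaps.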

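Another interesting variant is to aggregate the buckets and extract a value by its position, mod(rownumber, count(buckets)). Unlike the previous solution, it works with any bucket values, not only buckets numbered from 1 in steps of 1: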

select
   line, term, course
  ,xmlcast(
     xmlelement(
        "buckets",  
        xmlagg(xmlelement("bucket", bucket))
     ).extract('/buckets/*['||
      (1+mod(-1+row_number()over(partition by line order by course),count(bucket)))
     ||']')
    as int) bucket_n_2
from t 
group by line, term, course
order by line,course;

Full test case:

with t(LINE ,BUCKET ,TERM ,COURSE) as (
   select 1001, 1, 'FA18', 'COURSE 101' from dual union all
   select 1001, 1, 'SP19', 'COURSE 102' from dual union all
   select 1001, 1, 'SP19', 'COURSE 103' from dual union all
   select 1001, 1, 'FA19', 'COURSE 104' from dual union all
   select 1001, 2, 'FA18', 'COURSE 101' from dual union all
   select 1001, 2, 'SP19', 'COURSE 102' from dual union all
   select 1001, 2, 'SP19', 'COURSE 103' from dual union all
   select 1001, 2, 'FA19', 'COURSE 104' from dual union all
   select 2001, 1, 'FA18', 'COURSE 201' from dual union all
   select 2001, 1, 'SP19', 'COURSE 202' from dual union all
   select 2001, 1, 'FA20', 'COURSE 203' from dual union all
   select 2001, 2, 'FA18', 'COURSE 201' from dual union all
   select 2001, 2, 'SP19', 'COURSE 202' from dual union all
   select 2001, 2, 'FA20', 'COURSE 203' from dual union all
   select 2001, 3, 'FA18', 'COURSE 201' from dual union all
   select 2001, 3, 'SP19', 'COURSE 202' from dual union all
   select 2001, 3, 'FA20', 'COURSE 203' from dual
)
select
   line, term, course
  ,1+mod(-1+row_number()over(partition by line order by course),max(bucket)) bucket_n
  ,xmlcast(
     xmlelement(
        "buckets",  
        xmlagg(xmlelement("bucket", bucket))
     ).extract('/buckets/*['||
      (1+mod(-1+row_number()over(partition by line order by course),count(bucket)))
     ||']')
    as int) bucket_n_2
from t 
group by line, term, course
order by line,course;

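xmlagg(xmlelement("bucket", bucket)) aggregates all bucket values of the group.
extract('/buckets/*[N]') extracts the N-th bucket from the aggregated value.
(1+mod(-1+row_number()over(partition by line order by course),count(bucket))) computes N, the position of the bucket to extract.

Result: BUCKET_N is the previous variant, BUCKET_N_2 is the new one: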

LINE    TERM    COURSE      BUCKET_N    BUCKET_N_2
1001    FA18    COURSE 101          1           1
1001    SP19    COURSE 102          2           2
1001    SP19    COURSE 103          1           1
1001    FA19    COURSE 104          2           2
2001    FA18    COURSE 201          1           1
2001    SP19    COURSE 202          2           3
2001    FA20    COURSE 203          3           2
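
The positional idea behind BUCKET_N_2 can also be sketched in Python. The function name `assign_by_position` and the bucket labels 5 and 9 are hypothetical, chosen only to show that the labels need not start at 1. One deliberate difference: the sketch sorts the bucket labels before indexing, whereas the query aggregates them without an explicit order, so the two may pair courses and buckets differently.

```python
from collections import defaultdict

# Hypothetical data: line 1001 uses bucket labels 5 and 9 instead of 1 and 2.
rows = [(1001, b, f"COURSE {c}") for b in (5, 9) for c in (101, 102, 103, 104)]


def assign_by_position(rows):
    """Give the i-th course of a line the bucket at position i % bucket count."""
    buckets = defaultdict(set)
    courses = defaultdict(set)
    for line, bucket, course in rows:
        buckets[line].add(bucket)
        courses[line].add(course)
    result = []
    for line in sorted(courses):
        ordered = sorted(buckets[line])  # explicit order, unlike the xmlagg call
        for i, course in enumerate(sorted(courses[line])):
            result.append((line, course, ordered[i % len(ordered)]))
    return result


by_position = assign_by_position(rows)
for line, course, bucket in by_position:
    print(line, course, bucket)
```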
