Partition SQL result set into chunks: answers

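【Question title】: Partition SQL result set into chunks
【Posted】: 2015-04-11 06:39:57
【Question description】:

In preparation for batch processing, I need to partition groups of records so that parallel streams of jobs can be run. The records come from a table that may have millions of rows. My goal is to break these records (by primary key) into chunks of (approximately) equal size, which can then be processed in parallel. I would like to choose the chunk size dynamically. It is probably also worth noting that there may be gaps in the primary key sequence.

In other words, given this table, the predicate specifies the number of chunks, and the result set gives the first and last sequence of each chunk: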

seq    | name   |
-------|--------|
1      | john   |
2      | joe    |
3      | joe    |
4      | joe    |
5      | joe    |
567    | kent   |
568    | katie  |
20000  | sue    |
200016 | jill   |
200027 | bill   |

I would get the following results, where (number-of-chunks) -> (first-seq, last-seq):

(2) -> (1,5),(567,200027)
(5) -> (1,2),(3,4),(5,567),(568,20000),(200016,200027)

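Or, as a result set, something like this (when asking for 5 chunks):

first_seq   | last_seq |
------------|----------|
1           | 2        |
3           | 4        |
5           | 567      |
568         | 20000    |
200016      | 200027   |

I assume some sort of window function is in order here, but I am not sure how to approach this. Can anyone help me with a query?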

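【Question discussion】:

Could you add the expected output in tabular form?
What you call "chunk size" seems to be the number of chunks you want, right? (I would normally read "chunk size" as the number of items in each chunk.)
Will this be used with SQLServer or MySQL?
@DaveCosta Yes, you are right, that was misleading; I will edit.
@ChrisduPreez The ideal solution would be DB-agnostic, but it must work on at least DB2 and Oracle.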

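Tags: sql

【Solution 1】:

I think this should work on most database systems.

1) chunk is included in the field list to make the output more verbose; the same goes for the order by.

2) Using ...(10 / (num_rows +... splits the sequence into 10 chunks.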

select MIN(seq) as first_seq, MAX(seq) as last_seq, chunk
from
    /* Basic grouping formula (pseudocode):
         row_chunk_number = floor((total_num_chunks / total_num_rows) * current_row_num) + 1
       - The + 0.0 converts the field values to floats.
       - row_num starts at 0, so floor() + 1 does the rounding up; I am not sure ceil() exists on all DB systems.
    */
    (select seq, floor(((10 / (num_rows + 0.0)) + 0.0) * (row_num + 0.0)) + 1 as chunk
     from
        /* dat is a derived table of seq (id), num_rows (total number of rows) and row_num (row number). */
        (select
            seq,
            /* row_num is the row's position in the sequence range, obtained by counting all sequences smaller than the current one (assumes seq is unique and numeric). */
            (select COUNT(*) from table1 b where b.seq < a.seq) as row_num,
            /* num_rows is the number of rows in the sequence range; it lives in the inner query to avoid cluttering the math in the outer query (same performance). */
            (select COUNT(*) from table1) as num_rows
         from table1 a) dat) dat1
group by chunk
order by chunk
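
The correlated COUNT(*) above is evaluated once per row, which can get expensive on a table with millions of rows. As a rough sketch only, the same chunk formula can be fed from window functions instead. This assumes the target databases support ROW_NUMBER() and COUNT(*) OVER (), which should be verified for the DB2 and Oracle versions in use; the aliases rn, numbered and chunked are made up for illustration, and 5 is the requested number of chunks.

```sql
-- Sketch, not part of the original answer: same formula, row numbers from window functions.
-- table1 and seq come from the answer above; rn, numbered and chunked are illustrative names.
SELECT MIN(seq) AS first_seq,
       MAX(seq) AS last_seq,
       chunk
FROM (
    SELECT seq,
           -- (rn - 1) is the zero-based row position, the same quantity as row_num above
           FLOOR((rn - 1) * 5 / num_rows) + 1 AS chunk
    FROM (
        SELECT seq,
               ROW_NUMBER() OVER (ORDER BY seq) AS rn,       -- 1-based position by seq
               COUNT(*) OVER ()                 AS num_rows  -- total row count on every row
        FROM table1
    ) numbered
) chunked
GROUP BY chunk
ORDER BY chunk
```

With the ten sample rows from the question, the zero-based positions 0 to 9 map to chunks 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, so the query should return (1,2), (3,4), (5,567), (568,20000), (200016,200027), matching the expected output for 5 chunks.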

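【Discussion】:

This works! Would you mind adding some comments to the SQL to explain the logic? Some of it is a bit unintuitive (but maybe that is just me).
Added more comments to my solution. Perhaps an upvote for the usefulness of my answer?
Awesome! Upvoted and marked as the answer. Thanks!

【Solution 2】:

The NTILE function might work for Oracle (I am not sure about DB2):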

SELECT seq, ntile( 2 ) over (order by seq) chunk_num
  FROM my_table

(where 2 is the number of chunks)

Or, to get the results in the layout you described:

SELECT chunk_num, MIN(seq) first_seq, MAX(seq) last_seq FROM (
  SELECT seq, ntile( 2 ) over (order by seq) chunk_num
    FROM my_table
  ) t
  GROUP BY chunk_num

If the number of chunks does not divide the number of rows evenly, the excess rows are placed in the lower-numbered chunks.
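
To illustrate the uneven case, here is a small sketch that asks for 4 chunks over the ten sample rows from the question. It assumes those rows are stored in my_table (the table name used in this answer) on a database that supports NTILE; the alias t and the column rows_in_chunk are added for illustration.

```sql
-- Sketch: NTILE(4) over the ten sample seq values, with a per-chunk row count.
SELECT chunk_num,
       COUNT(*) AS rows_in_chunk,
       MIN(seq) AS first_seq,
       MAX(seq) AS last_seq
FROM (
    SELECT seq, NTILE(4) OVER (ORDER BY seq) AS chunk_num
    FROM my_table
) t
GROUP BY chunk_num
ORDER BY chunk_num
-- chunk_num | rows_in_chunk | first_seq | last_seq
-- 1         | 3             | 1         | 3
-- 2         | 3             | 4         | 567
-- 3         | 2             | 568       | 20000
-- 4         | 2             | 200016    | 200027
```

Ten rows split four ways leaves two extra rows, and they land in chunks 1 and 2, so chunk sizes never differ by more than one.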

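【Discussion】:

I remember using this function a long time ago and had completely forgotten about it. Great answer.
This is definitely in line with what I was looking for... if it is indeed supported by DB2. Following the suggestion to use NTILE, I found this gem with some quick googling. I will check it tomorrow when I have access to my database again and see whether it works. Thanks @DaveCosta
Unfortunately, this does not work on DB2, so although this would have been my preferred solution, @ChrisduPreez's solution is the database-agnostic one.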