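Branko's accepted solution is great (thanks!). However, I'd like to contribute an alternative that is equally performant (according to my tests) and perhaps easier to visualize.
Let's recap. The original question can perhaps be generalized as follows:
Given a map of ids and relative weights, create a query that returns a random id from the map, with a probability proportional to its relative weight.
Note the emphasis on relative weights, not percentages. As Branko points out in his answer, relative weights work for anything, including percentages.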
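To illustrate why relative weights subsume percentages, here is a minimal Python sketch (not part of the original answer; the `normalize` helper and both dictionaries are made up for illustration). It shows that scaling every weight by the same factor leaves the normalized probabilities unchanged:

```python
from fractions import Fraction

def normalize(weights):
    """Map each id to weight / sum(weights) as an exact fraction."""
    total = sum(weights.values())
    return {key: Fraction(w, total) for key, w in weights.items()}

relative = {1: 25, 2: 10, 3: 10, 4: 5}   # arbitrary relative weights (sum 50)
percent = {1: 50, 2: 20, 3: 20, 4: 10}   # the same distribution as percentages

print(normalize(relative) == normalize(percent))  # True
```

Exact fractions are used here only to make the equality check free of floating-point noise.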
Now, consider some test data, which we'll put in a temporary table:
CREATE TEMP TABLE test AS
SELECT * FROM (VALUES
(1, 25),
(2, 10),
(3, 10),
(4, 05)
) AS test(id, weight);
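Note that I'm using a more complicated example than the one in the original question: the weights do not conveniently add up to 100, and the same weight (10) is used more than once (for ids 2 and 3). The latter is important to consider, as you'll see later.
The first thing we have to do is turn the weights into probabilities from 0 to 1, which is nothing more than a simple normalization (weight / sum(weights)). The query below also accumulates those probabilities, so that each id ends up with its own interval between 0 and 1: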
WITH p AS ( -- probability
SELECT *,
weight::NUMERIC / sum(weight) OVER () AS probability
FROM test
),
cp AS ( -- cumulative probability
SELECT *,
sum(p.probability) OVER (
ORDER BY probability DESC
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS cumprobability
FROM p
)
SELECT
cp.id,
cp.weight,
cp.probability,
cp.cumprobability - cp.probability AS startprobability,
cp.cumprobability AS endprobability
FROM cp
;
This results in the following output:
id | weight | probability | startprobability | endprobability
----+--------+-------------+------------------+----------------
1 | 25 | 0.5 | 0.0 | 0.5
2 | 10 | 0.2 | 0.5 | 0.7
3 | 10 | 0.2 | 0.7 | 0.9
4 | 5 | 0.1 | 0.9 | 1.0
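The query above admittedly does more work than strictly necessary, but I find it helpful to visualize the relative probabilities this way, and it does make the final step of choosing the id trivial: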
SELECT id FROM (queryabove) AS q
CROSS JOIN (SELECT random()) AS r(val) -- draw the random number once, not once per row
WHERE r.val >= q.startprobability AND r.val < q.endprobability;
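The interval lookup can also be sketched outside the database. The following Python is only an illustration under my own assumptions (the names `build_intervals` and `pick` are hypothetical, and half-open intervals `[start, end)` are my choice so that every number in `[0, 1)` maps to exactly one id):

```python
import random
from fractions import Fraction

def build_intervals(weights):
    """Return (id, start, end) tuples covering [0, 1), heaviest weight first."""
    total = sum(weights.values())
    intervals = []
    start = Fraction(0)
    # Sort by weight descending, like ORDER BY probability DESC in the query.
    for key, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
        end = start + Fraction(weight, total)
        intervals.append((key, start, end))
        start = end
    return intervals

def pick(intervals, r=None):
    """Return the id whose half-open interval [start, end) contains r."""
    if r is None:
        r = random.random()  # a single draw in [0.0, 1.0)
    for key, start, end in intervals:
        if start <= r < end:
            return key
    raise ValueError("r must be in [0, 1)")

intervals = build_intervals({1: 25, 2: 10, 3: 10, 4: 5})
print(pick(intervals, 0.6))  # 2, because 0.5 <= 0.6 < 0.7
```

The important detail is that the random number is drawn once and then compared against every interval; drawing a fresh number per interval could match zero or several ids.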
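Now, let's put it all together with a test that checks that the query returns data with the expected distribution. We'll use generate_series() to generate a random number a million times: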
WITH p AS ( -- probability
SELECT *,
weight::NUMERIC / sum(weight) OVER () AS probability
FROM test
),
cp AS ( -- cumulative probability
SELECT *,
sum(p.probability) OVER (
ORDER BY probability DESC
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS cumprobability
FROM p
),
fp AS ( -- final probability
SELECT
cp.id,
cp.weight,
cp.probability,
cp.cumprobability - cp.probability AS startprobability,
cp.cumprobability AS endprobability
FROM cp
)
SELECT id, count(*) AS count
FROM fp
CROSS JOIN (SELECT random() FROM generate_series(1, 1000000)) AS random(val)
WHERE random.val BETWEEN fp.startprobability AND fp.endprobability
GROUP BY id
ORDER BY count DESC
;
This results in output similar to the following:
id | count
----+--------
1 | 499679
3 | 200652
2 | 199334
4 | 100335
As you can see, the counts track the expected distribution (50%, 20%, 20%, 10%) very closely.
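The same kind of sanity check can be sketched in Python. This is an illustration only, not a port of the query: the variable names are mine, the seed is arbitrary, and I use 100,000 draws instead of a million to keep the run short. The draw is scaled by the total weight rather than normalizing the weights, which is equivalent:

```python
import random
from collections import Counter

weights = {1: 25, 2: 10, 3: 10, 4: 5}
total = sum(weights.values())

# Build (id, start, end) in weight units; dividing by total gives probabilities.
intervals, start = [], 0
for key, weight in weights.items():
    intervals.append((key, start, start + weight))
    start += weight

rng = random.Random(42)  # fixed seed so the run is repeatable
draws = 100_000
counts = Counter()
for _ in range(draws):
    x = rng.random() * total  # one draw per iteration, scaled to [0, total)
    for key, lo, hi in intervals:
        if lo <= x < hi:
            counts[key] += 1
            break

for key, n in counts.most_common():
    share = n / draws
    expected = weights[key] / total
    print(f"id={key} count={n} share={share:.3f} expected={expected:.3f}")
```

Each observed share should land within a fraction of a percentage point of its expected value; the exact counts depend on the seed and the random number generator.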
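Performance
The query above is also quite performant. Even on my average machine, with PostgreSQL running in a WSL1 instance (the horror!), execution is relatively fast: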
count | time (ms)
-----------+----------
1,000 | 7
10,000 | 25
100,000 | 210
1,000,000 | 1950
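Adaptation to generate test data
I often use a variation of the query above when generating test data for unit/integration tests. The idea is to generate random data that approximates a probability distribution that tracks reality.
In that case, I find it useful to compute the start and end probabilities once and store the results in a table: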
CREATE TEMP TABLE test AS
WITH test(id, weight) AS (VALUES
(1, 25),
(2, 10),
(3, 10),
(4, 05)
),
p AS ( -- probability
SELECT *, (weight::NUMERIC / sum(weight) OVER ()) AS probability
FROM test
),
cp AS ( -- cumulative probability
SELECT *,
sum(p.probability) OVER (
ORDER BY probability DESC
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) cumprobability
FROM p
)
SELECT
cp.id,
cp.weight,
cp.probability,
cp.cumprobability - cp.probability AS startprobability,
cp.cumprobability AS endprobability
FROM cp
;
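I can then reuse these precomputed probabilities over and over, which gives extra performance and simpler usage.
I can even wrap it all in a function that I can call any time I want a random id: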
CREATE OR REPLACE FUNCTION getrandomid(p_random FLOAT8 = random())
RETURNS INT AS
$$
SELECT id
FROM test
WHERE p_random BETWEEN startprobability AND endprobability
;
$$
LANGUAGE SQL STABLE STRICT;
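For comparison, the "precompute once, look up many times" idea might look like this in Python. It is a rough sketch under my own assumptions: `make_picker` and `get_random_id` are hypothetical names, and a binary search over cumulative weights stands in for the table lookup:

```python
import random
from bisect import bisect_right
from itertools import accumulate

def make_picker(weights):
    """Precompute cumulative weights once; return a function that picks an id."""
    ids = list(weights)
    ends = list(accumulate(weights.values()))  # here: 25, 35, 45, 50
    total = ends[-1]

    def get_random_id(r=None):
        # Like the SQL default argument: draw a fresh number unless one is given.
        if r is None:
            r = random.random()
        if not 0.0 <= r < 1.0:
            raise ValueError("r must be in [0, 1)")
        # min() guards against floating-point rounding at the upper edge.
        return ids[min(bisect_right(ends, r * total), len(ids) - 1)]

    return get_random_id

get_random_id = make_picker({1: 25, 2: 10, 3: 10, 4: 5})
print(get_random_id(0.0), get_random_id(0.6), get_random_id(0.95))  # 1 2 4
```

Passing an explicit number makes the function deterministic, which is handy in tests; calling it with no argument draws a new random number each time.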
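Window function frames
It is worth noting that the technique above uses a window function with a non-default frame, ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. This is necessary to handle the fact that some weights may be repeated, which is why I chose test data with repeated weights in the first place! With the default frame (RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW), rows that tie in the ORDER BY are treated as peers and all receive the same running total, so ids 2 and 3 would both get a cumulative probability of 0.9 and therefore the same interval.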
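The difference between the two frames can be emulated in a few lines of Python. This is only a model of the semantics, not how PostgreSQL implements them, and the function names are made up. It works on the raw weights (already sorted descending); dividing each total by 50 gives the cumulative probabilities:

```python
from itertools import accumulate, groupby

# (id, weight), already sorted by weight descending.
rows = [(1, 25), (2, 10), (3, 10), (4, 5)]

def running_sum_rows(rows):
    """ROWS frame: each row adds only itself to the running total."""
    totals = accumulate(value for _, value in rows)
    return [(key, total) for (key, _), total in zip(rows, totals)]

def running_sum_range(rows):
    """RANGE frame: rows with equal sort values (peers) share one total."""
    result, total = [], 0
    for _, peers in groupby(rows, key=lambda row: row[1]):
        peers = list(peers)
        total += sum(value for _, value in peers)
        result.extend((key, total) for key, _ in peers)
    return result

print(running_sum_rows(rows))   # [(1, 25), (2, 35), (3, 45), (4, 50)]
print(running_sum_range(rows))  # [(1, 25), (2, 45), (3, 45), (4, 50)]
```

With the RANGE-style totals, ids 2 and 3 would both be assigned the interval from 0.7 to 0.9, and no id would cover 0.5 to 0.7, so the resulting distribution would be wrong.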