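Unfortunately, there is no way to do this without a cross join. Ultimately, every potential user has to be tested against every actual user.
However, Trino (formerly known as Presto SQL) executes the join in parallel across many threads and machines, so it can run very quickly given enough hardware. Note that in Trino, intermediate results are streamed from operator to operator, so this query never materializes a 10M x 1k-row "intermediate table".
For a query like this: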
SELECT potential, min_by(actual, distance), min(distance)
FROM (
    SELECT *, levenshtein_distance(potential, actual) distance
    FROM actual_users, potential_users
)
GROUP BY potential
this is the query plan:
Query Plan
----------------------------------------------------------------------------------------------------------------
Fragment 0 [SINGLE]
Output layout: [potential, min_by, min]
Output partitioning: SINGLE []
Stage Execution Strategy: UNGROUPED_EXECUTION
Output[potential, _col1, _col2]
│ Layout: [potential:varchar(5), min_by:varchar(7), min:bigint]
│ Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
│ _col1 := min_by
│ _col2 := min
└─ RemoteSource[1]
Layout: [potential:varchar(5), min_by:varchar(7), min:bigint]
Fragment 1 [HASH]
Output layout: [potential, min_by, min]
Output partitioning: SINGLE []
Stage Execution Strategy: UNGROUPED_EXECUTION
Aggregate(FINAL)[potential]
│ Layout: [potential:varchar(5), min:bigint, min_by:varchar(7)]
│ Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
│ min := min("min_1")
│ min_by := min_by("min_by_0")
└─ LocalExchange[HASH] ("potential")
│ Layout: [potential:varchar(5), min_1:bigint, min_by_0:row(boolean, boolean, bigint, varchar(7))]
│ Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
└─ RemoteSource[2]
Layout: [potential:varchar(5), min_1:bigint, min_by_0:row(boolean, boolean, bigint, varchar(7))]
Fragment 2 [SOURCE]
Output layout: [potential, min_1, min_by_0]
Output partitioning: HASH [potential]
Stage Execution Strategy: UNGROUPED_EXECUTION
Aggregate(PARTIAL)[potential]
│ Layout: [potential:varchar(5), min_1:bigint, min_by_0:row(boolean, boolean, bigint, varchar(7))]
│ min_1 := min("levenshtein_distance")
│ min_by_0 := min_by("actual", "levenshtein_distance")
└─ Project[]
│ Layout: [actual:varchar(7), potential:varchar(5), levenshtein_distance:bigint]
│ Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
│ levenshtein_distance := levenshtein_distance("potential", "actual")
└─ CrossJoin
│ Layout: [actual:varchar(7), potential:varchar(5)]
│ Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
│ Distribution: REPLICATED
├─ TableScan[memory:9, grouped = false]
│ Layout: [actual:varchar(7)]
│ Estimates: {rows: ? (?), cpu: ?, memory: 0B, network: 0B}
│ actual := 0
└─ LocalExchange[SINGLE] ()
│ Layout: [potential:varchar(5)]
│ Estimates: {rows: ? (?), cpu: ?, memory: 0B, network: ?}
└─ RemoteSource[3]
Layout: [potential:varchar(5)]
Fragment 3 [SOURCE]
Output layout: [potential]
Output partitioning: BROADCAST []
Stage Execution Strategy: UNGROUPED_EXECUTION
TableScan[memory:8, grouped = false]
Layout: [potential:varchar(5)]
Estimates: {rows: ? (?), cpu: ?, memory: 0B, network: 0B}
potential := 0
(1 row)
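In particular, in the following section of the plan, as soon as the cross join produces a row, that row is fed into the projection operator, which computes the Levenshtein distance between the two values, and then into the aggregation, which stores only one group per "potential" user. The amount of memory this query needs should therefore be low.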
Aggregate(PARTIAL)[potential]
│ Layout: [potential:varchar(5), min_1:bigint, min_by_0:row(boolean, boolean, bigint, varchar(7))]
│ min_1 := min("levenshtein_distance")
│ min_by_0 := min_by("actual", "levenshtein_distance")
└─ Project[]
│ Layout: [actual:varchar(7), potential:varchar(5), levenshtein_distance:bigint]
│ Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
│ levenshtein_distance := levenshtein_distance("potential", "actual")
└─ CrossJoin
│ Layout: [actual:varchar(7), potential:varchar(5)]
│ Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
│ Distribution: REPLICATED
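
The streaming behavior described above can be illustrated with a small single-threaded Python model. This is only a sketch of the idea, not how Trino is implemented: the function names `cross_join`, `project`, and `aggregate`, the pure-Python `levenshtein` helper, and the sample user names are all made up for illustration. It assumes the broadcast side (`potential_users`) is small enough to hold in memory while the other side is streamed, which is what the REPLICATED distribution in the plan suggests.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming, keeping only two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def cross_join(probe_rows, build_rows):
    """Yield (actual, potential) pairs one at a time.

    Only the build side is held in memory; the probe side is streamed.
    """
    build = list(build_rows)
    for actual in probe_rows:
        for potential in build:
            yield actual, potential


def project(rows):
    """Attach the distance to each joined row as it passes through."""
    for actual, potential in rows:
        yield actual, potential, levenshtein(potential, actual)


def aggregate(rows):
    """Keep a single (actual, distance) entry per potential user."""
    best = {}
    for actual, potential, distance in rows:
        current = best.get(potential)
        if current is None or distance < current[1]:
            best[potential] = (actual, distance)
    return best


# Hypothetical sample data; the probe side is an iterator to stress
# that it is consumed once, row by row.
actual_users = iter(["alice", "bob", "carol"])
potential_users = ["alicia", "rob"]

result = aggregate(project(cross_join(actual_users, potential_users)))
print(result)
```

Because the three stages are chained generators, no list of all joined pairs is ever built. The only state that grows is the `best` dictionary, which has one entry per potential user, mirroring the one-group-per-"potential" behavior of the aggregation in the plan.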