【Question title】: Fuzzy join with Levenshtein distance
【Posted】: 2021-02-01 16:36:31
【Question】:

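I have a table of usernames called potential_users (about 1,000 rows) and another called actual_users (about 10 million rows). All records consist entirely of [a-z] characters, with no whitespace. I also know that none of the potential_users appear in the actual_users table.

For each row in potential_users, I would like to find the closest record in actual_users by Levenshtein distance. For example: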

| potential_users|
|----------------|
| user1          |
| kajd           |
| bbbbb          |

| actual_users |
|--------------|
| kaj          |
| bbbbbbb      |
| user         |

would return:

| potential_users | actual_users | levenshtein_distance |
|-----------------|--------------|----------------------|
| user1           | user         | 1                    |
| kajd            | kaj          | 1                    |
| bbbbb           | bbbbbbb      | 2                    |

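If the tables were short, I could do a cross join that computes the Levenshtein distance to every record in actual_users for each record in potential_users, and then return the one with the lowest value. In my case, however, this would create an intermediate table of 1,000 x 10,000,000 rows, which is rather impractical.

Is there a better approach than a cross join for this kind of operation?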

【Comments】:

    Tags: sql hive apache-spark-sql presto trino


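    【Answer 1】:

    Unfortunately, there is no way to do this without a cross join. At the end of the day, every potential user has to be tested against every actual user.

    However, Trino (formerly known as Presto SQL) parallelizes the join across many threads and machines, so it can run very quickly given enough hardware. Note that in Trino, intermediate results are streamed from operator to operator, so this query has no "intermediate table" of 10M x 1k rows.

    For a query like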

    SELECT potential, min_by(actual, distance), min(distance)
    FROM (
        SELECT *, levenshtein_distance(potential, actual) distance
        FROM actual_users, potential_users
    )
    GROUP BY potential
    

    this is the query plan:

                                                       Query Plan                                                   
    ----------------------------------------------------------------------------------------------------------------
     Fragment 0 [SINGLE]                                                                                            
         Output layout: [potential, min_by, min]                                                                    
         Output partitioning: SINGLE []                                                                             
         Stage Execution Strategy: UNGROUPED_EXECUTION                                                              
         Output[potential, _col1, _col2]                                                                            
         │   Layout: [potential:varchar(5), min_by:varchar(7), min:bigint]                                          
         │   Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}                                                
         │   _col1 := min_by                                                                                        
         │   _col2 := min                                                                                           
         └─ RemoteSource[1]                                                                                         
                Layout: [potential:varchar(5), min_by:varchar(7), min:bigint]                                       
                                                                                                                    
     Fragment 1 [HASH]                                                                                              
         Output layout: [potential, min_by, min]                                                                    
         Output partitioning: SINGLE []                                                                             
         Stage Execution Strategy: UNGROUPED_EXECUTION                                                              
         Aggregate(FINAL)[potential]                                                                                
         │   Layout: [potential:varchar(5), min:bigint, min_by:varchar(7)]                                          
         │   Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}                                                
         │   min := min("min_1")                                                                                    
         │   min_by := min_by("min_by_0")                                                                           
         └─ LocalExchange[HASH] ("potential")                                                                       
            │   Layout: [potential:varchar(5), min_1:bigint, min_by_0:row(boolean, boolean, bigint, varchar(7))]    
            │   Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}                                             
            └─ RemoteSource[2]                                                                                      
                   Layout: [potential:varchar(5), min_1:bigint, min_by_0:row(boolean, boolean, bigint, varchar(7))] 
                                                                                                                    
     Fragment 2 [SOURCE]                                                                                            
         Output layout: [potential, min_1, min_by_0]                                                                
         Output partitioning: HASH [potential]                                                                      
         Stage Execution Strategy: UNGROUPED_EXECUTION                                                              
         Aggregate(PARTIAL)[potential]                                                                              
         │   Layout: [potential:varchar(5), min_1:bigint, min_by_0:row(boolean, boolean, bigint, varchar(7))]       
         │   min_1 := min("levenshtein_distance")                                                                   
         │   min_by_0 := min_by("actual", "levenshtein_distance")                                                   
         └─ Project[]                                                                                               
            │   Layout: [actual:varchar(7), potential:varchar(5), levenshtein_distance:bigint]                      
            │   Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}                                             
            │   levenshtein_distance := levenshtein_distance("potential", "actual")                                 
            └─ CrossJoin                                                                                            
               │   Layout: [actual:varchar(7), potential:varchar(5)]                                                
               │   Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}                                          
               │   Distribution: REPLICATED                                                                         
               ├─ TableScan[memory:9, grouped = false]                                                              
               │      Layout: [actual:varchar(7)]                                                                   
               │      Estimates: {rows: ? (?), cpu: ?, memory: 0B, network: 0B}                                     
               │      actual := 0                                                                                   
               └─ LocalExchange[SINGLE] ()                                                                          
                  │   Layout: [potential:varchar(5)]                                                                
                  │   Estimates: {rows: ? (?), cpu: ?, memory: 0B, network: ?}                                      
                  └─ RemoteSource[3]                                                                                
                         Layout: [potential:varchar(5)]                                                             
                                                                                                                    
     Fragment 3 [SOURCE]                                                                                            
         Output layout: [potential]                                                                                 
         Output partitioning: BROADCAST []                                                                          
         Stage Execution Strategy: UNGROUPED_EXECUTION                                                              
         TableScan[memory:8, grouped = false]                                                                       
             Layout: [potential:varchar(5)]                                                                         
             Estimates: {rows: ? (?), cpu: ?, memory: 0B, network: 0B}                                              
             potential := 0                                                                                         
                                                                                                                    
                                                                                                                    
    (1 row)
    
    

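    In particular, in the section of the plan excerpted below, as soon as the cross join produces a row, that row is fed into the projection operator, which computes the Levenshtein distance between the two values, and then into the aggregation, which stores only one group per potential user. The amount of memory this query needs should therefore be low.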

         Aggregate(PARTIAL)[potential]                                                                              
         │   Layout: [potential:varchar(5), min_1:bigint, min_by_0:row(boolean, boolean, bigint, varchar(7))]       
         │   min_1 := min("levenshtein_distance")                                                                   
         │   min_by_0 := min_by("actual", "levenshtein_distance")                                                   
         └─ Project[]                                                                                               
            │   Layout: [actual:varchar(7), potential:varchar(5), levenshtein_distance:bigint]                      
            │   Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}                                             
            │   levenshtein_distance := levenshtein_distance("potential", "actual")                                 
            └─ CrossJoin                                                                                            
               │   Layout: [actual:varchar(7), potential:varchar(5)]                                                
               │   Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}                                          
               │   Distribution: REPLICATED 
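
    The streaming behaviour described above can be imitated in a few lines of Python. This is only an illustrative sketch, not how Trino is implemented: the function names `levenshtein`, `cross_join` and `closest_actual` are made up for this example. It assumes the small potential_users side fits in memory (as with the BROADCAST fragment in the plan) while actual_users is consumed lazily, and it keeps just one running minimum per potential user.

```python
from typing import Dict, Iterable, Iterator, Sequence, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance computed with two rows of the dynamic-programming table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def cross_join(actual: Iterable[str],
               potential: Sequence[str]) -> Iterator[Tuple[str, str]]:
    """Lazily yield every (actual, potential) pair; nothing is materialized."""
    for a in actual:
        for p in potential:
            yield a, p


def closest_actual(actual: Iterable[str],
                   potential: Sequence[str]) -> Dict[str, Tuple[str, int]]:
    """Keep a single (best_actual, best_distance) entry per potential user."""
    best: Dict[str, Tuple[str, int]] = {}
    for a, p in cross_join(actual, potential):
        d = levenshtein(p, a)
        if p not in best or d < best[p][1]:
            best[p] = (a, d)
    return best


if __name__ == "__main__":
    # Sample data from the question; `actual` is passed as an iterator to
    # show that it is only read once, in a streaming fashion.
    actual_users = iter(["kaj", "bbbbbbb", "user"])
    potential_users = ["user1", "kajd", "bbbbb"]
    for name, (match, dist) in closest_actual(actual_users, potential_users).items():
        print(name, match, dist)
```

    The state held by `closest_actual` grows with the number of potential users (about 1,000 here), not with the 10 billion pairs that flow through it, which is the same reason the partial aggregation in the plan needs little memory. The CPU cost of evaluating every pair remains.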
    

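    【Comments】:

      【Answer 2】:

      I don't think you can do this with a simple join; there is a whole algorithm for computing it. Take a look at this post, which shows an implementation of the Levenshtein distance algorithm in SQL: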

      https://www.sqlteam.com/forums/topic.asp?TOPIC_ID=51540&whichpage=1
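
      For readers who want to see what such an algorithm involves, here is a minimal Python sketch of the classic dynamic-programming formulation of Levenshtein distance (often called Wagner-Fischer). It is not the code from the linked post, and the name `levenshtein_matrix` is chosen only for this example.

```python
def levenshtein_matrix(source: str, target: str) -> int:
    """Levenshtein distance using the full (len(source)+1) x (len(target)+1) table."""
    rows, cols = len(source) + 1, len(target) + 1
    dist = [[0] * cols for _ in range(rows)]

    # Turning a prefix into the empty string takes one deletion per character,
    # and building a prefix from the empty string takes one insertion each.
    for i in range(rows):
        dist[i][0] = i
    for j in range(cols):
        dist[0][j] = j

    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution or match
            )
    return dist[-1][-1]


if __name__ == "__main__":
    # Pairs taken from the example tables in the question.
    for a, b in [("user1", "user"), ("kajd", "kaj"), ("bbbbb", "bbbbbbb")]:
        print(a, b, levenshtein_matrix(a, b))
```

      Each call costs time proportional to the product of the two string lengths, so the per-pair work is small for short usernames; the expense in the question comes from the number of pairs, not from the algorithm itself.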

      【Comments】:
