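【Posted】: 2019-04-25 23:58:30
【Question】:
I have a situation where people ask me to group records on bad addresses. I have to work with the tools and environment I already have; I can't use the Google API or 3rd-party data science tools. I did my homework and found posts from several years ago, so I still want to check whether anything newer is available. In my scenario, people want Ids 1-6 grouped as a single address; the remaining rows are there for negative testing.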
SELECT * INTO #t FROM ( --test data: select * from #t drop table #t
SELECT 1 Id, '1 CROLANA HEIGHTS' Adr UNION -- A vs O
SELECT 2 Id, '1 CROLONA HEIGHTS' Adr union
SELECT 3 Id, '1 CROLONA HEIGHT DRIVE' Adr union
SELECT 4 Id,'1 CROLONA HEIGHTS DR' Adr union
SELECT 5 Id, '1 CROLONA HGHTS DR' Adr union
SELECT 6 Id, '1 CROLONA HTS DR' Adr UNION
---------------------------------------- rest should not match
SELECT 7 Id, '1 CORWING DR' Adr UNION
SELECT 8 Id, '1 SUNNYHILL DRIVE' Adr UNION
SELECT 9 Id, '1 CROWN HILL DR' Adr UNION
SELECT 10 Id, '1 ADDISON DRv' Adr ) a
------------------- below is my working fuzzy-matching script, which can probably be improved
SELECT id, adr, LEAD(adr,1) OVER ( ORDER BY adr ) adr_lead,
SOUNDEX(adr) Sdx, DIFFERENCE(adr, LEAD(adr,1) OVER ( ORDER BY adr )) diff
--- SOUNDEX(adr), COUNT(*) c
FROM #t
--GROUP BY SOUNDEX(adr)
WHERE SOUNDEX(adr) = SOUNDEX('1 CROLANA HEIGHTS')
【Comments】:
-
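One approach might be to first use several REPLACE calls to change all variants of HEIGHTS into a single form, and do the same for DRIVE. Cleaning up this mess will take a few hours, but I'm fairly sure it will at least reduce the problem considerably. For the fuzzy search, I'd suggest splitting each address into fragments and comparing them one by one.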
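The suggestion above could be sketched roughly as follows. This is only an illustration under several assumptions: it reuses the #t table from the question, the abbreviation list (HEIGHTS, HEIGHT, HGHTS, DRIVE) is a hypothetical starting point rather than a complete one, and the column names NormAdr, HouseNo, StreetWord and GroupKey are made up for the example. Each address is padded with spaces so that only whole words are replaced, then split into a house number and the first word of the street name, and SOUNDEX is applied to that word alone instead of to the full string.

```sql
-- Sketch only: assumes the #t table from the question already exists.
-- The abbreviation list is illustrative and will need extending for real data.
;WITH padded AS (
    -- Pad with spaces so each REPLACE matches whole words only.
    SELECT Id, Adr, ' ' + UPPER(Adr) + ' ' AS P
    FROM #t
),
normalized AS (
    SELECT Id, Adr,
           LTRIM(RTRIM(
               REPLACE(REPLACE(REPLACE(REPLACE(P,
                   ' HEIGHTS ', ' HTS '),
                   ' HEIGHT ',  ' HTS '),
                   ' HGHTS ',   ' HTS '),
                   ' DRIVE ',   ' DR ')
           )) AS NormAdr
    FROM padded
)
SELECT n.Id,
       n.Adr,
       n.NormAdr,
       h.HouseNo,
       s.StreetWord,
       -- Hypothetical grouping key: exact house number + phonetic code of the street word.
       h.HouseNo + '|' + SOUNDEX(s.StreetWord) AS GroupKey
FROM normalized n
CROSS APPLY (
    -- Fragment 1: everything before the first space; Rest: everything after it.
    SELECT LEFT(n.NormAdr, CHARINDEX(' ', n.NormAdr + ' ') - 1) AS HouseNo,
           SUBSTRING(n.NormAdr, CHARINDEX(' ', n.NormAdr + ' ') + 1, LEN(n.NormAdr)) AS Rest
) h
CROSS APPLY (
    -- Fragment 2: first word of the remaining street part.
    SELECT LEFT(h.Rest, CHARINDEX(' ', h.Rest + ' ') - 1) AS StreetWord
) s
ORDER BY GroupKey, n.Id;
```

The intent is that Ids 1 to 6 end up sharing one GroupKey while the negative-test rows get keys of their own, but that should be checked against real data. Comparing only the first street word is a deliberate simplification; the remaining fragments (HTS, DR and so on) could be compared in the same way, for example with DIFFERENCE, if the key turns out to be too coarse.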
Tags: tsql fuzzy-comparison