【Question title】: Eliminate duplicate cities from database
【Posted】: 2011-04-28 10:08:35
【Question description】:

Background

There are over 5,300 duplicate rows:

"id","latitude","longitude","country","region","city"
"2143220","41.3513889","68.9444444","KZ","10","Abay"
"2143218","40.8991667","68.5433333","KZ","10","Abay"
"1919381","33.8166667","49.6333333","IR","34","Ab Barik"
"1919377","35.6833333","50.1833333","IR","19","Ab Barik"
"1919432","29.55","55.5122222","IR","29","`Abbasabad"
"1919430","27.4263889","57.5725","IR","29","`Abbasabad"
"1919413","28.0011111","58.9005556","IR","12","`Abbasabad"
"1919435","36.5641667","61.14","IR","30","`Abbasabad"
"1919433","31.8988889","58.9211111","IR","30","`Abbasabad"
"1919422","33.8666667","48.3","IR","23","`Abbasabad"
"1919420","33.4658333","49.6219444","IR","23","`Abbasabad"
"1919438","33.5333333","49.9833333","IR","34","`Abbasabad"
"1919423","33.7619444","49.0747222","IR","24","`Abbasabad"
"1919419","34.2833333","49.2333333","IR","19","`Abbasabad"
"1919439","35.8833333","52.15","IR","35","`Abbasabad"
"1919417","35.9333333","52.95","IR","17","`Abbasabad"
"1919427","35.7341667","51.4377778","IR","26","`Abbasabad"
"1919425","35.1386111","51.6283333","IR","26","`Abbasabad"
"1919713","30.3705556","56.07","IR","29","`Abdolabad"
"1919711","27.9833333","57.7244444","IR","29","`Abdolabad"
"1919716","35.6025","59.2322222","IR","30","`Abdolabad"
"1919714","34.2197222","56.5447222","IR","30","`Abdolabad"

Other details:

  • PostgreSQL 8.4 database
  • Linux

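Problem

Some values are obvious duplicates ("Abay" because the regions match, and "Ab Barik" because the two locations are in such close proximity), while others are less obvious (and might not even be actual duplicates):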

"1919430","27.4263889","57.5725","IR","29","`Abbasabad"
"1919435","36.5641667","61.14","IR","30","`Abbasabad"

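The goal is to eliminate all duplicates.

Question

Given a table of values, such as the CSV data above:

  • How would you eliminate the duplicates?
  • Which geo-centric PostgreSQL functions would you use?
  • What other criteria would you use to weed out duplicates?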

Update

Semi-working example code to select duplicate city names within the same country that are in close proximity (within 10 km):

select
  c1.country, c1.name, c1.region_id, c2.region_id, c1.latitude_decimal, c1.longitude_decimal, c2.latitude_decimal, c2.longitude_decimal
from
  climate.maxmind_city c1,
  climate.maxmind_city c2
where
  c1.country = 'BE' and
  c1.id <> c2.id and
  c1.country = c2.country and
  c1.name = c2.name and
  (c1.latitude_decimal <> c2.latitude_decimal or c1.longitude_decimal <> c2.longitude_decimal) and
  -- earth_distance() returns meters with the default earth() radius, so 10 km is 10000
  earth_distance(
    ll_to_earth( c1.latitude_decimal, c1.longitude_decimal ),
    ll_to_earth( c2.latitude_decimal, c2.longitude_decimal ) ) <= 10000
order by
  country, name
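
As a sanity check on the distance threshold, the great-circle distance that earth_distance() approximates can be reproduced outside the database. The following is a minimal Python sketch and not part of the original query: the function name great_circle_m is hypothetical, and the radius of 6378168 m is assumed to match the default earth() value of the earthdistance module, so results come out in meters.

```python
import math

# Assumed to match the default earth() radius of the earthdistance module (meters).
EARTH_RADIUS_M = 6378168.0


def great_circle_m(lat1, lon1, lat2, lon2, radius_m=EARTH_RADIUS_M):
    """Haversine great-circle distance in meters between two lat/lon points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # min() guards against rounding pushing the square root slightly above 1.
    return 2 * radius_m * math.asin(min(1.0, math.sqrt(a)))


# The two "Abay" rows from the CSV sample in the question.
abay_m = great_circle_m(41.3513889, 68.9444444, 40.8991667, 68.5433333)
print(f"Abay pair: {abay_m / 1000:.1f} km apart")
```

Comparing the printed distance with the threshold in the query shows whether a given pair would be matched.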

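Ideas

A two-phase approach:

  1. Eliminate the obvious duplicates (same country, region, and city name) by deleting the min(id).
  2. Eliminate rows that are in close proximity to each other and have the same name and country. This might remove some legitimate cities, but that is of hardly any consequence.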
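
The two phases above can be sketched outside the database as follows. This is only an illustration in Python under stated assumptions: the helper names (great_circle_km, phase1, phase2) are hypothetical, "deleting the min(id)" is read as keeping the highest id in each group, and the 50 km threshold is an arbitrary value chosen for the demonstration, not one taken from the question.

```python
import math


def great_circle_km(lat1, lon1, lat2, lon2, radius_km=6378.168):
    """Haversine distance in km; the radius is an assumption (earthdistance default)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius_km * math.asin(min(1.0, math.sqrt(a)))


# A few rows from the question's CSV: (id, latitude, longitude, country, region, city)
ROWS = [
    (2143220, 41.3513889, 68.9444444, "KZ", "10", "Abay"),
    (2143218, 40.8991667, 68.5433333, "KZ", "10", "Abay"),
    (1919381, 33.8166667, 49.6333333, "IR", "34", "Ab Barik"),
    (1919377, 35.6833333, 50.1833333, "IR", "19", "Ab Barik"),
    (1919438, 33.5333333, 49.9833333, "IR", "34", "`Abbasabad"),
    (1919420, 33.4658333, 49.6219444, "IR", "23", "`Abbasabad"),
]


def phase1(rows):
    """Phase 1: keep only the highest id per (country, region, city)."""
    best = {}
    for row in rows:
        key = (row[3], row[4], row[5])
        if key not in best or row[0] > best[key][0]:
            best[key] = row
    return list(best.values())


def phase2(rows, threshold_km):
    """Phase 2: drop a row if a kept row with the same country and city is within the threshold."""
    kept = []
    # Visit higher ids first so that the higher id of a close pair survives.
    for row in sorted(rows, key=lambda r: r[0], reverse=True):
        too_close = any(
            k[3] == row[3] and k[5] == row[5]
            and great_circle_km(row[1], row[2], k[1], k[2]) <= threshold_km
            for k in kept
        )
        if not too_close:
            kept.append(row)
    return kept


survivors = phase2(phase1(ROWS), threshold_km=50.0)
print(sorted(row[0] for row in survivors))
```

Phase 2 is greedy, so which row of a close pair survives depends on the visiting order; that matches the remark that losing a few legitimate cities is acceptable here.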

Thank you!

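【Question discussion】:

  • Considering "Ab Barik", how do you know that the latitude and longitude values are more reliable than the region values?
  • @Catcall: I don't. However, when cities with the same name are close to each other, it does not matter (for my purposes) which one gets deleted. One issue is determining when they are far enough apart to be considered different cities. One of the geographic functions that PostgreSQL provides should be used to compare the distances.
  • Be careful: Kansas City, Kansas and Kansas City, Missouri have the same name, are in the same country, and are very, very close to each other.
  • @Mark: Thanks, Mark. I put Kansas City, MO back. There might be a few others like it that have been eliminated. For my purposes, though, it is not a big problem.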

Tags: sql postgresql geolocation country city


【Solution 1】:

Finding the duplicates is easy:

select
  max(id) as this_should_stay,
  latitude,
  longitude,
  country,
  region,
  city
FROM
  your_table
group by
  latitude,
  longitude,
  country,
  region,
  city
having count(*) > 1;

Building on that, the code to delete the duplicates is just as easy:

delete from your_table where id not in (
    select
      max(id) as this_should_stay
    FROM
      your_table
    group by
      latitude,
      longitude,
      country,
      region,
      city
)

Note the absence of HAVING in the delete query.

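【Discussion】:

  • Thanks. Wouldn't that only find exact duplicates? From the data I am looking at, the Abay rows have different latitude and longitude values.
  • Sure, but I think removing 2 fields from the query is a very simple operation :)
  • I don't think I phrased the question well enough. Some rows are duplicates because their latitudes and longitudes are very close, not identical. To determine their proximity (e.g. Abbasabad), I need to use some geographic distance function to find the duplicates. If they are not close enough (using a set threshold), then they are probably different cities. I am not really looking for the SQL code (as you point out, the code is simple), but for problems I might run into when trying to remove the duplicates.

【Solution 2】:

This deletes the second city that is in close proximity to a city with the same name in the same country: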

delete from climate.maxmind_city mc where id in (
select
  max(c1.id)
from
  climate.maxmind_city c1,
  climate.maxmind_city c2
where
  c1.id <> c2.id and
  c1.country = c2.country and
  c1.name = c2.name and
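  -- the threshold is in the units of earth(): meters by default, so 35 means 35 m unless earth() was redefined
  earth_distance(
    ll_to_earth( c1.latitude_decimal, c1.longitude_decimal ),
    ll_to_earth( c2.latitude_decimal, c2.longitude_decimal ) ) <= 35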
group by
  c1.country, c1.name
order by
  c1.country, c1.name
)
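
One thing to watch for: because the subquery groups by country and name and selects a single max(id), each run deletes at most one row per name group, so a cluster of three or more near-duplicates needs repeated runs. The sketch below illustrates this with Python's sqlite3 module standing in for PostgreSQL. Everything in it is an assumption made for the demonstration: the table city, the SQL function distance_m (a haversine function registered in place of earth_distance()), and the four sample rows, which are hypothetical and not taken from the question's data.

```python
import math
import sqlite3


def distance_m(lat1, lon1, lat2, lon2, radius_m=6378168.0):
    """Haversine distance in meters; stands in for earth_distance(ll_to_earth(), ll_to_earth())."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius_m * math.asin(min(1.0, math.sqrt(a)))


conn = sqlite3.connect(":memory:")
conn.create_function("distance_m", 4, distance_m)
conn.execute(
    "create table city (id integer primary key, latitude real, longitude real,"
    " country text, name text)"
)
# Hypothetical rows: ids 1-3 lie roughly 11 m apart from one another, id 4 is far away.
conn.executemany(
    "insert into city values (?, ?, ?, ?, ?)",
    [
        (1, 50.0000, 4.0, "BE", "Testville"),
        (2, 50.0001, 4.0, "BE", "Testville"),
        (3, 50.0002, 4.0, "BE", "Testville"),
        (4, 51.0000, 5.0, "BE", "Testville"),
    ],
)

# Same shape as the delete above, with the threshold passed as a parameter.
DELETE_SQL = """
delete from city where id in (
  select max(c1.id)
  from city c1, city c2
  where c1.id <> c2.id
    and c1.country = c2.country
    and c1.name = c2.name
    and distance_m(c1.latitude, c1.longitude, c2.latitude, c2.longitude) <= ?
  group by c1.country, c1.name
)
"""


def dedupe(connection, threshold_m):
    """Re-run the delete until it removes nothing; return the number of deleting passes."""
    passes = 0
    while connection.execute(DELETE_SQL, (threshold_m,)).rowcount > 0:
        passes += 1
    return passes


passes = dedupe(conn, 35)
remaining = [row[0] for row in conn.execute("select id from city order by id")]
print(passes, remaining)
```

In this toy data the cluster of three collapses to one row only after the second pass, which is why the loop repeats the statement until no rows are affected.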

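【Discussion】:

【Solution 3】:

If your data is imported from a CSV file by code (PHP), you can prevent duplicate entries by adding a condition to the PHP code. If the city you are inserting already exists, the loop skips the current record and continues with the next one.

If you import the data into your database this way, give this a try.

Thanks.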
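
The skip-on-import idea could look roughly like the sketch below. It is written in Python rather than PHP purely for illustration, and it makes its own assumptions: the function name import_unique is hypothetical, "already exists" is interpreted as an exact match on (country, region, city), and the first record seen is the one that is kept.

```python
import csv
import io

# The header and four rows from the question's CSV sample.
CSV_TEXT = '''"id","latitude","longitude","country","region","city"
"2143220","41.3513889","68.9444444","KZ","10","Abay"
"2143218","40.8991667","68.5433333","KZ","10","Abay"
"1919381","33.8166667","49.6333333","IR","34","Ab Barik"
"1919377","35.6833333","50.1833333","IR","19","Ab Barik"
'''


def import_unique(csv_file):
    """Return the records to insert, skipping any (country, region, city) already seen."""
    seen = set()
    imported = []
    for record in csv.DictReader(csv_file):
        key = (record["country"], record["region"], record["city"])
        if key in seen:
            continue  # the city already exists: skip this record and move on
        seen.add(key)
        imported.append(record)
    return imported


imported = import_unique(io.StringIO(CSV_TEXT))
print([record["id"] for record in imported])
```

An exact-match key like this only catches the obvious duplicates; near-duplicates that differ in region or coordinates would still need a distance check.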

【Discussion】:
