【Question title】: Calculating the minimum haversine distance for a set of coordinates
【Posted】: 2022-10-21 04:43:21
【Question】:

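I am trying to find an efficient way to calculate the distance from each point in a set of (latitude, longitude) coordinates to its nearest neighbour: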

[[51.51045038114607, -0.1393407528617875],
[51.5084300350736, -0.1261805976142865],
[51.37912856172232, -0.1038613174724213]]

I previously had a working (I thought!) piece of code that used sklearn's NearestNeighbors to reduce the algorithmic complexity of this task:

from sklearn.neighbors import NearestNeighbors
from sklearn.metrics.pairwise import haversine_distances
from math import sin, cos, sqrt, atan2, radians
import numpy as np  # used by _haversine_distance (defined further down)

# coordinates
coords = [[51.51045038114607, -0.1393407528617875],
          [51.5084300350736, -0.1261805976142865],
          [51.37912856172232, -0.1038613174724213]]

# tree method that reduces algorithmic complexity from O(n^2) to O(Nlog(N))
nbrs = NearestNeighbors(n_neighbors=2,
                        metric=_haversine_distance
                        ).fit(coords)

distances, indices = nbrs.kneighbors(coords)

# the outputted distances
result = distances[:, 1]

The output is:

array([ 1.48095104,  1.48095104, 14.59484348])

It uses my own version of the haversine distance as the distance metric:

def _haversine_distance(p1, p2):
    """
    p1: array of two floats, the first point
    p2: array of two floats, the second point

    return: Returns a float value, the haversine distance

    """
    lon1, lat1 = p1
    lon2, lat2 = p2

    # convert decimal degrees to radians
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])

    # get the deltas
    dlon = lon2 - lon1
    dlat = lat2 - lat1

    # haversine formula
    a = np.sin(dlat/2)**2 + (np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2)
    c = 2 * np.arcsin(np.sqrt(a))

    # approximate radius of earth in km
    R = 6373.0

    # convert to km distance
    distance = R * c

    return distance

These distances are wrong. My first question is: why? Is there a way to correct this while keeping the algorithmic simplicity of the NearestNeighbors approach?
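One likely cause, sketched here under stated assumptions: the coordinate rows are stored as [lat, lon], but the posted function unpacks them as `lon1, lat1 = p1`, so latitude and longitude are swapped inside the formula. The snippet below uses only the standard library; `haversine_km`, `metric_as_posted` and `metric_fixed` are illustrative names, not part of the original code. It reproduces the posted value of about 1.48 km for the first pair and shows what the same formula gives once the unpacking order matches the data.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6373.0  # same approximate radius as the question's code


def haversine_km(lat1, lon1, lat2, lon2):
    """Haversine distance in km between two points given in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def metric_as_posted(p1, p2):
    # Unpacks [lat, lon] rows as (lon, lat), as the question's function does.
    lon1, lat1 = p1
    lon2, lat2 = p2
    return haversine_km(lat1, lon1, lat2, lon2)


def metric_fixed(p1, p2):
    # Unpacks in the same (lat, lon) order that the coords list uses.
    lat1, lon1 = p1
    lat2, lon2 = p2
    return haversine_km(lat1, lon1, lat2, lon2)


p1 = [51.51045038114607, -0.1393407528617875]
p2 = [51.5084300350736, -0.1261805976142865]

wrong_km = metric_as_posted(p1, p2)  # about 1.48, the value in the posted output
right_km = metric_fixed(p1, p2)      # about 0.94
```

A callable with the corrected unpacking order could be passed to `NearestNeighbors(metric=...)` exactly as before, although a Python callable metric is evaluated pair by pair and may be slow on large inputs.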

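I then found that I can get the right answer with the geopy.distance method, but it has no built-in technique for reducing the complexity, and therefore the computation time: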

import geopy.distance

coords_1 = (51.51045038, -0.13934075)
coords_2 = (51.50843004, -0.1261806)

geopy.distance.geodesic(coords_1, coords_2).km

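My second question is: is there an implementation of this method that reduces the complexity? Otherwise I will be forced to use nested for loops to check the distance between every point and all the others.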
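One possible way to keep the tree-based lookup is scikit-learn's built-in `haversine` metric, which expects rows of [lat, lon] in radians and returns distances in radians on the unit sphere. The following is a minimal sketch, not a definitive implementation: the mean Earth radius of 6371 km is an assumption, and the results are great-circle distances, so they will differ slightly from `geopy.distance.geodesic`, which uses an ellipsoidal model.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius

coords = np.array([[51.51045038114607, -0.1393407528617875],
                   [51.5084300350736, -0.1261805976142865],
                   [51.37912856172232, -0.1038613174724213]])

# The built-in haversine metric expects [lat, lon] in radians.
coords_rad = np.radians(coords)

nbrs = NearestNeighbors(n_neighbors=2,
                        algorithm="ball_tree",
                        metric="haversine").fit(coords_rad)
distances, indices = nbrs.kneighbors(coords_rad)

# Assuming no duplicate points, column 0 is each point itself (distance 0)
# and column 1 is its nearest other point.
nearest_km = distances[:, 1] * EARTH_RADIUS_KM
nearest_idx = indices[:, 1]
```

If geodesic accuracy matters, one option is to use the tree only to shortlist a few candidates per point and then recompute those few distances with geopy.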

Any help is appreciated!

Related question: Vectorised Haversine formula with a pandas dataframe

【Comments】:

    Tags: python scikit-learn geopy haversine


    【Solution 1】:

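    Depending on the size of your dataset, it may be more efficient to convert the data to a Pandas DataFrame. Note that the self-merge used here builds every pair of points, so the intermediate frame has n × n rows.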

    import pandas as pd
    import numpy as np
    
    def haversine(lon1, lat1, lon2, lat2):
        lon1, lat1, lon2, lat2 = np.radians([lon1, lat1, lon2, lat2])
        dlon = lon2 - lon1
        dlat = lat2 - lat1
    
        haver_formula = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    
        r = 6371  # Earth radius in km; use 3958.756 for miles
        dist = 2 * r * np.arcsin(np.sqrt(haver_formula))
        return pd.Series(dist)
    
    #added random id number
    df = pd.DataFrame({'id':[123,456,789],
                       'lat':[51.51045038114607, 51.5084300350736, 51.37912856172232],
                       'lon':[-0.1393407528617875, -0.1261805976142865, -0.1038613174724213]})
    
    >>> df
        id        lat       lon
    0  123  51.510450 -0.139341
    1  456  51.508430 -0.126181
    2  789  51.379129 -0.103861
    
    #self merging (faster than iterating through rows in larger datasets)
    df2 = pd.merge(df.assign(key=1),df.assign(key=1), on='key', suffixes=('', '_2')).drop('key', axis=1)
    
    
    >>> df2
        id        lat       lon  id_2      lat_2     lon_2
    0  123  51.510450 -0.139341   123  51.510450 -0.139341
    1  123  51.510450 -0.139341   456  51.508430 -0.126181
    2  123  51.510450 -0.139341   789  51.379129 -0.103861
    3  456  51.508430 -0.126181   123  51.510450 -0.139341
    4  456  51.508430 -0.126181   456  51.508430 -0.126181
    5  456  51.508430 -0.126181   789  51.379129 -0.103861
    6  789  51.379129 -0.103861   123  51.510450 -0.139341
    7  789  51.379129 -0.103861   456  51.508430 -0.126181
    8  789  51.379129 -0.103861   789  51.379129 -0.103861
    
    #drop self-pairs (rows where a point is matched with itself)
    df2 = df2[df2['id']!=df2['id_2']].reset_index(drop=True)
    
    >>> df2
        id        lat       lon  id_2      lat_2     lon_2
    0  123  51.510450 -0.139341   456  51.508430 -0.126181
    1  123  51.510450 -0.139341   789  51.379129 -0.103861
    2  456  51.508430 -0.126181   123  51.510450 -0.139341
    3  456  51.508430 -0.126181   789  51.379129 -0.103861
    4  789  51.379129 -0.103861   123  51.510450 -0.139341
    5  789  51.379129 -0.103861   456  51.508430 -0.126181
    
    #find distance
    df2['dist'] = haversine(df2['lon'], df2['lat'], df2['lon_2'], df2['lat_2'])
    
    >>> df2
        id        lat       lon  id_2      lat_2     lon_2       dist
    0  123  51.510450 -0.139341   456  51.508430 -0.126181   0.938061
    1  123  51.510450 -0.139341   789  51.379129 -0.103861  14.807897
    2  456  51.508430 -0.126181   123  51.510450 -0.139341   0.938061
    3  456  51.508430 -0.126181   789  51.379129 -0.103861  14.460639
    4  789  51.379129 -0.103861   123  51.510450 -0.139341  14.807897
    5  789  51.379129 -0.103861   456  51.508430 -0.126181  14.460639
    
    #closest neighbor
    >>> df2[['id', 'lat', 'lon', 'dist']].sort_values(['id', 'dist']).groupby('id').first().reset_index()
        id        lat       lon       dist
    0  123  51.510450 -0.139341   0.938061
    1  456  51.508430 -0.126181   0.938061
    2  789  51.379129 -0.103861  14.460639
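As a possible alternative to the self-merge, the same all-pairs computation can be sketched with NumPy broadcasting, which avoids building the intermediate DataFrame. This is only an illustration: `pairwise_haversine_km` is a hypothetical helper name, it reuses the 6371 km radius from the answer, and it is still O(n²) in time and memory, so it suits small to medium inputs.

```python
import numpy as np


def pairwise_haversine_km(lat_deg, lon_deg, radius_km=6371.0):
    """Return the full n x n matrix of haversine distances in km."""
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    lon = np.radians(np.asarray(lon_deg, dtype=float))
    # Broadcasting a column against a row yields every pairwise difference.
    dlat = lat[:, None] - lat[None, :]
    dlon = lon[:, None] - lon[None, :]
    a = (np.sin(dlat / 2) ** 2
         + np.cos(lat)[:, None] * np.cos(lat)[None, :] * np.sin(dlon / 2) ** 2)
    return 2 * radius_km * np.arcsin(np.sqrt(a))


lat = [51.51045038114607, 51.5084300350736, 51.37912856172232]
lon = [-0.1393407528617875, -0.1261805976142865, -0.1038613174724213]

dist = pairwise_haversine_km(lat, lon)
np.fill_diagonal(dist, np.inf)     # ignore each point's zero distance to itself
nearest_idx = dist.argmin(axis=1)  # row index of the closest other point
nearest_km = dist.min(axis=1)      # distance to that point in km
```

For the three sample points this gives the same nearest-neighbour distances as the table above.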
    

    【Comments】:
