Using list comprehensions for the haversine formula

【Question title】: Using list comprehensions for haversine formula
【Posted】: 2021-08-31 17:54:18
【Question】:

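I have two data frames: one of house locations and one of restaurant locations, both with coordinates in latitude/longitude. I need to create a new column that holds the distance between them. For example, if I have a list of 5 house locations, the expected result would be 5 distance calculations for each restaurant (25 values). df1 is the locations, df2 is the restaurants.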

My distance calculation is below, though I have changed it a few times:

Version 1:

def distance(location, restaurant): 
    lat1, lon1 = location
    lat2, lon2 = restaurant
    radius = 6371 * 1000  # Earth radius in metres (6371 km * 1000)
    dlat = math.radians(lat2-lat1)
    dlon = math.radians(lon2-lon1)
    a = math.sin(dlat/2) * math.sin(dlat/2) + math.cos(math.radians(lat1)) \
        * math.cos(math.radians(lat2)) * math.sin(dlon/2) * math.sin(dlon/2)
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1-a))
    d = radius * c
    return d

Version 2:

def haversine(lat1, lon1, lat2, lon2):
    radius = 6371 
    dlat = math.radians(lat2-lat1)
    dlon = math.radians(lon2-lon1)
    a = math.sin(dlat/2) * math.sin(dlat/2) + math.cos(math.radians(lat1)) \
        * math.cos(math.radians(lat2)) * math.sin(dlon/2) * math.sin(dlon/2)
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1-a))
    d = radius * c
    return d

I tried writing a loop, but it returns a 'Series object is not callable' error:

ll = [] 
for index,rows in df2.iterrows():
        lat1 = rows['Latitude']
        lon1 = rows['Longitude']
        for i,r in df1.iterrows():
                dist = distance((lat1,lon1),(r['Latitude'],r['Longitude']))
                ll.append(rows(float(dist)))

Then I tried using a list comprehension, in two different ways:

df1['result'] = df1.apply(lambda x: float(haversine(df1['Latitude'], df1['Longitude'], df2['Latitude'], df2['Longitude']), axis=1))

The first one returns the error 'cannot convert the series to

The second one does not give me the result I want:

Dist = []
for w, x, y, z in zip(df1['Latitude'], df2['Longitude'], df2['Latitude'], df2['Longitude']):
    Dist.extend([distance((w,x),(y,z))])
print(Dist)

output: [515.38848499753, 54.26312420254462, 10.563518031233743, 374.5045129388741, 451.6737920301973]

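What is the correct way to do this? Eventually I will have to scale this to 100,000 houses and 2,480 restaurants. Unfortunately, I do not have permission to share the data.

【Comments】: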

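When you scale up, the result will have 248 million entries. You should probably find a way to optimize it, perhaps by working per street or neighborhood rather than computing the distance for every house.
ll.append(rows(float(dist))) should be ll.append(dist). Why are you trying to use rows as a function?
@Barmar My mistake, I can remove that typo. And I agree; however, I am not in charge of this project, I am just trying to complete what was asked of me.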
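Following the comment above, the nested loop can be written as a single list comprehension that yields one distance per (restaurant, house) pair. This is a minimal sketch using only the standard library; the coordinates and the names `haversine_km`, `houses`, `restaurants` and `pairs` are illustrative assumptions, not the asker's data.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    radius = 6371  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # Guard against tiny floating-point overshoot before taking the arcsine.
    return 2 * radius * math.asin(min(1.0, math.sqrt(a)))


# Hypothetical (lat, lon) pairs. With DataFrames these could be built with
# list(zip(df1['Latitude'], df1['Longitude'])) and likewise for df2.
houses = [(-14.345234, 145.632423), (-12.456345, 143.653535), (-20.111111, 147.111111),
          (-15.222222, 146.222222), (-16.111111, 148.111111)]
restaurants = [(-14.0, 145.0), (-13.0, 144.0), (-19.0, 147.0), (-15.0, 146.0), (-16.0, 148.0)]

# One entry per (restaurant, house) pair: 5 x 5 = 25 distances.
pairs = [
    (r_idx, h_idx, haversine_km(r_lat, r_lon, h_lat, h_lon))
    for r_idx, (r_lat, r_lon) in enumerate(restaurants)
    for h_idx, (h_lat, h_lon) in enumerate(houses)
]
print(len(pairs))  # 25
```

Keeping the two indices next to each distance makes it possible to tell which pair a value belongs to. This approach is fine for small inputs, but a pure-Python loop over 248 million pairs is likely to be slow.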

Tags: python pandas dataframe list-comprehension haversine

【Solution 1】:

You can use vectorized operations, which run faster. Here is a snippet that takes two arrays of shape n×2 and m×2, i.e. holding n and m locations:

import numpy as np
from sklearn.metrics.pairwise import haversine_distances


def haversine(locations1, locations2):
    locations1 = np.deg2rad(locations1)
    locations2 = np.deg2rad(locations2)
    return haversine_distances(locations1, locations2) * 6371000

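With your sizes, it runs in 10 seconds on my machine.

【Discussion】:

I am not able to import this package on my computer.
This is a more robust solution, and it may run faster if you plan to scale this script to larger dataframes. If possible, I suggest installing scikit-learn in a virtual environment.
@PreciXon I just got permission to use this package and will go ahead with this option, but I am still not seeing the correct output; I feel like I am missing something.
Try changing haversine_distances(locations1, locations2) to haversine_distances([locations1, locations2])
@amanda I just tried this code and it gave me an error saying the input dimensions were wrong. I am fairly sure your dataset has missing values.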
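For readers who cannot install scikit-learn, the same n×m distance matrix can be sketched with NumPy broadcasting alone. This is an illustrative alternative, not the answerer's code; the function name `pairwise_haversine_m` and the sample coordinates are assumptions. As in Solution 1, each row is expected to be [latitude, longitude] in degrees.

```python
import numpy as np


def pairwise_haversine_m(locations1, locations2):
    """Return an (n, m) matrix of great-circle distances in metres.

    locations1: array-like of shape (n, 2), columns [latitude, longitude] in degrees
    locations2: array-like of shape (m, 2), same layout
    """
    loc1 = np.deg2rad(np.asarray(locations1, dtype=float))
    loc2 = np.deg2rad(np.asarray(locations2, dtype=float))
    lat1 = loc1[:, 0][:, None]  # shape (n, 1)
    lon1 = loc1[:, 1][:, None]
    lat2 = loc2[:, 0][None, :]  # shape (1, m)
    lon2 = loc2[:, 1][None, :]
    # Broadcasting (n, 1) against (1, m) produces every pairwise combination.
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000 * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))


# Hypothetical sample data: 3 houses and 2 restaurants.
houses = np.array([[-14.345234, 145.632423],
                   [-12.456345, 143.653535],
                   [-20.111111, 147.111111]])
restaurants = np.array([[-14.345234, 145.632423],
                        [-15.222222, 146.222222]])

dist = pairwise_haversine_m(houses, restaurants)
print(dist.shape)  # (3, 2)
```

At the full size mentioned in the question, the result alone holds 100,000 × 2,480 = 248,000,000 float64 values, roughly 2 GB, and the intermediate arrays are the same size. Processing the houses in chunks of rows would probably be needed on a machine with limited memory.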

【Solution 2】:

You first have to convert both data frames to float:

df1 = df1.astype(float)
df2 = df2.astype(float)

The first approach you tried should now work. If not, here is a snippet that does:

distances, empty_values_indexes = [], []
for i in range(len(df1['Latitude'])):
    try:
        # Row-by-row distance between df1 row i and df2 row i (label-based lookup).
        d = haversine(df1['Latitude'][i], df1['Longitude'][i], df2['Latitude'][i], df2['Longitude'][i])
        distances.append(d)
    except KeyError as e:
        print(f"Encountered KeyError in the {i}'th iteration, appending 0 to list")
        distances.append(0)
        empty_values_indexes.append(i)
    except Exception as e:
        print(f"Encountered a different error message - \n{str(e)}")
        distances.append(0)  # keep the list the same length as df1

df1['results'] = distances
print(f"Indexes of empty Values: {empty_values_indexes}")

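【Discussion】:

I like the second snippet you provided, but it returns 'KeyError: 5'. I also got my first approach working, but it returns the distances as a list, and I want them returned as another column in df1. Suggestions?
The last line of the snippet, df1['results'] = results, should take a list as its value. As for the KeyError, I have not run into it. It may be that your dataframe has empty values. I will update the snippet to handle such cases.

【Solution 3】: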

import pandas as pd
import numpy as np


# haversine formula
def haversine(lat1, lon1, lat2, lon2):
    radius = 6371
    lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    c = 2 * np.arcsin(np.sqrt(a))
    total_dist = radius * c
    return total_dist


df1 = pd.DataFrame({
    'lat':[-14.345234, -12.456345, -20.111111, -15.222222, -16.111111],
    'lon':[145.632423, 143.653535, 147.111111, 146.222222, 148.111111]
})

df2 = pd.DataFrame({
    'lat':[-14.345234, -12.456345, -20.111111, -15.222222, -16.111111],
    'lon':[145.632423, 143.653535, 147.111111, 146.222222, 148.111111]
})

# New column for df1, just a list of zeros
df1['distance'] = np.zeros(len(df1))

# Iterate over the rows of df1.
for index, row in df1.iterrows():
    # For each row in df1, iterate over the rows of df2.
    for index2, row2 in df2.iterrows():
        # Calculate the distance between each pair of locations.
        df1.loc[index, 'distance'] += haversine(row['lat'], row['lon'],
                                                row2['lat'], row2['lon'])
# New column for df2, just a list of zeros
df2['distance'] = np.zeros(len(df2))

# Iterate over the rows of df2.
for index, row in df2.iterrows():
    # For each row in df2, iterate over the rows of df1.
    for index2, row2 in df1.iterrows():
        # Calculate the distance between each pair of locations.
        df2.loc[index, 'distance'] += haversine(row['lat'], row['lon'],
                                                row2['lat'], row2['lon'])


"""
The above is a bit of a mess, but it does work. I am not sure how to convert it to a class method though.
"""

【Discussion】:

This does not achieve the goal. I need the distance from every restaurant to every location, i.e. 5x5 = 25, not 5x2 = 10.
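One way to get all 25 values as rows, sketched here under the assumption that pandas 1.2 or later is available (for `how='cross'`), is to cross join the two frames and apply a vectorized haversine to the resulting columns. The column names `location_id`, `restaurant_id` and `distance_km` and the `_loc`/`_rest` suffixes are illustrative choices; the sample data is the data from Solution 3.

```python
import numpy as np
import pandas as pd


def haversine(lat1, lon1, lat2, lon2):
    """Vectorized haversine distance in km; inputs are in degrees."""
    radius = 6371
    lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius * np.arcsin(np.sqrt(a))


df1 = pd.DataFrame({
    'lat': [-14.345234, -12.456345, -20.111111, -15.222222, -16.111111],
    'lon': [145.632423, 143.653535, 147.111111, 146.222222, 148.111111],
})
df2 = df1.copy()  # same sample coordinates as in Solution 3

# Cross join: every row of df1 is paired with every row of df2.
left = df1.reset_index().rename(columns={'index': 'location_id'})
right = df2.reset_index().rename(columns={'index': 'restaurant_id'})
pairs = left.merge(right, how='cross', suffixes=('_loc', '_rest'))

# A single vectorized call computes all 25 distances.
pairs['distance_km'] = haversine(pairs['lat_loc'], pairs['lon_loc'],
                                 pairs['lat_rest'], pairs['lon_rest'])
print(len(pairs))  # 25
```

If a wide table is preferred, `pairs.pivot(index='location_id', columns='restaurant_id', values='distance_km')` reshapes the result into one row per location and one column per restaurant. Note that the cross-joined frame has n × m rows, so at the full scale in the question it would be very large.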