Pandas: calculate haversine distance within each group of rows

【Question title】: Pandas: calculate haversine distance within each group of rows
【Posted】: 2017-04-23 22:14:22
【Question】:

The sample CSV looks like this:

 user_id  lat         lon
    1   19.111841   72.910729
    1   19.111342   72.908387
    2   19.111542   72.907387
    2   19.137815   72.914085
    2   19.119677   72.905081
    2   19.129677   72.905081
    3   19.319677   72.905081
    3   19.120217   72.907121
    4   19.420217   72.807121
    4   19.520217   73.307121
    5   19.319677   72.905081
    5   19.419677   72.805081
    5   19.629677   72.705081
    5   19.111860   72.911347
    5   19.111860   72.931346
    5   19.219677   72.605081
    6   19.319677   72.805082
    6   19.419677   72.905086

I know I can use the haversine formula for the distance calculation (and Python also has a haversine package):

import math

def haversine(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points 
    on the earth (specified in decimal degrees).
    Source: http://gis.stackexchange.com/a/56589/15183
    """
    # convert decimal degrees to radians 
    lon1, lat1, lon2, lat2 = map(math.radians, [lon1, lat1, lon2, lat2])
    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = math.sin(dlat/2)**2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon/2)**2
    c = 2 * math.asin(math.sqrt(a)) 
    km = 6371 * c
    return km

However, I only want to calculate distances within the same id. So the expected result looks like this:

user_id  lat         lon    result
    1   19.111841   72.910729   NaN
    1   19.111342   72.908387   xx*
    2   19.111542   72.907387   NaN
    2   19.137815   72.914085   xx
    2   19.119677   72.905081   xx
    2   19.129677   72.905081   xx
    3   19.319677   72.905081   NaN
    3   19.120217   72.907121   xx
    4   19.420217   72.807121   NaN
    4   19.520217   73.307121   xx
    5   19.319677   72.905081   NaN
    5   19.419677   72.805081   xx
    5   19.629677   72.705081   xx
    5   19.111860   72.911347   xx
    5   19.111860   72.931346   xx
    5   19.219677   72.605081   xx
    6   19.319677   72.805082   NaN
    6   19.419677   72.905086   xx

*: xx is the distance in kilometres.

How can I do this?

PS: I am using pandas.

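【Comments】:

Why are there 4 entries with the same id but duplicate values? How do you calculate the distance between 4 entries?
You already know how to get the distance; your question seems to be more about how to group your data. Is that correct?
You should change the title of the question, since it has nothing to do with the distance calculation. Also, the question @EyuelDK asked is still unanswered. You have more than two elements with the same ID; how do you want to get the distances for all of them? Between all possible combinations? Between adjacent elements?
@Gabriel, why did you remove the pandas tag?
No, but I would expect a question tagged python to be related to it in some way. This question is clearly related to csv, while it has nothing to do with pandas (apart from the OP apparently using pandas in some code that is not posted here). I did not remove the haversine tag, but it is just as irrelevant as the pandas tag. I guess if you must have pandas here, so be it. Cheers.

Tags: python csv pandas gis distance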

【Solution 1】:

Try this approach:

import pandas as pd
import numpy as np

# parse CSV to DataFrame. You may want to specify the separator (`sep='...'`)
df = pd.read_csv('/path/to/file.csv')

# vectorized haversine function
def haversine(lat1, lon1, lat2, lon2, to_radians=True, earth_radius=6371):
    """
    slightly modified version of http://stackoverflow.com/a/29546836/2901002

    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees or in radians)

    All (lat, lon) coordinates must have numeric dtypes and be of equal length.

    """
    if to_radians:
        lat1, lon1, lat2, lon2 = np.radians([lat1, lon1, lat2, lon2])

    a = np.sin((lat2-lat1)/2.0)**2 + \
        np.cos(lat1) * np.cos(lat2) * np.sin((lon2-lon1)/2.0)**2

    return earth_radius * 2 * np.arcsin(np.sqrt(a))

Now we can calculate the distances between consecutive coordinates belonging to the same id (group):

df['dist'] = \
    np.concatenate(df.groupby('id')
                     .apply(lambda x: haversine(x['lat'], x['lon'],
                                                x['lat'].shift(), x['lon'].shift())).values)

Result:

In [105]: df
Out[105]:
    id        lat        lon       dist
0    1  19.111841  72.910729        NaN
1    1  19.111342  72.908387   0.252243
2    2  19.111542  72.907387        NaN
3    2  19.137815  72.914085   3.004976
4    2  19.119677  72.905081   2.227658
5    2  19.129677  72.905081   1.111949
6    3  19.319677  72.905081        NaN
7    3  19.120217  72.907121  22.179974
8    4  19.420217  72.807121        NaN
9    4  19.520217  73.307121  53.584504
10   5  19.319677  72.905081        NaN
11   5  19.419677  72.805081  15.286775
12   5  19.629677  72.705081  25.594890
13   5  19.111860  72.911347  61.509917
14   5  19.111860  72.931346   2.101215
15   5  19.219677  72.605081  36.304756
16   6  19.319677  72.805082        NaN
17   6  19.419677  72.905086  15.287063
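
A possible alternative, sketched under the assumption that the frame has the columns id, lat and lon as in the output above: groupby(...).shift() returns the previous row's values within each group, already aligned to the original index, so a vectorized function can be called once on whole columns without apply or np.concatenate. The helper name haversine_np and the shortened inline sample are illustrative only.

```python
import io

import numpy as np
import pandas as pd

# First rows of the sample data; the id column is named `id` to match the output above.
sample = """id lat lon
1 19.111841 72.910729
1 19.111342 72.908387
2 19.111542 72.907387
2 19.137815 72.914085
2 19.119677 72.905081
2 19.129677 72.905081
"""
df = pd.read_csv(io.StringIO(sample), sep=r"\s+")


def haversine_np(lat1, lon1, lat2, lon2, earth_radius=6371):
    """Great-circle distance in km; inputs in decimal degrees (scalars or arrays)."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2.0) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2)
    return earth_radius * 2 * np.arcsin(np.sqrt(a))


# Previous point within each id; NaN on the first row of every group.
prev = df.groupby("id")[["lat", "lon"]].shift()

# NaN inputs propagate, so the first row of each group gets NaN as its distance.
df["dist"] = haversine_np(prev["lat"], prev["lon"], df["lat"], df["lon"])
print(df)
```

Because the shifted frame keeps the original index, the assignment does not depend on the rows being sorted by id, which the np.concatenate approach relies on.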

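【Comments】:

【Solution 2】:

You just need a working data structure: a dict of lists, with each lat/lon pair stored as a tuple. A quick prototype could look like this: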

from haversine import haversine  # pip3 install haversine
from collections import defaultdict

csv = """
1   19.111841   72.910729
1   19.111342   72.908387
2   19.111342   72.908387
2   19.137815   72.914085
2   19.119677   72.905081
2   19.119677   72.905081
3   19.119677   72.905081
3   19.120217   72.907121
5   19.119677   72.905081
5   19.119677   72.905081
5   19.119677   72.905081
5   19.111860   72.911346
5   19.111860   72.911346
5   19.119677   72.905081
6   19.119677   72.905081
6   19.119677   72.905081
"""

d = defaultdict(list)  # data structure !

for line in csv.splitlines():
    line = line.strip()  # remove whitespaces

    if not line:
        continue  # skip empty lines

    cId, lat, lon = line.split('   ')
    d[cId].append((float(lat), float(lon)))

for k, v in d.items():
    print ('Distance for id: ', k, haversine(v[0], v[1]))

Output:

Distance for id:  1 0.2522433072207346
Distance for id:  2 3.0039140173887557
Distance for id:  3 0.22257643412844885
Distance for id:  5 0.0
Distance for id:  6 0.0
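
The prototype above only compares the first two points of each id. A minimal sketch of extending the same dict-of-lists structure to every consecutive pair follows. To stay self-contained it uses a small pure-Python helper, haversine_km (a hypothetical name, with an Earth radius of 6371 km), instead of the haversine package, and a shortened sample taken from the question.

```python
import math
from collections import defaultdict


def haversine_km(p1, p2, earth_radius=6371.0):
    """Great-circle distance in km between two (lat, lon) tuples in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p1, *p2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return earth_radius * 2 * math.asin(math.sqrt(a))


rows = """
1   19.111841   72.910729
1   19.111342   72.908387
2   19.111542   72.907387
2   19.137815   72.914085
2   19.119677   72.905081
"""

points = defaultdict(list)  # id -> list of (lat, lon) tuples, in input order
for line in rows.splitlines():
    if not line.strip():
        continue  # skip empty lines
    cid, lat, lon = line.split()  # split on any run of whitespace
    points[cid].append((float(lat), float(lon)))

# NaN for the first point of each id, then the distance to the previous point.
distances = {
    cid: [float("nan")] + [haversine_km(a, b) for a, b in zip(pts, pts[1:])]
    for cid, pts in points.items()
}

for cid, dists in distances.items():
    print(cid, [round(d, 6) for d in dists])
```

Pairing each list with itself shifted by one, zip(pts, pts[1:]), yields exactly the adjacent pairs, so a group with n points produces n - 1 distances plus the leading NaN.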

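【Comments】:

【Solution 3】:

Assuming you want to compute haversine() between the first element of each user id group and every other entry in that group, this approach will work: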

# copying example data from OP
import pandas as pd
df = pd.read_clipboard() # alternately, df = pd.read_csv(filename)

def haversine_wrapper(row):
    # haversine(lon1, lat1, lon2, lat2) is the function defined in the question
    # return None when both lon/lat pairs are the same
    if (row['first_lon'] == row['lon']) and (row['first_lat'] == row['lat']):
        return None
    return haversine(row['first_lon'], row['first_lat'], row['lon'], row['lat'])

df['result'] = (df.merge(df.groupby('user_id', as_index=False)
                           .agg({'lat':'first','lon':'first'})
                           .rename(columns={'lat':'first_lat','lon':'first_lon'}), 
                         on='user_id')
                  .apply(haversine_wrapper, axis='columns'))

print(df)

Output:

user_id        lat        lon     result
 0    1  19.111841  72.910729        NaN
 1    1  19.111342  72.908387   0.252243
 2    2  19.111542  72.907387        NaN
 3    2  19.137815  72.914085   3.004976
 4    2  19.119677  72.905081   0.936454
 5    2  19.129677  72.905081   2.031021
 6    3  19.319677  72.905081        NaN
 7    3  19.120217  72.907121  22.179974
 8    4  19.420217  72.807121        NaN
 9    4  19.520217  73.307121  53.584504
 10   5  19.319677  72.905081        NaN
 11   5  19.419677  72.805081  15.286775
 12   5  19.629677  72.705081  40.346128
 13   5  19.111860  72.911347  23.117560
 14   5  19.111860  72.931346  23.272178
 15   5  19.219677  72.605081  33.395165
 16   6  19.319677  72.805082        NaN
 17   6  19.419677  72.905086  15.287063
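
Row-wise apply with axis='columns' can be slow on large frames. As a sketch of a vectorized variant (assuming the columns user_id, lat and lon; haversine_np is an illustrative helper, not part of the answer), groupby(...).transform('first') broadcasts each group's first point to every row of that group, which replaces the merge. Unlike the wrapper above, this version masks only the first row of each group, so a later row that coincides with the first point would get 0.0 rather than NaN.

```python
import numpy as np
import pandas as pd

# First two groups of the sample data from the question.
df = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 2, 2],
    "lat": [19.111841, 19.111342, 19.111542, 19.137815, 19.119677, 19.129677],
    "lon": [72.910729, 72.908387, 72.907387, 72.914085, 72.905081, 72.905081],
})


def haversine_np(lat1, lon1, lat2, lon2, earth_radius=6371):
    """Great-circle distance in km; inputs in decimal degrees (scalars or arrays)."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2.0) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2)
    return earth_radius * 2 * np.arcsin(np.sqrt(a))


grouped = df.groupby("user_id")

# Same shape and index as df: every row carries its group's first lat/lon.
first = grouped[["lat", "lon"]].transform("first")

df["result"] = haversine_np(first["lat"], first["lon"], df["lat"], df["lon"])

# The first row of each group is the reference point itself: report NaN there.
df.loc[grouped.cumcount() == 0, "result"] = np.nan
print(df)
```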

【Comments】:

【Solution 4】:

This should work exactly with your sample input and output.

Script

import csv
from haversine import haversine

with open('file.csv') as file:

    reader = csv.reader(file)
    next(reader) # skip header
    previous_row = (None, None, None)
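    for user_id, lat, lon in reader:  # file columns are user_id, lat, lon

        user_id, lat, lon = int(user_id), float(lat), float(lon)
        current_row = user_id, lat, lon
        distance = float('nan')

        # same id as the previous row: distance between consecutive (lat, lon) points
        if current_row[0] == previous_row[0]:
            distance = haversine(previous_row[1:], current_row[1:])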

        print('{} {:02.7f} {:02.7f} {:02.7f}'.format(*current_row, distance))
        previous_row = current_row

Output

1 19.1118410 72.9107290 nan
1 19.1113420 72.9083870 0.2522433
2 19.1115420 72.9073870 nan
2 19.1378150 72.9140850 3.0049762
2 19.1196770 72.9050810 2.2276576
2 19.1296770 72.9050810 1.1119493
3 19.3196770 72.9050810 nan
3 19.1202170 72.9071210 22.1799743
4 19.4202170 72.8071210 nan
4 19.5202170 73.3071210 53.5845041
5 19.3196770 72.9050810 nan
5 19.4196770 72.8050810 15.2867753
5 19.6296770 72.7050810 25.5948897
5 19.1118600 72.9113470 61.5099175
5 19.1118600 72.9313460 2.1012148
5 19.2196770 72.6050810 36.3047557
6 19.3196770 72.8050820 nan
6 19.4196770 72.9050860 15.2870632

【Comments】: