Python: group by time and space

【Question title】: Python: group by time and space
【Posted】: 2023-03-03 22:49:01
【Question】:

    ID  timestamp                lat          lon
0   A   2020-03-20 00:17:10 42.360000   -71.090000
1   A   2020-03-20 00:20:51 42.360000   -71.090000
2   A   2020-03-20 00:35:31 42.360000   -71.090000
3   A   2020-03-20 00:35:34 42.360000   -71.090000
4   B   2020-03-20 01:48:14 42.360000   -71.100000
5   B   2020-03-20 03:15:00 42.360000   -71.100000
6   C   2020-03-20 11:05:47 42.365259   -71.103502
7   D   2020-03-20 10:53:43 42.363174   -71.096756
8   D   2020-03-20 10:57:45 42.363260   -71.095598
9   D   2020-03-20 11:04:24 42.363303   -71.094997

I want to see whether, at any time of day, two users overlap within a 100 m radius for at least 10 seconds. I would like output like the following:

df
      userI      userJ     centroid.lat  centroid.lon     time
0      A          B         42.360000      -71.094997      33s
1      B          D         42.365259      -71.103502      5s

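【Comments】:

Tags: python pandas geopandas haversine

【Solution 1】:

I added a few extra rows so that your data works better for this example. Note that this approach would need a lot of optimization and error handling to scale well.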

   ID  timestamp                lat          lon
0   A   2020-03-20 00:17:10 42.360000   -71.090000
1   A   2020-03-20 00:20:51 42.360000   -71.090000
2   A   2020-03-20 00:35:31 42.360000   -71.090000
3   A   2020-03-20 00:35:34 42.360000   -71.090000
4   B   2020-03-20 01:48:14 42.360000   -71.100000
5   B   2020-03-20 03:15:00 42.360000   -71.100000
6   C   2020-03-20 11:05:47 42.365259   -71.103502
7   D   2020-03-20 10:53:43 42.363174   -71.096756
8   D   2020-03-20 10:57:45 42.363260   -71.095598
9   D   2020-03-20 11:04:24 42.363303   -71.094997
10  E   2020-03-20 00:35:33 42.360001   -71.090001
11  F   2020-03-20 01:48:15 42.360003   -71.100099

Next we need to tidy up the df a little.

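import datetime

import numpy as np
import pandas as pd
from scipy import spatial

# Copy the table above to the clipboard first. Columns are split on runs of
# two or more spaces. The slicing below assumes that the single space between
# timestamp and latitude leaves both values in the 'lat' column, and that the
# IDs end up under the 'timestamp' header.
df = pd.read_clipboard(sep=r"[ ]{2,}")
df['lat_fix'] = df['lat'].str[-10:]   # trailing characters hold the latitude
df['time'] = df['lat'].str[0:19]      # first 19 characters hold the timestamp
df['ID'] = df['timestamp']
df['lat'] = df['lat_fix']
df = df[['ID', 'time', 'lat', 'lon']].copy()
df['lat'] = pd.to_numeric(df['lat'])
df['time'] = pd.to_datetime(df['time'])
df['idx'] = range(df.shape[0])        # positional id, used to map KD-tree hits back to rows
df = df.set_index('time')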
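
Reading from the clipboard is hard to reproduce, so here is a minimal sketch that builds the same frame from an inline CSV string instead. The comma-separated layout and the variable name `raw` are assumptions made for illustration; the resulting columns (`ID`, `lat`, `lon`, `idx`, indexed by `time`) match what the steps below expect.

```python
import io

import pandas as pd

# Sample rows from this thread, rewritten as CSV (an assumed layout)
raw = """ID,timestamp,lat,lon
A,2020-03-20 00:17:10,42.360000,-71.090000
A,2020-03-20 00:20:51,42.360000,-71.090000
A,2020-03-20 00:35:31,42.360000,-71.090000
A,2020-03-20 00:35:34,42.360000,-71.090000
B,2020-03-20 01:48:14,42.360000,-71.100000
B,2020-03-20 03:15:00,42.360000,-71.100000
C,2020-03-20 11:05:47,42.365259,-71.103502
D,2020-03-20 10:53:43,42.363174,-71.096756
D,2020-03-20 10:57:45,42.363260,-71.095598
D,2020-03-20 11:04:24,42.363303,-71.094997
E,2020-03-20 00:35:33,42.360001,-71.090001
F,2020-03-20 01:48:15,42.360003,-71.100099
"""

# Parse the timestamp column as datetimes while reading
df = pd.read_csv(io.StringIO(raw), parse_dates=["timestamp"])
df = df.rename(columns={"timestamp": "time"})
df["idx"] = range(len(df))   # positional id for each row
df = df.set_index("time")

print(df.head())
```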

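Then we find the points within the distance threshold. distance_thresh_list stores a list of lists, where each sublist holds the idx values of a group of points that are less than ~100 m apart.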

# Coordinates as an (n, 2) array of (lon, lat) pairs
points = df[['lon', 'lat']].to_numpy()
tree = spatial.cKDTree(points)

distance_thresh_list = []
for p in points:
    # 0.0009 decimal degrees of latitude is very close to 100 m;
    # the same offset in longitude is shorter away from the equator
    neighbours = tree.query_ball_point(p, 0.0009)
    if len(neighbours) > 1 and neighbours not in distance_thresh_list:
        distance_thresh_list.append(neighbours)
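
The 0.0009 degree radius treats latitude and longitude offsets as equal, which is only approximately true. As a rough check, the sketch below converts a small offset in degrees to meters under a spherical-Earth assumption. The function name `degrees_to_meters` and the radius constant are illustrative choices, not part of the answer above.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius, spherical approximation


def degrees_to_meters(lat_deg, dlat_deg, dlon_deg):
    """Approximate north-south and east-west size, in meters, of a small
    offset given in degrees at latitude lat_deg."""
    north_south = math.radians(dlat_deg) * EARTH_RADIUS_M
    # Meridians converge towards the poles, so scale by cos(latitude)
    east_west = (
        math.radians(dlon_deg) * EARTH_RADIUS_M * math.cos(math.radians(lat_deg))
    )
    return north_south, east_west


ns, ew = degrees_to_meters(42.36, 0.0009, 0.0009)
print(f"{ns:.1f} m north-south, {ew:.1f} m east-west")
```

At the latitude of the sample data the same offset comes out at roughly 100 m north-south but only about 74 m east-west, so a search in raw degrees covers a slightly elongated region rather than a true 100 m circle. Projecting the coordinates to meters before building the tree would be one way to tighten this.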

Then we look for groups that contain more than one unique ID.

spatial_matches_list = []      # groups of IDs that were close in space
df_spatial_match_list = []     # the rows behind each group

for i in distance_thresh_list:
    df_slice = df[df['idx'].isin(i)]
    uniq_id_list = df_slice.ID.unique().tolist()
    # keep only groups with at least two different IDs, once each
    if len(uniq_id_list) > 1 and uniq_id_list not in spatial_matches_list:
        print(uniq_id_list)
        spatial_matches_list.append(uniq_id_list)

        # rows of this neighbourhood only
        df_spatial_match = df[df['idx'].isin(i)]
        print(i)
        print(df_spatial_match)

        df_spatial_match_list.append(df_spatial_match)

Finally, we look for matches in time.

# Collect every slice in which two or more IDs fall within 10 seconds of a row
df_spatial_match_slice_list = []

for df_match in df_spatial_match_list:
    for idx, row in df_match.iterrows():
        before_window = idx - datetime.timedelta(seconds=10)
        after_window = idx + datetime.timedelta(seconds=10)

        df_spatial_match_slice = df_match[(df_match.index >= before_window) & (df_match.index <= after_window)]

        if df_spatial_match_slice['ID'].nunique() > 1:
            print(df_spatial_match_slice)
            df_spatial_match_slice_list.append(df_spatial_match_slice)

Here are the matches (with duplicates).

                    ID        lat        lon  idx
time                                             
2020-03-20 00:35:31  A  42.360000 -71.090000    2
2020-03-20 00:35:34  A  42.360000 -71.090000    3
2020-03-20 00:35:33  E  42.360001 -71.090001   10
                    ID        lat        lon  idx
time                                             
2020-03-20 00:35:31  A  42.360000 -71.090000    2
2020-03-20 00:35:34  A  42.360000 -71.090000    3
2020-03-20 00:35:33  E  42.360001 -71.090001   10
                    ID        lat        lon  idx
time                                             
2020-03-20 00:35:31  A  42.360000 -71.090000    2
2020-03-20 00:35:34  A  42.360000 -71.090000    3
2020-03-20 00:35:33  E  42.360001 -71.090001   10
                    ID        lat        lon  idx
time                                             
2020-03-20 01:48:14  B  42.360000 -71.100000    4
2020-03-20 01:48:15  F  42.360003 -71.100099   11
                    ID        lat        lon  idx
time                                             
2020-03-20 01:48:14  B  42.360000 -71.100000    4
2020-03-20 01:48:15  F  42.360003 -71.100099   11
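
The repeats appear because every row of a matching group produces the same window. One possible way to keep each slice once is to key it on its sorted idx values, as in the sketch below. The helper name `drop_repeated_slices` and the small stand-in frame are hypothetical; they only mimic the shape of the slices printed above.

```python
import pandas as pd


def drop_repeated_slices(slices):
    """Keep the first occurrence of each slice, identified by its sorted
    'idx' values."""
    seen = set()
    unique = []
    for frame in slices:
        key = tuple(sorted(frame["idx"]))
        if key not in seen:
            seen.add(key)
            unique.append(frame)
    return unique


# Stand-in for one matched slice, printed twice in the output above
match = pd.DataFrame(
    {"ID": ["B", "F"], "idx": [4, 11]},
    index=pd.to_datetime(["2020-03-20 01:48:14", "2020-03-20 01:48:15"]),
)

unique = drop_repeated_slices([match, match.copy()])
print(len(unique))
```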

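So the code above only checks whether IDs were near each other within a time window. If we want to calculate how long the IDs were close to each other, we can do the following.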

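# Maps each ID to [first timestamp, last timestamp] within its matched slices
id_min_max_dict = {}

# df_spatial_match_slice_list holds the time-matched slices from the previous step
for i in df_spatial_match_slice_list:
    for j in i.ID.unique().tolist():

        id_slice = i.loc[i['ID'] == j]
        id_slice_time_max = id_slice.index.max()
        id_slice_time_min = id_slice.index.min()

        id_min_max_dict[j] = [id_slice_time_min, id_slice_time_max]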

Once we have a dict storing the time ranges, we can see how many seconds the IDs at the same location share.

for i in spatial_matches_list:
    print(i)
    # One timestamp per second between each ID's first and last matched row
    time_range1 = pd.date_range(id_min_max_dict[i[0]][0], id_min_max_dict[i[0]][1], freq='s')
    time_range2 = pd.date_range(id_min_max_dict[i[1]][0], id_min_max_dict[i[1]][1], freq='s')

    time_range_intersection = time_range1.intersection(time_range2)
    print(time_range_intersection)
    print(str(len(time_range_intersection)) + ' seconds of time within ~100m')

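So the time/location intersections look like this. FWIW, this is not very exciting without more rows of sample data, and the approach would need extra complexity to handle more than two unique IDs.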

['A', 'E']
DatetimeIndex(['2020-03-20 00:35:33'], dtype='datetime64[ns]', freq=None)
1 seconds of time within ~100m
['B', 'F']
DatetimeIndex([], dtype='datetime64[ns]', freq=None)
0 seconds of time within ~100m
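
Materializing one timestamp per second can get expensive for long ranges. A possible alternative is to compute the overlap of two intervals directly from their endpoints. The sketch below assumes each ID's presence can be summarized as a single closed interval; the function name `overlap_duration` and the two example intervals are hypothetical.

```python
import pandas as pd


def overlap_duration(start1, end1, start2, end2):
    """Length of the overlap between two closed time intervals
    (zero if they do not overlap)."""
    latest_start = max(start1, start2)
    earliest_end = min(end1, end2)
    return max(earliest_end - latest_start, pd.Timedelta(0))


# Hypothetical presence intervals for two IDs
a = (pd.Timestamp("2020-03-20 00:35:00"), pd.Timestamp("2020-03-20 00:35:34"))
b = (pd.Timestamp("2020-03-20 00:35:20"), pd.Timestamp("2020-03-20 00:36:00"))

overlap = overlap_duration(*a, *b)
print(overlap.total_seconds())
```

This measures elapsed duration, whereas the date_range intersection counts shared one-second timestamps including both ends, so the two figures can differ by one.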

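【Comments】:

【Solution 2】:

I don't know what you have tried, but you could start like this. I did not take the 10-second condition into account, but it is easy to add. I used geopy.distance.distance to measure distance. The code below stores the encounters in a list, from which you can easily build a new data frame.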

from itertools import combinations

import geopy.distance

# Assumes df has a two-level MultiIndex (row number, ID) and the columns
# "date", "lat" and "lon".

# threshold distance in km
threshold_distance = 0.1

# list of IDs
id_list = list(df.index.levels[1])

# combinations of IDs
combs = list(combinations(id_list, 2))

# list to store the indices of the meetings
meetings = []

# go through combinations
for i, j in combs:

    # get the row numbers of both IDs
    i_indices = [a[0] for a in df.loc[df.index.get_level_values(1) == i].index.values]
    j_indices = [a[0] for a in df.loc[df.index.get_level_values(1) == j].index.values]

    # go through the IDs' data
    for i_index in i_indices:
        for j_index in j_indices:
            # skip pairs of rows whose dates differ
            if df.at[(i_index, i), "date"] != df.at[(j_index, j), "date"]:
                continue

            # use geopy to calculate the distance from the coordinates
            coords1 = (df.at[(i_index, i), "lat"], df.at[(i_index, i), "lon"])
            coords2 = (df.at[(j_index, j), "lat"], df.at[(j_index, j), "lon"])
            if geopy.distance.distance(coords1, coords2).km < threshold_distance:
                meetings.append((i_index, j_index))
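
The time condition left out above could be added roughly as follows. This is a minimal sketch, not the answer's own code: it swaps geopy for a hand-written haversine formula so that it runs with the standard library only, it works on plain tuples rather than the MultiIndex frame, and it treats two observations as a meeting when they are at most 10 seconds and 100 m apart, which is a simplification of "overlapping for at least 10 seconds". The names `haversine_m`, `records`, `max_distance_m` and `max_time_gap` are illustrative.

```python
from datetime import datetime, timedelta
from itertools import combinations
from math import asin, cos, radians, sin, sqrt


def haversine_m(lat1, lon1, lat2, lon2, radius_m=6_371_000):
    """Great-circle distance in meters (spherical Earth approximation)."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * radius_m * asin(sqrt(a))


# (ID, timestamp, lat, lon) rows taken from the sample data in this thread
records = [
    ("A", "2020-03-20 00:35:31", 42.360000, -71.090000),
    ("A", "2020-03-20 00:35:34", 42.360000, -71.090000),
    ("B", "2020-03-20 01:48:14", 42.360000, -71.100000),
    ("C", "2020-03-20 11:05:47", 42.365259, -71.103502),
    ("D", "2020-03-20 11:04:24", 42.363303, -71.094997),
    ("E", "2020-03-20 00:35:33", 42.360001, -71.090001),
    ("F", "2020-03-20 01:48:15", 42.360003, -71.100099),
]
rows = [(uid, datetime.fromisoformat(ts), lat, lon) for uid, ts, lat, lon in records]

max_distance_m = 100
max_time_gap = timedelta(seconds=10)

meetings = set()
for (id1, t1, lat1, lon1), (id2, t2, lat2, lon2) in combinations(rows, 2):
    if id1 == id2:
        continue  # same user, not a meeting
    if abs(t1 - t2) > max_time_gap:
        continue  # too far apart in time
    if haversine_m(lat1, lon1, lat2, lon2) <= max_distance_m:
        meetings.add(tuple(sorted((id1, id2))))

print(sorted(meetings))
```

Comparing every pair of rows is quadratic in the number of observations, so for larger data sets a spatial index or a time-sorted sweep would likely be needed.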

【Comments】: