【Question title】: Python: group by time and space
【Posted】: 2023-03-03 22:49:01
【Question description】:
    ID  timestamp                lat          lon
0   A   2020-03-20 00:17:10 42.360000   -71.090000
1   A   2020-03-20 00:20:51 42.360000   -71.090000
2   A   2020-03-20 00:35:31 42.360000   -71.090000
3   A   2020-03-20 00:35:34 42.360000   -71.090000
4   B   2020-03-20 01:48:14 42.360000   -71.100000
5   B   2020-03-20 03:15:00 42.360000   -71.100000
6   C   2020-03-20 11:05:47 42.365259   -71.103502
7   D   2020-03-20 10:53:43 42.363174   -71.096756
8   D   2020-03-20 10:57:45 42.363260   -71.095598
9   D   2020-03-20 11:04:24 42.363303   -71.094997

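I want to find out whether, at any time of day, two users overlapped within a 100-meter radius for at least 10 seconds. I would like output like the following: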

df
      userI      userJ     centroid.lat  centroid.lon     time
0      A          B         42.360000      -71.094997      33s
1      B          D         42.365259      -71.103502      5s

【Question comments】:

    Tags: python pandas geopandas haversine


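    【Solution 1】:

    I added a few extra rows so that your data works better for this example. Note that this approach would need substantial optimization and error handling to scale well.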

       ID  timestamp                lat          lon
    0   A   2020-03-20 00:17:10 42.360000   -71.090000
    1   A   2020-03-20 00:20:51 42.360000   -71.090000
    2   A   2020-03-20 00:35:31 42.360000   -71.090000
    3   A   2020-03-20 00:35:34 42.360000   -71.090000
    4   B   2020-03-20 01:48:14 42.360000   -71.100000
    5   B   2020-03-20 03:15:00 42.360000   -71.100000
    6   C   2020-03-20 11:05:47 42.365259   -71.103502
    7   D   2020-03-20 10:53:43 42.363174   -71.096756
    8   D   2020-03-20 10:57:45 42.363260   -71.095598
    9   D   2020-03-20 11:04:24 42.363303   -71.094997
    10  E   2020-03-20 00:35:33 42.360001   -71.090001
    11  F   2020-03-20 01:48:15 42.360003   -71.100099
    

    Next we need to tweak the df a little.

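    import pandas as pd
    import datetime
    import numpy as np
    from scipy import spatial
    
    # Splitting on runs of 2+ spaces leaves the timestamp and lat together in the
    # 'lat' column and the IDs under 'timestamp'; the lines below undo that.
    df = pd.read_clipboard(sep=r"[ ]{2,}", engine="python")
    df['lat_fix'] = df['lat'].str[-10:]   # last 10 characters hold the latitude
    df['time'] = df['lat'].str[0:19]      # first 19 characters hold the timestamp
    df['ID'] = df['timestamp']
    df['lat'] = df['lat_fix']
    df = df[['ID', 'time', 'lat', 'lon']].copy()  # copy avoids SettingWithCopyWarning
    df['lat'] = pd.to_numeric(df['lat'])
    df['time'] = pd.to_datetime(df['time'])
    df['idx'] = range(df.shape[0])        # positional id used in the spatial step
    df.set_index('time', inplace=True)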
    

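    Then we find the points that fall within the distance threshold. distance_thresh_list stores a list of lists, where each sublist holds the idx values of a group of points that lie within ~100 m of a given point.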

    x, y = df['lon'], df['lat']
    points = np.column_stack((x.to_numpy(), y.to_numpy()))
    tree = spatial.cKDTree(points)
    
    distance_thresh_list = []
    for p in points:
        # 0.0009 decimal degrees is roughly 100 m of latitude (less in longitude at this latitude)
        neighbors = tree.query_ball_point(p, 0.0009)
        if len(neighbors) > 1 and neighbors not in distance_thresh_list:
            distance_thresh_list.append(neighbors)
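
The 0.0009-degree radius is only an approximation, because one degree of longitude covers less ground than one degree of latitude away from the equator. A minimal sketch of a haversine check (the question is tagged `haversine`), assuming a spherical Earth with a mean radius of 6,371 km; the name `haversine_m` is made up for illustration and does not come from the answer:

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius; a spherical approximation


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


# The same 0.0009-degree offset, applied north and then east, near 42.36 N
north = haversine_m(42.36, -71.09, 42.3609, -71.09)
east = haversine_m(42.36, -71.09, 42.36, -71.0891)
print(round(north), round(east))
```

At this latitude the offset comes to roughly 100 m going north but only about 74 m going east, so a radius in degrees is tighter in the east-west direction than the comment suggests.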
    

    Then we keep the groups that contain more than one unique ID.

    spatial_matches_list = []
    df_spatial_match_list = []
    
    for i in distance_thresh_list:
        df_slice = df[df['idx'].isin(i)]
        uniq_id_list = df_slice.ID.unique().tolist()
        # keep only groups with at least two distinct IDs, and skip ID sets already seen
        if len(uniq_id_list) > 1 and uniq_id_list not in spatial_matches_list:
            print(uniq_id_list)
            spatial_matches_list.append(uniq_id_list)
    
            # rows belonging to this spatial group
            df_spatial_match = df[df['idx'].isin(i)]
            print(i)
            print(df_spatial_match)
    
            df_spatial_match_list.append(df_spatial_match)
    

    Finally, we look for matches in time.

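    df_spatial_match_slice_list = []
    
    for df_match in df_spatial_match_list:
        for idx, row in df_match.iterrows():
            # +/- 10 second window around this row's timestamp (idx is the 'time' index)
            before_window = idx - datetime.timedelta(seconds=10)
            after_window = idx + datetime.timedelta(seconds=10)
    
            df_spatial_match_slice = df_match[(df_match.index >= before_window) & (df_match.index <= after_window)]
    
            # more than one ID inside the window means a match in both space and time
            if len(df_spatial_match_slice['ID'].unique().tolist()) > 1:
                print(df_spatial_match_slice)
                # kept for the duration calculation further down
                df_spatial_match_slice_list.append(df_spatial_match_slice)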
    

    Here are the matches (with duplicates).

                        ID        lat        lon  idx
    time                                             
    2020-03-20 00:35:31  A  42.360000 -71.090000    2
    2020-03-20 00:35:34  A  42.360000 -71.090000    3
    2020-03-20 00:35:33  E  42.360001 -71.090001   10
                        ID        lat        lon  idx
    time                                             
    2020-03-20 00:35:31  A  42.360000 -71.090000    2
    2020-03-20 00:35:34  A  42.360000 -71.090000    3
    2020-03-20 00:35:33  E  42.360001 -71.090001   10
                        ID        lat        lon  idx
    time                                             
    2020-03-20 00:35:31  A  42.360000 -71.090000    2
    2020-03-20 00:35:34  A  42.360000 -71.090000    3
    2020-03-20 00:35:33  E  42.360001 -71.090001   10
                        ID        lat        lon  idx
    time                                             
    2020-03-20 01:48:14  B  42.360000 -71.100000    4
    2020-03-20 01:48:15  F  42.360003 -71.100099   11
                        ID        lat        lon  idx
    time                                             
    2020-03-20 01:48:14  B  42.360000 -71.100000    4
    2020-03-20 01:48:15  F  42.360003 -71.100099   11
    

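    So the code above only checks whether IDs were close to each other within a time window. If we want to compute how long the IDs were near each other, we can do the following.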

    id_min_max_dict = {}
    
    for i in df_spatial_match_slice_list:
        for j in i.ID.unique().tolist():
    
            id_slice = i.loc[i['ID'] == j]
            id_slice_time_max = id_slice.index.max()
            id_slice_time_min = id_slice.index.min()
    
            id_min_max_dict[j] = [id_slice_time_min, id_slice_time_max]
    

    Once we have a dict storing the time ranges, we can see how many seconds the IDs at the same location share.

    for i in spatial_matches_list:
        print(i)
        # one timestamp per second between each ID's first and last sighting
        time_range1 = pd.date_range(id_min_max_dict[i[0]][0], id_min_max_dict[i[0]][1], freq='s')
        time_range2 = pd.date_range(id_min_max_dict[i[1]][0], id_min_max_dict[i[1]][1], freq='s')
    
        time_range_intersection = time_range1.intersection(time_range2)
        print(time_range_intersection)
        print(str(len(time_range_intersection)) + ' seconds of time within ~100m')
    

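    So the time/location intersections look like this. FWIW, this is not very exciting without more rows of sample data, and the approach needs additional complexity to handle more than 2 unique IDs.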

    ['A', 'E']
    DatetimeIndex(['2020-03-20 00:35:33'], dtype='datetime64[ns]', freq=None)
    1 seconds of time within ~100m
    ['B', 'F']
    DatetimeIndex([], dtype='datetime64[ns]', freq=None)
    0 seconds of time within ~100m
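
One possible way to go beyond two IDs, and to avoid building one timestamp per second, is to intersect the [first seen, last seen] intervals directly for every pair of IDs. This is a minimal sketch, assuming a dict shaped like id_min_max_dict (ID mapped to [min time, max time]); the helper name `overlap_seconds` and the ID "G" are hypothetical. It measures elapsed time, so a single shared instant counts as 0 seconds rather than the 1 reported by the date_range intersection.

```python
from datetime import datetime
from itertools import combinations


def overlap_seconds(a, b):
    """Elapsed seconds shared by two [start, end] intervals (0.0 if they are disjoint)."""
    start = max(a[0], b[0])
    end = min(a[1], b[1])
    return max((end - start).total_seconds(), 0.0)


def t(clock):
    """Build a datetime on 2020-03-20 from an HH:MM:SS string."""
    return datetime.fromisoformat(f"2020-03-20 {clock}")


id_min_max = {
    "A": [t("00:35:31"), t("00:35:34")],
    "E": [t("00:35:33"), t("00:35:33")],
    "G": [t("00:35:32"), t("00:35:40")],  # hypothetical extra ID, not in the sample data
}

# compare every unordered pair of IDs instead of only the first two
overlaps = {}
for id_i, id_j in combinations(sorted(id_min_max), 2):
    overlaps[(id_i, id_j)] = overlap_seconds(id_min_max[id_i], id_min_max[id_j])

for pair, seconds in overlaps.items():
    print(pair, seconds)
```

The same function should also work with pandas Timestamp values, since subtracting two of them gives a Timedelta that supports `total_seconds()`.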
    

    【Comments】:

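      【Solution 2】:

      I don't know what you have tried, but you could start like this. I did not take the 10-second duration into account, but that is easy to add. I used geopy.distance.distance to measure the distance. The code below stores the encounters in a list, from which you can easily build a new dataframe.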

      from itertools import combinations
      
      import geopy.distance
      
      # Assumption (not stated in the answer): df has a two-level MultiIndex of
      # (row number, ID) and a "date" column in addition to "lat" and "lon".
      
      # threshold distance in km
      threshold_distance = 0.1
      
      # list of IDs
      id_list = list(df.index.levels[1])
      
      # combinations of IDs
      combs = list(combinations(id_list, 2))
      
      # list to store the indices of the meetings
      meetings = []
      
      # go through combinations
      for i, j in combs:
      
          # get the indices (numbers) of both IDs
          i_indices = [a[0] for a in df.iloc[df.index.get_level_values(1) == i].index.values]
          j_indices = [a[0] for a in df.iloc[df.index.get_level_values(1) == j].index.values]
      
          # go through the ID's data
          for i_index in i_indices:
              for j_index in j_indices:
                  # skip pairs recorded on different dates
                  if df.at[(i_index, i), "date"] != df.at[(j_index, j), "date"]:
                      continue
      
                  # use geopy to calculate the distance from the coordinates
                  coords1 = (df.at[(i_index, i), "lat"], df.at[(i_index, i), "lon"])
                  coords2 = (df.at[(j_index, j), "lat"], df.at[(j_index, j), "lon"])
                  if geopy.distance.distance(coords1, coords2).km < threshold_distance:
                      meetings.append((i_index, j_index))
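
The 10-second condition that this answer leaves out could be added along the following lines. This is a minimal sketch under stated assumptions, not the answer's code: records are plain (ID, timestamp, lat, lon) tuples, distance uses a haversine approximation instead of geopy.distance.distance, the "centroid" is a simple midpoint, and two records count as a meeting when they are less than 100 m and at most 10 seconds apart. The names `haversine_m` and `find_meetings` are hypothetical.

```python
import math
from datetime import datetime, timedelta
from itertools import combinations


def haversine_m(lat1, lon1, lat2, lon2, radius_m=6_371_000):
    """Great-circle distance in meters, assuming a spherical Earth."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_m * math.asin(math.sqrt(a))


def find_meetings(records, max_dist_m=100.0, max_gap=timedelta(seconds=10)):
    """Return one dict per pair of records from different IDs that are close in space and time."""
    rows = []
    for (id1, t1, lat1, lon1), (id2, t2, lat2, lon2) in combinations(records, 2):
        # same user, or too far apart in time: not a meeting
        if id1 == id2 or abs(t1 - t2) > max_gap:
            continue
        dist = haversine_m(lat1, lon1, lat2, lon2)
        if dist < max_dist_m:
            rows.append({
                "userI": id1,
                "userJ": id2,
                "centroid_lat": (lat1 + lat2) / 2,  # midpoint, fine for short distances
                "centroid_lon": (lon1 + lon2) / 2,
                "gap_s": abs(t1 - t2).total_seconds(),
                "dist_m": dist,
            })
    return rows


ts = datetime.fromisoformat
records = [
    ("A", ts("2020-03-20 00:35:31"), 42.360000, -71.090000),
    ("A", ts("2020-03-20 00:35:34"), 42.360000, -71.090000),
    ("B", ts("2020-03-20 01:48:14"), 42.360000, -71.100000),
    ("E", ts("2020-03-20 00:35:33"), 42.360001, -71.090001),
    ("F", ts("2020-03-20 01:48:15"), 42.360003, -71.100099),
]

meetings = find_meetings(records)
for m in meetings:
    print(m)
```

The resulting list of dicts can be passed to `pd.DataFrame(meetings)` to get a table similar to the one requested in the question. The pairwise loop is quadratic in the number of records, so it is only practical for small inputs.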
      

      【Comments】:
