【Question title】: Calculate two dimensional pairwise distance on a large numpy three dimensional array
【Posted】: 2016-02-15 13:13:32
【Question】:

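I have a numpy array of 3 million points in the form [pt_id, x, y, z]. The goal is to return all pairs of points whose Euclidean distance lies between two numbers, min_d and max_d.

The Euclidean distance is computed on x and y only, not on z. However, I'd like to keep an array with the attributes pt_id_from, pt_id_to, and distance.

I'm using scipy's pdist to calculate the distances: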

import numpy as np
import scipy.spatial.distance
coords_arr = np.array([['pt1', 2452130.000, 7278106.000, 25.000],
                       ['pt2', 2479539.000, 7287455.000, 4.900],
                       ['pt3', 2479626.000, 7287458.000, 10.000],
                       ['pt4', 2484097.000, 7292784.000, 8.800],
                       ['pt5', 2484106.000, 7293079.000, 7.300],
                       ['pt6', 2484095.000, 7292891.000, 11.100]])

dists = scipy.spatial.distance.pdist(coords_arr[:,1:3], 'euclidean')
np.savetxt('test.out', scipy.spatial.distance.squareform(dists), delimiter=',')

How can I return an array of the form [pt_id_from, pt_id_to, distance]?

【Comments】:

  • Could you also add the expected output for the sample case?
  • @Divakar I fixed the output format, so `['pt1', 'pt2', distance_as_number]`

Tags: python-2.7 numpy scipy pdist


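【Solution 1】:

You can simply build a new array from the data by looping over all possible combinations of points. The itertools module is well suited to this.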

import itertools

n = coords_arr.shape[0] # number of points
D = scipy.spatial.distance.squareform(dists) # distance matrix

data = []
for i, j in itertools.combinations(range(n), 2):
    pt_a = coords_arr[i, 0]
    pt_b = coords_arr[j, 0]
    d_ab = D[i,j]
    data.append([pt_a, pt_b, d_ab])

result_arr = np.array(data)

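If memory is a problem, you may want to change the distance lookup from using the huge square matrix D to looking values up directly in dists, using the i and j indices.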
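A minimal sketch of that direct lookup, assuming the condensed layout described in the scipy documentation for pdist: for i < j < n, the distance between points i and j is stored at position n * i + j - ((i + 2) * (i + 1)) // 2. The helper name `condensed_index` and the small sample arrays are hypothetical, chosen only for illustration.

```python
import itertools

import numpy as np
from scipy.spatial.distance import pdist


def condensed_index(i, j, n):
    """Position of the pair (i, j), with i < j, in pdist's condensed output."""
    return n * i + j - ((i + 2) * (i + 1)) // 2


# Hypothetical sample data: ids and xy coordinates kept in separate arrays.
ids = np.array(['pt1', 'pt2', 'pt3', 'pt4'])
xy = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0], [0.0, 1.0]])

n = xy.shape[0]
dists = pdist(xy, 'euclidean')  # length n * (n - 1) / 2, no square matrix built

data = []
for i, j in itertools.combinations(range(n), 2):
    # Look the distance up in the condensed vector instead of D[i, j].
    data.append((ids[i], ids[j], dists[condensed_index(i, j, n)]))
```

This avoids building the n x n matrix, but the condensed vector itself still grows quadratically with the number of points, so it only reduces the memory use by roughly half.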

【Comments】:

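    【Solution 2】:

    Well, ['pt1', 'pt2', distance_as_number] is not exactly possible, because a plain ndarray holds a single data type. The closest you can get with mixed data types is a structured array, but then you can't do things like result[:2,0]. You have to index field names and array positions separately, for example: result[['a','b']][0]

    Here is my solution: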

    import numpy as np
    import scipy.spatial.distance
    
    coords_arr = np.array([['pt1', 2452130.000, 7278106.000, 25.000],
                           ['pt2', 2479539.000, 7287455.000, 4.900],
                           ['pt3', 2479626.000, 7287458.000, 10.000],
                           ['pt4', 2484097.000, 7292784.000, 8.800],
                           ['pt5', 2484106.000, 7293079.000, 7.300],
                           ['pt6', 2484095.000, 7292891.000, 11.100]])
    
    dists = scipy.spatial.distance.pdist(coords_arr[:,1:3], 'euclidean')
    
    # Create a shortcut for `coords_arr.shape[0]` which is basically
    # the total amount of points, hence `n`
    n = coords_arr.shape[0]
    
    # `a` and `b` contain the indices of the points which were used to compute the
    # distances in dists. In this example:
    # a = [0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4]
    # b = [1, 2, 3, 4, 5, 2, 3, 4, 5, 3, 4, 5, 4, 5, 5]
    a = np.arange(n).repeat(np.arange(n-1, -1, -1))
    b = np.hstack([range(x, n) for x in range(1, n)])
    
    min_d = 1000
    max_d = 10000
    
    # Find out which distances are in range.
    in_range = np.less_equal(min_d, dists) & np.less_equal(dists, max_d)
    
    # Define the datatype of the structured array which will be the result.
    dtype = [('a', '<f8', (3,)), ('b', '<f8', (3,)), ('dist', '<f8')]
    
    # Create an empty array. We fill it later because it makes the code cleaner.
    # Its size is given by the sum over `in_range` which is possible
    # since True and False are equivalent to 1 and 0.
    result = np.empty(np.sum(in_range), dtype=dtype)
    
    # Fill the resulting array.
    result['a'] = coords_arr[a[in_range], 1:4]
    result['b'] = coords_arr[b[in_range], 1:4]
    result['dist'] = dists[in_range]
    
    print(result)
    
    # In case you don't want a structured array at all, this is what you can do:
    result = np.hstack([coords_arr[a[in_range],1:],
                        coords_arr[b[in_range],1:],
                        dists[in_range, None]]).astype('<f8')
    print(result)
    

    Structured array:

    [([2479539.0, 7287455.0, 4.9], [2484097.0, 7292784.0, 8.8], 7012.389393067102)
     ([2479539.0, 7287455.0, 4.9], [2484106.0, 7293079.0, 7.3], 7244.7819152821985)
     ([2479539.0, 7287455.0, 4.9], [2484095.0, 7292891.0, 11.1], 7092.75912462844)
     ([2479626.0, 7287458.0, 10.0], [2484097.0, 7292784.0, 8.8], 6953.856268287403)
     ([2479626.0, 7287458.0, 10.0], [2484106.0, 7293079.0, 7.3], 7187.909362255481)
     ([2479626.0, 7287458.0, 10.0], [2484095.0, 7292891.0, 11.1], 7034.873843929257)]
    

    ndarray:

    [[2479539.0, 7287455.0, 4.9, 2484097.0, 7292784.0, 8.8, 7012.3893],
     [2479539.0, 7287455.0, 4.9, 2484106.0, 7293079.0, 7.3, 7244.7819],
     [2479539.0, 7287455.0, 4.9, 2484095.0, 7292891.0, 11.1, 7092.7591],
     [2479626.0, 7287458.0, 10.0, 2484097.0, 7292784.0, 8.8, 6953.8562],
     [2479626.0, 7287458.0, 10.0, 2484106.0, 7293079.0, 7.3, 7187.9093],
     [2479626.0, 7287458.0, 10.0, 2484095.0, 7292891.0, 11.1, 7034.8738]]
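    For the format originally asked for, [pt_id_from, pt_id_to, distance], the same filtering idea can be sketched with the ids stored in a structured array. This is only a sketch under a few assumptions: the ids and coordinates are held in two separate arrays (`ids` and `xyz`, hypothetical names) so the coordinates stay numeric, and `np.triu_indices` is used in place of the hand-built `a` and `b` index arrays, since it yields the pairs in the same order as pdist.

```python
import numpy as np
from scipy.spatial.distance import pdist

ids = np.array(['pt1', 'pt2', 'pt3', 'pt4', 'pt5', 'pt6'])
xyz = np.array([[2452130.0, 7278106.0, 25.0],
                [2479539.0, 7287455.0, 4.9],
                [2479626.0, 7287458.0, 10.0],
                [2484097.0, 7292784.0, 8.8],
                [2484106.0, 7293079.0, 7.3],
                [2484095.0, 7292891.0, 11.1]])

min_d, max_d = 1000, 10000

dists = pdist(xyz[:, :2], 'euclidean')   # x and y only, z is ignored
a, b = np.triu_indices(len(ids), k=1)    # (0,1), (0,2), ..., same order as pdist
in_range = (dists >= min_d) & (dists <= max_d)

# Hypothetical field names for the result.
pair_dtype = [('id_from', ids.dtype), ('id_to', ids.dtype), ('dist', 'f8')]
pairs = np.empty(in_range.sum(), dtype=pair_dtype)
pairs['id_from'] = ids[a[in_range]]
pairs['id_to'] = ids[b[in_range]]
pairs['dist'] = dists[in_range]

print(pairs)
```

    With the sample data this keeps the same six pairs as the outputs above (pt2 and pt3, each paired with pt4, pt5, and pt6).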
    

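    【Comments】:

    • Thanks for the great answer. On further inspection, I think I'd like the answer as (x_2, y_2, z_2), (x_4, y_4, z_4), (dist_2, 4)
    • @dassouki done :) I wasn't sure whether you still wanted a structured array, though, so I provided both output formats.
    【Solution 3】: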

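    You can use np.where to get the matrix coordinates of the distances that are in range, then build a new list in your format, filtering out the mirrored duplicate of each pair. Like this: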

    >>> import scipy.spatial.distance
    >>> import numpy as np
    >>> coords_arr = np.array([['pt1', 2452130.000, 7278106.000, 25.000],
    ...                        ['pt2', 2479539.000, 7287455.000, 4.900],
    ...                        ['pt3', 2479626.000, 7287458.000, 10.000],
    ...                        ['pt4', 2484097.000, 7292784.000, 8.800],
    ...                        ['pt5', 2484106.000, 7293079.000, 7.300],
    ...                        ['pt6', 2484095.000, 7292891.000, 11.100]])
    >>> 
    >>> dists = scipy.spatial.distance.pdist(coords_arr[:,1:3], 'euclidean')
    >>> dists = scipy.spatial.distance.squareform(dists)
    >>> x, y = np.where((dists >= 8000) & (dists <= 30000))
    >>> [(coords_arr[x[i]][0], coords_arr[y[i]][0], dists[y[i]][x[i]]) for i in range(len(x)) if x[i] < y[i]]
    [('pt1', 'pt2', 28959.576688895162), ('pt1', 'pt3', 29042.897927032005)]
    

    【Comments】:
