【Question title】: Find nearest value in dataframe?
【Posted】: 2022-01-21 21:05:45
【Question】:

Suppose I have a dataframe like the one below,

     0           1               2               3               4
0   (989, 998)  (1074, 999)     (1159, 1000)    (1244, 1001)    (1329, 1002)
1   (970, 1042) (1057, 1043)    (1143, 1044)    (1230, 1045)    (1316, 1046)
2   (951, 1088) (1039, 1089)    (1127, 1090)    (1214, 1091)    (1302, 1092)
3   (930, 1137) (1020, 1138)    (1109, 1139)    (1198, 1140)    (1287, 1141)
4   (909, 1188) (1000, 1189)    (1091, 1190)    (1181, 1191)    (1271, 1192)

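Each cell holds x and y coordinates as a tuple. I have an input called I, which is also a tuple of x and y coordinates. My goal is to find the point nearest to the input I.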

Sample input:

(1080, 1000)

Sample output:

(1074, 999)

I have tried the snippet below.

def find_nearest(array, key):
    min_ = 1000
    a = 0
    b = 0
    for item in array:
        diff = abs(item[0]-key[0])+abs(item[1]-key[1])
        if diff<min_:
            min_ = diff
            a,b = item
        if diff==0:
            return (a,b)
    return (a,b)
find_nearest(sum(df.values.tolist(), []), I)

This gives me what I expect. But is there a more efficient solution?

【Comments】:

  • I really appreciate all the effort. Thank you all very much.

Tags: python python-3.x pandas numpy


【Solution 1】:

Try:

import numpy as np
import pandas as pd

# Setup
data = [[(989, 998), (1074, 999), (1159, 1000), (1244, 1001), (1329, 1002)],
        [(970, 1042), (1057, 1043), (1143, 1044), (1230, 1045), (1316, 1046)],
        [(951, 1088), (1039, 1089), (1127, 1090), (1214, 1091), (1302, 1092)],
        [(930, 1137), (1020, 1138), (1109, 1139), (1198, 1140), (1287, 1141)],
        [(909, 1188), (1000, 1189), (1091, 1190), (1181, 1191), (1271, 1192)]]
df = pd.DataFrame(data)

l = (1080, 1000)

out = min(df.to_numpy().flatten(), key=lambda c: (c[0]- l[0])**2 + (c[1]-l[1])**2)
print(out)

# Output:
(1074, 999)

Update

Is there any way I can get the df index of the nearest element?

dist = df.stack().apply(lambda c: (c[0]- l[0])**2 + (c[1]-l[1])**2)
idx = dist.index[dist.argmin()]
val = df.loc[idx]

print(idx)
print(val)

# Output:
(0, 1)
(1074, 999)

Update 2

But is there a more efficient solution to this problem?

arr = df.to_numpy().astype([('x', int), ('y', int)])
dist = (arr['x'] - l[0])**2 + (arr['y'] - l[1])**2
idx = tuple(np.argwhere(dist == np.min(dist))[0])
val = arr[idx]  # or df.loc[idx]
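
Another vectorized variant, offered as a sketch: assuming every cell holds an (x, y) tuple, the frame can be unpacked into a plain 3-D integer array, after which `np.argmin` and `np.unravel_index` give the position directly. The helper name `nearest_cell` and the small demo frame are made up for illustration.

```python
import numpy as np
import pandas as pd

def nearest_cell(df, point):
    """Return ((row_label, col_label), cell) for the cell closest to point."""
    # Unpack the tuples into an integer array of shape (rows, cols, 2).
    coords = np.array(df.to_numpy().tolist(), dtype=np.int64)
    # Squared Euclidean distance; the square root is not needed for ranking.
    dist = ((coords - np.asarray(point)) ** 2).sum(axis=2)
    # argmin works on the flattened array, so map it back to (row, col).
    r, c = np.unravel_index(np.argmin(dist), dist.shape)
    r, c = int(r), int(c)
    return (df.index[r], df.columns[c]), df.iat[r, c]

data = [[(989, 998), (1074, 999), (1159, 1000)],
        [(970, 1042), (1057, 1043), (1143, 1044)]]
df_demo = pd.DataFrame(data)
idx_demo, val_demo = nearest_cell(df_demo, (1080, 1000))
print(idx_demo, val_demo)
```

Ranking by squared distance avoids a square root per cell and does not change which cell wins.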

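【Comments】:

  • Thanks for your solution. Is there any way I can get the df index of the nearest element?
  • @MohamedThasinah. I updated my answer. Can you check it, please?
  • Thanks again @Corralien
【Solution 2】: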

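How about this snippet I wrote?

import numpy as np

# coordinates: np.ndarray of shape (n, 2)
def find_nearest(coordinates, x, y):
    x_d = np.abs(coordinates[:, 0] - x)
    y_d = np.abs(coordinates[:, 1] - y)
    # Manhattan distance: sum of the absolute differences on each axis
    nearest_idx = np.argmin(x_d + y_d)
    return coordinates[nearest_idx]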
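
The function above expects an (n, 2) array rather than the frame of tuples from the question. A minimal sketch of the conversion follows, assuming every cell is an (x, y) tuple. The name `find_nearest_l1` is hypothetical, chosen to avoid clashing with the question's function, and the demo frame is a subset of the question's data.

```python
import numpy as np
import pandas as pd

def find_nearest_l1(coordinates, x, y):
    # coordinates: integer array of shape (n, 2)
    dist = np.abs(coordinates[:, 0] - x) + np.abs(coordinates[:, 1] - y)
    return coordinates[np.argmin(dist)]

df_small = pd.DataFrame([[(989, 998), (1074, 999)],
                         [(970, 1042), (1057, 1043)]])
# Flatten the cells row by row, then unpack each tuple into two columns.
points = np.array(df_small.to_numpy().ravel().tolist())
nearest = tuple(int(v) for v in find_nearest_l1(points, 1080, 1000))
print(nearest)
```

Note that this ranks by Manhattan distance, like the question's attempt, whereas Solution 1 uses squared Euclidean distance; the two can disagree on other inputs.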

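【Comments】:

    【Solution 3】:

    You can use swifter with applymap to speed up processing

    import swifter  # importing registers the .swifter accessor on DataFrames

    I = (1080, 1000)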
    
    diff = df.swifter.applymap(lambda item: abs(item[0]-I[0])+abs(item[1]-I[1]))
    
    col_index = diff.min(axis=0)[diff.min(axis=0) == diff.min(axis=0).min()].index[0]
    row_index = diff.min(axis=1)[diff.min(axis=1) == diff.min(axis=1).min()].index[0]
    
    df.loc[row_index, col_index]
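
If swifter is not installed, the same Manhattan-distance lookup can be sketched with plain pandas. Stacking the frame first yields a Series indexed by (row, column), so a single `idxmin` returns both labels at once. The variable names and the small demo frame below are illustrative only.

```python
import pandas as pd

df_demo = pd.DataFrame([[(989, 998), (1074, 999), (1159, 1000)],
                        [(970, 1042), (1057, 1043), (1143, 1044)]])
I = (1080, 1000)

# One distance per cell, indexed by (row label, column label).
diff = df_demo.stack().map(lambda item: abs(item[0] - I[0]) + abs(item[1] - I[1]))
row_index, col_index = diff.idxmin()
nearest = df_demo.loc[row_index, col_index]
print((row_index, col_index), nearest)
```

Taking row and column from the same `idxmin` call also guarantees they refer to one cell, which two separate per-axis lookups do not when several cells tie.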
    

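    【Comments】:

      【Solution 4】:

      It seems you just need a two-column DataFrame and the distance between each row and the sample coordinates. So here is my implementation:

      Your data came through as strings when I copied it. You don't actually need this line: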

      data = pd.Series(df.to_numpy().flatten()).str.strip().str.strip('()').str.split(',', expand=True).astype(int)
      sample = (1080, 1000)
      

      The solution starts here:

      distances = data.apply(lambda x: (x[0]-sample[0])**2+(x[1]-sample[1])**2, axis=1)
      out = tuple(data[distances == distances.min()].to_numpy()[0])
      

      Output:

      (1074, 999)
      

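      【Comments】:

        【Solution 5】:

        You can use the nmslib library, which lets you perform K-Nearest-Neighbor searches. Take a look at the example; you can easily implement such a system.

        PS: This may be overkill for a simple program, but it is a good way to solve the problem: simple and especially fast!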
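
nmslib itself is not shown here. As a lighter-weight stand-in for the same idea (build an index once, then query it), here is a sketch using `scipy.spatial.cKDTree`, which is a different library and performs exact rather than approximate search. The sample points are a subset of the question's data.

```python
import numpy as np
from scipy.spatial import cKDTree

points = np.array([(989, 998), (1074, 999), (1159, 1000),
                   (970, 1042), (1057, 1043), (1143, 1044)])

# Build the tree once; each later query avoids scanning every point.
tree = cKDTree(points)

# k=1 returns the Euclidean distance to, and index of, the single nearest point.
dist, pos = tree.query((1080, 1000), k=1)
nearest = tuple(int(v) for v in points[pos])
print(nearest)
```

A tree index pays off mainly when many queries are run against the same set of points; for a single lookup on 25 cells a direct scan is likely just as good.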

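        【Comments】:

          【Solution 6】:

          A solution that filters by some minimum value min_: build a DataFrame from the DataFrame.stack result with the DataFrame constructor, subtract I, square with DataFrame.pow, sum, and finally get the index with Series.idxmin: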

          I = (1080, 1000)
          
          min_ = 1000
          s1 = df.stack()
          s = pd.DataFrame(s1.to_list(), index=s1.index).sub(I).pow(2).sum(axis=1)
          s = s[s < min_]
          
          out = (0, 0) if s.empty else s1[s.idxmin()]  # look up the tuple in s1; s only holds distances
          print (out)
          

          For the index:

          idx = 'no match' if s.empty else s.idxmin()
          print (idx)
          # (0, '1')
          

          If filtering is not needed:

          I = (1080, 1000)
          
          s1 = df.stack()
          s = pd.DataFrame(s1.to_list(), index=s1.index).sub(I).pow(2).sum(axis=1)
          out = s1[s.idxmin()]  # look up the tuple in s1; s only holds distances
          print (out)
          # (1074, 999)
          
          print(s.idxmin())
          # (0, '1')
          

          【Comments】:

            【Solution 7】:

            You can use it like this

            import pandas as pd
            from scipy.spatial import distance
            
            data = [(989, 998), (1074, 999), (1159, 1000), (1244, 1001), (1329, 1002),
                    (970, 1042), (1057, 1043), (1143, 1044), (1230, 1045), (1316, 1046),
                    (951, 1088), (1039, 1089), (1127, 1090), (1214, 1091), (1302, 1092),
                    (930, 1137), (1020, 1138), (1109, 1139), (1198, 1140), (1287, 1141),
                    (909, 1188), (1000, 1189), (1091, 1190), (1181, 1191), (1271, 1192)]
            df = pd.DataFrame(data)
            df.columns = ['x', 'y']
            def find_nearest( df, x, y):
                min_distance = float('inf')
                index_of_nearest = -1  # returned as-is if no point qualifies
                for index, pos in enumerate(df.values):
                    x_coord, y_coord = pos
                    current_distance = distance.euclidean((x, y), (x_coord, y_coord))
                    if current_distance < min_distance and current_distance != 0 :
                        min_distance = current_distance
                        index_of_nearest= index
                return index_of_nearest
            
            print("index=",find_nearest(df,1080, 1000),"value=",data[find_nearest(df,1080, 1000)])
            

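            Or like this, which computes the nearest point for every element; you need to sort.

            import numpy as np

            # Overwrite the last row with the query point
            df.iloc[-1]=[1080, 1000]
            # Encode each (x, y) as a complex number so abs() gives the Euclidean distance
            z = np.array([[complex(c[0], c[1]) for c in df.values]])
            pairwise = abs(z.T - z)  # renamed so it does not shadow scipy's distance module
            masked_a = np.ma.masked_equal(pairwise, 0.0, copy=False)
            index=np.argmin(masked_a[:, len(masked_a)-1])
            print("index=",index,"value=",df.loc[index])
            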
            

            Update

            import numpy as np
            import pandas as pd
            from scipy.spatial import distance
            import timeit
            
            data = [(989, 998), (1074, 999), (1159, 1000), (1244, 1001), (1329, 1002),
                    (970, 1042), (1057, 1043), (1143, 1044), (1230, 1045), (1316, 1046),
                    (951, 1088), (1039, 1089), (1127, 1090), (1214, 1091), (1302, 1092),
                    (930, 1137), (1020, 1138), (1109, 1139), (1198, 1140), (1287, 1141),
                    (909, 1188), (1000, 1189), (1091, 1190), (1181, 1191), (1271, 1192)]
            df = pd.DataFrame(data)
            df.columns = ['x', 'y']
            def find_nearest( df, x, y):
                min_distance = float('inf')
                index_of_nearest = -1  # returned as-is if no point qualifies
                for index, pos in enumerate(df.values):
                    x_coord, y_coord = pos
                    current_distance = distance.euclidean((x, y), (x_coord, y_coord))
                    if current_distance < min_distance and current_distance != 0 :
                        min_distance = current_distance
                        index_of_nearest= index
                return index_of_nearest
            starttime = timeit.default_timer()
            print(data[find_nearest(df,1080, 1000)])
            print("The time difference 1 is :", timeit.default_timer() - starttime)
            #or
            starttime = timeit.default_timer()
            df.iloc[-1]=[1080, 1000]
            z = np.array([[complex(c[0], c[1]) for c in df.values]])
            Distance = abs(z.T - z)
            masked_a = np.ma.masked_equal(Distance, 0.0, copy=False)
            print(df.iloc[np.argmin(masked_a[:, len(masked_a)-1])])
            print("The time difference 2 is :", timeit.default_timer() - starttime)
            
            data = [[(989, 998), (1074, 999), (1159, 1000), (1244, 1001), (1329, 1002)],
                    [(970, 1042), (1057, 1043), (1143, 1044), (1230, 1045), (1316, 1046)],
                    [(951, 1088), (1039, 1089), (1127, 1090), (1214, 1091), (1302, 1092)],
                    [(930, 1137), (1020, 1138), (1109, 1139), (1198, 1140), (1287, 1141)],
                    [(909, 1188), (1000, 1189), (1091, 1190), (1181, 1191), (1271, 1192)]]
            df = pd.DataFrame(data)
            starttime = timeit.default_timer()
            l = (1080, 1000)
            out = min(df.to_numpy().flatten(), key=lambda c: (c[0]- l[0])**2 + (c[1]-l[1])**2)
            print(out)
            print("The time difference for method 3 is :", timeit.default_timer() - starttime)
            
            starttime = timeit.default_timer()
            dist = df.stack().apply(lambda c: (c[0]- l[0])**2 + (c[1]-l[1])**2)
            idx = dist.index[dist.argmin()]
            val = df.loc[idx]
            
            print(idx)
            print(val)
            print("The time difference for method 4 is :", timeit.default_timer() - starttime)
            
            starttime = timeit.default_timer()
            arr = df.to_numpy().astype([('x', int), ('y', int)])
            dist = (arr['x'] - l[0])**2 + (arr['y'] - l[1])**2
            idx = tuple(np.argwhere(dist == np.min(dist))[0])
            val = arr[idx]  # or df.loc[idx]
            print(val)
            print("The time difference for method 5 is :", timeit.default_timer() - starttime)
            
            starttime = timeit.default_timer()
            I = (1080, 1000)
            
            s1 = df.stack()
            s = pd.DataFrame(s1.to_list(), index=s1.index).sub(I).pow(2).sum(axis=1)
            out = s1[s.idxmin()]  # look up the tuple in s1; s only holds distances
            print (out)
            # (1074, 999)
            
            print(s.idxmin())
            # (0, '1')
            print("The time difference for method 6 is :", timeit.default_timer() - starttime)
            

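            Of all the answers, I found Corralien's to be the fastest.

            Update 2

            However, on a larger DataFrame, it starts to fall behind:

            【Comments】:

            • Hmm, it may be better to test on a larger DataFrame; 5 rows by 5 columns is small data for testing.
            • You're right. In my answer I used the second approach because it was faster, but now it is worse than my first one. However, I used it to find the nearest point for the whole matrix.
            • Yes, try with 100k rows; then the performance should differ, or maybe not, I don't know.
            • You're right, it is hit harder than the others: 0.0002272040001116693 versus 0.0009154750150628388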
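
To check how the approaches scale, as suggested in the comments, here is a rough timing sketch on 100,000 random points. The data is synthetic, the function names are made up for this example, and absolute timings vary by machine, so no figures are quoted.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
points = rng.integers(0, 2000, size=(100_000, 2))
target = (1080, 1000)

def nearest_python(pts, t):
    # Pure-Python scan, equivalent to min(..., key=...) over tuples.
    return min(map(tuple, pts.tolist()),
               key=lambda c: (c[0] - t[0]) ** 2 + (c[1] - t[1]) ** 2)

def nearest_numpy(pts, t):
    # Vectorized squared distance followed by a single argmin.
    d = ((pts - np.asarray(t)) ** 2).sum(axis=1)
    return tuple(pts[np.argmin(d)].tolist())

t_py = timeit.timeit(lambda: nearest_python(points, target), number=3)
t_np = timeit.timeit(lambda: nearest_numpy(points, target), number=3)
print(f"python min: {t_py:.4f}s, numpy argmin: {t_np:.4f}s")
```

Both functions return the first point with the smallest distance, so their results can be compared directly before trusting the timings.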