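The idea is to convert both columns (Series) to DataFrames with Series.str.split, cross join them with DataFrame.merge, sum the absolute differences of the two components, select the row with the smallest difference per group with DataFrameGroupBy.idxmin and DataFrame.loc, and finally build the output columns with DataFrame.pop, which returns a column and removes it at the same time: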
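# Split each string on 'x' into two integer columns (0 and 1)
df11 = df1['col'].str.split('x', expand=True).astype(int)
df22 = df2['col'].str.split('x', expand=True).astype(int)
# Cross join through a constant helper key; overlapping columns get the suffixes _x and _y
df = df11.assign(a=1).merge(df22.assign(a=1), on='a')
# Manhattan distance: sum of the absolute differences of both components
df['diff'] = df['0_x'].sub(df['0_y']).abs() + df['1_x'].sub(df['1_y']).abs()
# Keep the row with the smallest difference for each value from df1
df = df.loc[df.groupby(['0_x','1_x'])['diff'].idxmin()]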
# Rebuild the strings: a is the value from df1, b is its closest match from df2
df['a'] = df.pop('0_x').astype(str).str.cat(df.pop('1_x').astype(str), sep='x')
df['b'] = df.pop('0_y').astype(str).str.cat(df.pop('1_y').astype(str), sep='x')
print(df)
             a  diff          b
1      320x480     0    320x480
28     382x215   197    320x350
16     480x320     0    480x320
45     676x320    76    640x360
59     694x320    94    640x360
106  1080x2020  1196  1280x1024
78   1080x2123  1299  1280x1024
97   2094x1080   674   1600x900
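As a variation, on pandas 1.2 or later the constant helper key can be replaced with `merge(how='cross')`. What follows is only a minimal sketch under stated assumptions: the two sample frames are a shortened, made-up subset of the data, and the helper `split_resolution` plus the column names `src`, `ref`, `src_w`, `src_h`, `ref_w`, `ref_h` are illustrative names that are not part of the original code.

```python
import pandas as pd

# Assumed sample data (a shortened subset, for illustration only)
df1 = pd.DataFrame({'col': ['320x480', '480x320', '382x215', '676x320']})
df2 = pd.DataFrame({'col': ['320x350', '320x480', '480x320', '640x360']})

def split_resolution(s, prefix):
    """Hypothetical helper: split 'AxB' strings into two named integer columns."""
    out = s.str.split('x', expand=True).astype(int)
    out.columns = [f'{prefix}_w', f'{prefix}_h']
    return out

# Keep the original strings next to the numeric parts so nothing has to be rebuilt later
left = split_resolution(df1['col'], 'src').assign(src=df1['col'])
right = split_resolution(df2['col'], 'ref').assign(ref=df2['col'])

# Cross join without a helper key (requires pandas >= 1.2)
pairs = left.merge(right, how='cross')

# Manhattan distance between every df1 value and every df2 value
pairs['diff'] = ((pairs['src_w'] - pairs['ref_w']).abs()
                 + (pairs['src_h'] - pairs['ref_h']).abs())

# Row with the smallest distance for each df1 value, in the original order
best = pairs.groupby('src', sort=False)['diff'].idxmin()
closest = pairs.loc[best, ['src', 'diff', 'ref']].reset_index(drop=True)
print(closest)
```

Carrying the original strings through the join avoids the pop-and-concatenate step, at the cost of two extra columns in the intermediate frame.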
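The same approach works with Euclidean distance. On the sample data the matches are identical except for 382x215, whose closest value becomes 480x320 instead of 320x350: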
import numpy as np

df11 = df1['col'].str.split('x', expand=True).astype(int)
df22 = df2['col'].str.split('x', expand=True).astype(int)
df = df11.assign(a=1).merge(df22.assign(a=1), on='a')
# Euclidean distance: square root of the sum of the squared differences
df['diff'] = np.sqrt(df['0_x'].sub(df['0_y']) ** 2 + df['1_x'].sub(df['1_y']) ** 2)
df = df.loc[df.groupby(['0_x','1_x'])['diff'].idxmin()]
# Rebuild the strings: a is the value from df1, b is its closest match from df2
df['a'] = df.pop('0_x').astype(str).str.cat(df.pop('1_x').astype(str), sep='x')
df['b'] = df.pop('0_y').astype(str).str.cat(df.pop('1_y').astype(str), sep='x')
print(df)
             a         diff          b
1      320x480     0.000000    320x480
30     382x215   143.627992    480x320
16     480x320     0.000000    480x320
45     676x320    53.814496    640x360
59     694x320    67.201190    640x360
106  1080x2020  1015.881883  1280x1024
78   1080x2123  1117.050133  1280x1024
97   2094x1080   525.771814   1600x900
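To see why the two metrics can pick different matches, it may help to compute both distances by hand for 382x215 against the two candidates involved, 320x350 and 480x320. This is only an illustrative check; the names `src`, `candidates` and `results` are made up for the example.

```python
import numpy as np

src = np.array([382, 215])
candidates = {
    '320x350': np.array([320, 350]),
    '480x320': np.array([480, 320]),
}

results = {}
for name, ref in candidates.items():
    delta = src - ref
    manhattan = int(np.abs(delta).sum())             # |dx| + |dy|
    euclidean = float(np.sqrt((delta ** 2).sum()))   # sqrt(dx^2 + dy^2)
    results[name] = (manhattan, euclidean)
    print(name, manhattan, euclidean)

# Each metric has its own winner
print(min(results, key=lambda k: results[k][0]))  # smallest Manhattan distance
print(min(results, key=lambda k: results[k][1]))  # smallest Euclidean distance
```

The Manhattan distances are 197 and 203, so 320x350 wins there, while squaring penalizes the single large difference of 135 more heavily and makes 480x320 the closer value in the Euclidean sense.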
Another solution, with NumPy broadcasting:
import numpy as np

df11 = df1['col'].str.split('x', expand=True).astype(int)
df22 = df2['col'].str.split('x', expand=True).astype(int)
# Euclidean distance matrix via broadcasting: rows are df1 values, columns are df2 values
a1 = np.sqrt(np.square(df11[0].to_numpy()[:, None] - df22[0].to_numpy()) +
             np.square(df11[1].to_numpy()[:, None] - df22[1].to_numpy()))
# For each df1 row, take the df2 value at the position of the smallest distance
df1['b1'] = df2['col'].to_numpy()[np.argmin(a1, axis=1)]
# The same with Manhattan distance
a2 = (np.abs(df11[0].to_numpy()[:, None] - df22[0].to_numpy()) +
      np.abs(df11[1].to_numpy()[:, None] - df22[1].to_numpy()))
df1['b2'] = df2['col'].to_numpy()[np.argmin(a2, axis=1)]
print(df1)
         col         b1         b2
1    320x480    320x480    320x480
2    480x320    480x320    480x320
3    382x215    480x320    320x350
4    676x320    640x360    640x360
5    694x320    640x360    640x360
6  1080x2123  1280x1024  1280x1024
7  2094x1080   1600x900   1600x900
8  1080x2020  1280x1024  1280x1024
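If both metrics are needed in several places, the broadcasting step could be wrapped in a small function. The sketch below is one possible way to do that, not part of the original solution: the function name `closest_resolution`, its `metric` parameter and the sample Series are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def closest_resolution(src, ref, metric='euclidean'):
    """Hypothetical helper: for each 'AxB' string in src, return the closest one in ref."""
    a = src.str.split('x', expand=True).astype(int).to_numpy()  # shape (n, 2)
    b = ref.str.split('x', expand=True).astype(int).to_numpy()  # shape (m, 2)
    # Broadcasting (n, 1, 2) against (1, m, 2) gives all pairwise differences: (n, m, 2)
    delta = a[:, None, :] - b[None, :, :]
    if metric == 'euclidean':
        dist = np.sqrt((delta ** 2).sum(axis=2))
    elif metric == 'manhattan':
        dist = np.abs(delta).sum(axis=2)
    else:
        raise ValueError(f'unknown metric: {metric!r}')
    # Position of the smallest distance in each row selects the matching ref value
    return ref.to_numpy()[dist.argmin(axis=1)]

# Assumed sample data (a shortened subset, for illustration only)
src = pd.Series(['382x215', '676x320'])
ref = pd.Series(['320x350', '480x320', '640x360'])

b1 = closest_resolution(src, ref, metric='euclidean')
b2 = closest_resolution(src, ref, metric='manhattan')
print(b1, b2)
```

The full distance matrix has len(src) × len(ref) entries, so this form suits moderately sized inputs; for very large inputs the work would have to be done in chunks.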