【Posted】:2020-11-07 02:33:45
【Question】:
I am trying to concatenate two pandas DataFrames, but the result comes out wrong.
The initial dataset looks like this:
df
>>>
well qoil cum_oil wct top_perf bot_perf st x y
5233 101 259 3.684131e+05 97 -2352.13 -2359.12 0 517228 5931024
12786 102 3495 1.369303e+06 5.47 -2352.92 -2566.81 0 517192 5927187
13062 103 2691 1.353718e+06 0.5 -2377.93 -2581.73 0 517731 5926430
. . . .
65 rows × 9 columns
Then I compute the Euclidean distance between every pair of wells from the x and y coordinates (the last two columns):
import pandas as pd
from sklearn.neighbors import DistanceMetric
dist = DistanceMetric.get_metric('euclidean')
loc = pd.DataFrame(dist.pairwise(df[['x','y']].to_numpy()),
                   columns=df.well.unique(), index=df.well.unique())
and get a 65x65 matrix (of type pandas.core.frame.DataFrame) containing the distances between every pair of wells:
loc
>>>
101 102 103 . . .
101 0.000000 152.278917 270.835312 . . .
102 151.278917 0.000000 326.310146 . . .
103 270.835312 346.310146 0.000000 . . .
. . .
Then I drop the extra columns and concatenate the two DataFrames:
df_train_prep = df.drop(['well', 'wct', 'x', 'y'], axis=1)
df2 = pd.concat([df_train_prep, loc], axis=1)
As a result, instead of a DataFrame with 65 rows x (9 + 65) columns, I get one with 130 rows x 70 columns, for example:
df2
>>>
qoil cum_oil top_perf bot_perf st 101 102 103 . . .
236 0.001 542681.0 -2427.66 -2539.25 0.0 NaN NaN NaN NaN NaN ...
258 2291 292356.0 -2537.38 -2657.02 1.0 NaN NaN NaN NaN NaN ...
537 3290 237163.0 -2714.32 -2741.49 0.0 NaN NaN NaN NaN NaN ...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
101 NaN NaN NaN NaN NaN 0.000000 157.278917 280.835312 323.423701 ...
102 NaN NaN NaN NaN NaN 154.278917 0.000000 356.310146 210.348200 518.786999 ...
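It looks like some of the data was joined on the right, but some of it was pushed to the bottom. Strange NaN values also appeared. Please help me understand what I am doing wrong.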
【Discussion】:
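One likely explanation, offered as a sketch rather than a confirmed diagnosis: pd.concat(..., axis=1) aligns rows by index label, not by position. df keeps its original labels (5233, 12786, ...), while loc is indexed by well number (101, 102, ...), so the two frames share no labels and the result is the union of both indexes: 65 + 65 = 130 rows, padded with NaN. The example below reproduces this on a made-up three-well frame (the reduced column set and the NumPy distance computation are assumptions standing in for the real data and DistanceMetric), then relabels the distance matrix before concatenating.

```python
import numpy as np
import pandas as pd

# Hypothetical three-well stand-in for the real 65-row frame; the row labels
# (5233, 12786, 13062) mimic the non-sequential index shown in the question.
df = pd.DataFrame(
    {
        "well": [101, 102, 103],
        "qoil": [259, 3495, 2691],
        "x": [517228, 517192, 517731],
        "y": [5931024, 5927187, 5926430],
    },
    index=[5233, 12786, 13062],
)

# Pairwise Euclidean distances via NumPy broadcasting (an assumed stand-in
# for DistanceMetric.pairwise), labelled by well number as in the question.
xy = df[["x", "y"]].to_numpy(dtype=float)
dist = np.sqrt(((xy[:, None, :] - xy[None, :, :]) ** 2).sum(axis=-1))
wells = df["well"].to_numpy()
loc = pd.DataFrame(dist, index=wells, columns=wells)

features = df.drop(columns=["well", "x", "y"])

# The index labels differ (5233... vs 101...), so concat takes their union:
# twice as many rows, with NaN wherever one side has no matching label.
misaligned = pd.concat([features, loc], axis=1)

# Give the distance matrix the same row labels as df before concatenating.
loc_aligned = loc.copy()
loc_aligned.index = df.index
aligned = pd.concat([features, loc_aligned], axis=1)

print(misaligned.shape)
print(aligned.shape)
```

This relies on the rows of loc being in the same order as the rows of df, which holds when the matrix is built directly from df[['x','y']]. Making well the index of df before concatenating would be another way to line the two frames up.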
Tags: python pandas dataframe scikit-learn euclidean-distance