【Question title】:Concatenate pandas DataFrames with priority replacement of NaN
【Posted】:2019-11-17 12:20:40
【Question】:

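I have data collected from a series of measuring instruments whose coverage overlaps. I want to merge them into a single pandas data structure in which, for each column, the most recent available data takes priority if it is not NaN; otherwise the older data is kept.

The code below produces the expected output, but it is a lot of code for such a simple task. In addition, the last step relies on identifying duplicated index values, and I am not sure I can depend on the "last" part, because df.combine_first(other) reorders the data. Is there a more compact, more efficient and/or more predictable way to do this?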

import numpy as np
import pandas as pd

# set up the data
df0 = pd.DataFrame({"x": [0., 1., 2., 3., 4.], "y": [0., 1., 2., 3., np.nan], "t": [0, 1, 2, 3, 4]})   # oldest/lowest priority
df1 = pd.DataFrame({"x": [np.nan, 4.1, 5.1, 6.1], "y": [3.1, 4.1, 5.1, 6.1], "t": [3, 4, 5, 6]})
df2 = pd.DataFrame({"x": [8.2, 10.2], "t": [8, 10]})   # newest/highest priority
df0.set_index("t",inplace=True)
df1.set_index("t",inplace=True)
df2.set_index("t",inplace=True)

# this concatenates, leaving redundant indices in df0, df1, df2
dfmerge = pd.concat((df0,df1,df2),sort=True)
print("dfmerge, with duplicate rows and interlaced NaN data")
print(dfmerge)

# Now apply, in priority order, each of the original dataframes to fill the original
dfmerge2 = dfmerge.copy()
for ddf in (df2,df1,df0):
    dfmerge2 = dfmerge2.combine_first(ddf)
print("\ndfmerge2, fillable NaNs filled but duplicate indices now reordered")
print(dfmerge2)   # row order has changed unpredictably

# finally, drop duplicate indices
dfmerge3 = dfmerge2.copy()
dfmerge3 = dfmerge3.loc[~dfmerge3.index.duplicated(keep='last')]
print("dfmerge3, final")
print(dfmerge3)

Its output looks like this:

dfmerge, with duplicate rows and interlaced NaN data
       x    y
t            
0    0.0  0.0
1    1.0  1.0
2    2.0  2.0
3    3.0  3.0
4    4.0  NaN
3    NaN  3.1
4    4.1  4.1
5    5.1  5.1
6    6.1  6.1
8    8.2  NaN
10  10.2  NaN

dfmerge2, fillable NaNs filled but duplicate indices now reordered
       x    y
t            
0    0.0  0.0
1    1.0  1.0
2    2.0  2.0
3    3.0  3.0
3    3.0  3.1
4    4.0  4.1
4    4.1  4.1
5    5.1  5.1
6    6.1  6.1
8    8.2  NaN
10  10.2  NaN

dfmerge3, final
       x    y
t            
0    0.0  0.0
1    1.0  1.0
2    2.0  2.0
3    3.0  3.1
4    4.1  4.1
5    5.1  5.1
6    6.1  6.1
8    8.2  NaN
10  10.2  NaN

【Question discussion】:

    Tags: python pandas numpy dataframe merge


    【Solution 1】:

    In your case:

    s=pd.concat([df0,df1,df2],sort=False)
    s[:]=np.sort(s,axis=0)
    s=s.dropna(thresh=1)
    s
          x    y
    t           
    0   0.0  0.0
    1   1.0  1.0
    2   2.0  2.0
    3   3.0  3.0
    4   4.0  3.1
    3   4.1  4.1
    4   5.1  5.1
    5   6.1  6.1
    6   8.2  NaN
    8  10.2  NaN
    

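    【Discussion】:

    • Thanks. The idea of using a sort to help looked promising, but it turns out to be specific to the sample data. Even if it were not, it changes the index, and my index is meaningful (eventually it will be a timestamp).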
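
For what it is worth, here is a minimal sketch of two ways the rule described in the question (newest non-NaN value wins, per column, with the index left alone) might be written more compactly. It assumes the frames are listed from oldest to newest and that each individual frame has a unique index; the names `frames`, `merged` and `merged_alt` are only illustrative and do not come from the original post.

```python
from functools import reduce

import numpy as np
import pandas as pd

# Same sample data as in the question, indexed by "t".
df0 = pd.DataFrame({"x": [0., 1., 2., 3., 4.],
                    "y": [0., 1., 2., 3., np.nan],
                    "t": [0, 1, 2, 3, 4]}).set_index("t")   # oldest
df1 = pd.DataFrame({"x": [np.nan, 4.1, 5.1, 6.1],
                    "y": [3.1, 4.1, 5.1, 6.1],
                    "t": [3, 4, 5, 6]}).set_index("t")
df2 = pd.DataFrame({"x": [8.2, 10.2],
                    "t": [8, 10]}).set_index("t")            # newest

# Assumption: the list is ordered oldest -> newest.
frames = [df0, df1, df2]

# Option A: fold the list so that each newer frame is patched with the
# result so far. newer.combine_first(older) keeps the newer value unless
# it is NaN, in which case the older value is used.
merged = reduce(lambda older, newer: newer.combine_first(older), frames)

# Option B: stack the frames oldest-first, then take the last non-null
# value of every column within each index label.
merged_alt = pd.concat(frames).groupby(level=0).last()

print(merged)
```

Both variants return one row per index label, sorted by the index, so no separate `index.duplicated(keep='last')` step is needed. On the sample data they reproduce the `dfmerge3` table from the question.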