Vectorize matching of two DataFrames by DatetimeIndex comparison

【Question Title】: Vectorize matching of two DataFrames by DatetimeIndex comparison
【Posted】: 2021-08-31 13:54:05
【Question】:

I have two DataFrames, both sorted by time and indexed by a DatetimeIndex (timestamp), as shown below.

df1
timestamp                            price  side  
2021-08-27 12:45:00.475100160+00:00  47.34  
2021-08-27 12:45:00.475100160+00:00  47.02 
2021-08-27 12:45:00.488067957+00:00  47.18 
2021-08-27 12:45:00.779297294+00:00  47.26 
2021-08-27 12:45:00.779297294+00:00  47.27 

df2
timestamp                            bid_price  ask_price   
2021-08-27 12:44:59.740064471+00:00  47.08  47.34
2021-08-27 12:45:00.475100160+00:00  47.02  47.34
2021-08-27 12:45:00.914411789+00:00  47.02  47.26
2021-08-27 12:45:00.915470114+00:00  47.02  47.34

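I need to compare the DatetimeIndex of each row in the first DataFrame (df1) against the DatetimeIndex of the second DataFrame (df2). The most recent row in df2 whose timestamp is at or before the df1 row's timestamp is then used to evaluate df2.bid_price and df2.ask_price against df1.price. If df1.price == df2.bid_price, write 'Bid' to the df1.side column. If df1.price == df2.ask_price, write 'Ask' to df1.side. If df1.price lies between df2.bid_price and df2.ask_price, write 'Inside' to df1.side; otherwise write 'Outside'.

My code below does this in the least efficient way possible: it iterates over every row of df1 and compares it against df2. In short, it takes a very long time once I go beyond 10-20k rows. I am looking for a more efficient way to do this.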

# Column position of 'side'; setting by position avoids chained assignment
side_col = df1.columns.get_loc('side')
for x in range(len(df1)):
    price = df1.price.iloc[x]
    # Latest quote at or before this row's timestamp
    quote = df2[df2.index <= df1.index[x]][['bid_price', 'ask_price']].iloc[-1]
    if price == quote.bid_price:
        df1.iloc[x, side_col] = 'Bid'
    elif price == quote.ask_price:
        df1.iloc[x, side_col] = 'Ask'
    elif quote.bid_price < price < quote.ask_price:
        df1.iloc[x, side_col] = 'Inside'
    else:
        df1.iloc[x, side_col] = 'Outside'

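【Comments】:

I think what you can do is use the index of df1 to find the rows you want in df2 with a reverse interpolation, and from there you can join the two directly. I will try to dig up something similar I did a while ago.
Thanks David. I am hoping for suggestions on how to speed this up significantly, since I sometimes work with 100k+ data points.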
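The lookup suggested in the comment above could be sketched with a sorted-index search. This is only an illustration, not the commenter's actual code: it rebuilds the sample data from the question, assumes both indexes are sorted and share the same timezone, and uses `Index.searchsorted` to find, for every df1 timestamp at once, the position of the last df2 row at or before it. The variable names (`pos`, `bid`, `ask`) are made up for this example.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question; the DatetimeIndex is tz-aware (UTC).
df1 = pd.DataFrame(
    {"price": [47.34, 47.02, 47.18, 47.26, 47.27]},
    index=pd.to_datetime([
        "2021-08-27 12:45:00.475100160+00:00",
        "2021-08-27 12:45:00.475100160+00:00",
        "2021-08-27 12:45:00.488067957+00:00",
        "2021-08-27 12:45:00.779297294+00:00",
        "2021-08-27 12:45:00.779297294+00:00",
    ]),
)
df2 = pd.DataFrame(
    {"bid_price": [47.08, 47.02, 47.02, 47.02],
     "ask_price": [47.34, 47.34, 47.26, 47.34]},
    index=pd.to_datetime([
        "2021-08-27 12:44:59.740064471+00:00",
        "2021-08-27 12:45:00.475100160+00:00",
        "2021-08-27 12:45:00.914411789+00:00",
        "2021-08-27 12:45:00.915470114+00:00",
    ]),
)

# side="right" gives the insertion point after any equal timestamps,
# so subtracting 1 yields the last df2 row at or before each df1 timestamp.
pos = df2.index.searchsorted(df1.index, side="right") - 1

# A position of -1 means no quote exists yet for that row; fail loudly here
# rather than silently wrapping around to the last quote.
if (pos < 0).any():
    raise ValueError("some df1 rows precede the first row of df2")

bid = df2["bid_price"].to_numpy()[pos]
ask = df2["ask_price"].to_numpy()[pos]
price = df1["price"].to_numpy()

# Conditions are evaluated in order; the first match wins.
df1["side"] = np.select(
    [price == bid, price == ask, (price > bid) & (price < ask)],
    ["Bid", "Ask", "Inside"],
    default="Outside",
)
print(df1)
```

Because the search and the comparisons both run over whole arrays, there is no Python-level loop over rows, and that loop is where the original code spends its time.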

Tags: python python-3.x pandas dataframe

【Solution 1】:

Here is a working solution that uses pandas.merge_asof to merge on the timestamps and numpy.select to match the various conditions:

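import numpy as np
import pandas as pd

# This assumes 'timestamp' is a regular, sorted column in both frames;
# if it is the index, call reset_index() on df1 and df2 first.
df3 = pd.merge_asof(df1, df2, on='timestamp', direction='backward')
# Conditions are checked in order, so exact Bid/Ask matches take priority over 'Inside'.
df3['side'] = np.select([df3['price'] == df3['bid_price'],
                         df3['price'] == df3['ask_price'],
                         df3['price'].between(df3['bid_price'], df3['ask_price'])
                         ],
                        ['Bid', 'Ask', 'Inside'],
                        default='Outside'
                        )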

Output:

>>> df3
                            timestamp  price    side  bid_price  ask_price
0 2021-08-27 12:45:00.475100160+00:00  47.34     Ask      47.02      47.34
1 2021-08-27 12:45:00.475100160+00:00  47.02     Bid      47.02      47.34
2 2021-08-27 12:45:00.488067957+00:00  47.18  Inside      47.02      47.34
3 2021-08-27 12:45:00.779297294+00:00  47.26  Inside      47.02      47.34
4 2021-08-27 12:45:00.779297294+00:00  47.27  Inside      47.02      47.34

Note: if needed, you can drop the intermediate columns: df3.drop(['bid_price', 'ask_price'], axis=1)
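Since the question keeps the timestamps in a DatetimeIndex rather than in a column, one possible variant is to merge on the indexes directly with `left_index=True` and `right_index=True`. The sketch below is an illustration under that assumption, not part of the original answer; it rebuilds the sample frames so it can run on its own, and the name `merged` is made up for the example.

```python
import numpy as np
import pandas as pd

# Sample frames from the question, with the timestamps as a tz-aware index.
df1 = pd.DataFrame(
    {"price": [47.34, 47.02, 47.18, 47.26, 47.27]},
    index=pd.to_datetime([
        "2021-08-27 12:45:00.475100160+00:00",
        "2021-08-27 12:45:00.475100160+00:00",
        "2021-08-27 12:45:00.488067957+00:00",
        "2021-08-27 12:45:00.779297294+00:00",
        "2021-08-27 12:45:00.779297294+00:00",
    ]),
)
df2 = pd.DataFrame(
    {"bid_price": [47.08, 47.02, 47.02, 47.02],
     "ask_price": [47.34, 47.34, 47.26, 47.34]},
    index=pd.to_datetime([
        "2021-08-27 12:44:59.740064471+00:00",
        "2021-08-27 12:45:00.475100160+00:00",
        "2021-08-27 12:45:00.914411789+00:00",
        "2021-08-27 12:45:00.915470114+00:00",
    ]),
)

# Both indexes must be sorted; 'backward' takes the last df2 row whose
# timestamp is at or before each df1 timestamp.
merged = pd.merge_asof(
    df1, df2, left_index=True, right_index=True, direction="backward"
)

# Same ordered conditions as the solution above: exact matches first.
merged["side"] = np.select(
    [merged["price"] == merged["bid_price"],
     merged["price"] == merged["ask_price"],
     merged["price"].between(merged["bid_price"], merged["ask_price"])],
    ["Bid", "Ask", "Inside"],
    default="Outside",
)

# Drop the helper columns; the result keeps df1's DatetimeIndex.
result = merged.drop(["bid_price", "ask_price"], axis=1)
print(result)
```

The result keeps df1's original index, so no `reset_index()` / `set_index()` round trip is needed.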

【Comments】: