Search optimization with pandas multi-index: answers

【Question title】: Search optimization with pandas multi-index
【Posted】: 2018-12-28 18:36:42
【Question】:

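I would like to know whether there is a way to optimize a search I am doing. I have a multi-index (3-level) data frame df that looks like this: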

IndexID IndexDateTime IndexAttribute ColumnA ColumnB
   1      2015-02-05        8           A       B
   1      2015-02-05        7           C       D
   1      2015-02-10        7           X       Y

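My problem is this: for a given date, e.g. 2015-02-10, I want to know whether ColumnA holds data for the same IndexID and IndexAttribute a fixed number of days earlier (5 in this case). If it does, I want to fetch that value and add it to a new column, like this: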

IndexID IndexDateTime IndexAttribute ColumnA ColumnB NewColumn
   1      2015-02-05        8           A       B       -1
   1      2015-02-05        7           C       D       -1
   1      2015-02-10        7           X       Y        C
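
For anyone who wants to reproduce this, here is a minimal sketch that rebuilds the three sample rows and does the lookup for a single key. It assumes the three index levels are named exactly as in the table and that each (IndexID, IndexDateTime, IndexAttribute) combination is unique; the variable names key and value are only illustrative.

```python
import pandas as pd

# Rebuild the three sample rows from the question as a 3-level MultiIndex frame.
df = pd.DataFrame(
    {
        "IndexID": [1, 1, 1],
        "IndexDateTime": pd.to_datetime(["2015-02-05", "2015-02-05", "2015-02-10"]),
        "IndexAttribute": [8, 7, 7],
        "ColumnA": ["A", "C", "X"],
        "ColumnB": ["B", "D", "Y"],
    }
).set_index(["IndexID", "IndexDateTime", "IndexAttribute"])

# Key to search for: same ID and attribute, five days before 2015-02-10.
key = (1, pd.Timestamp("2015-02-10") - pd.Timedelta(days=5), 7)

# Fall back to -1 when the shifted key is not present in the index.
value = df.loc[key, "ColumnA"] if key in df.index else -1
print(value)
```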

I want to run this search for every row of my data frame, which has 19 million rows. The way I do it now is:

df['NewColumn'] = df.apply(lambda r: get_data(df, r.IndexID, r.IndexDateTime , r.IndexAttribute , 5), axis=1)

where get_data is:

def get_data(df, IndexID, IndexDateTime , IndexAttribute , days_before):
    idx = pd.IndexSlice
    date = (IndexDateTime - pd.to_timedelta(days_before, 'd'))
    try:
        res = df.loc[idx[IndexID, date, IndexAttribute ],'ColumnA']
        return res
    except KeyError:
        return -1

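This is very slow; it takes more than 2 hours. I would like to know whether there is a faster way. The difficulties are:

The date being searched for may or may not exist.
For each IndexDateTime I do not know how many IndexAttributes there are. They are ints, though, and they are sorted in descending order.

I cannot use shift, because I do not know how many rows lie between the two rows in question. Any ideas? Thanks!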

【Question comments】:

Tags: python-3.x pandas search optimization multi-index

【Solution 1】:

This can be done very quickly with numpy. You just need to work on the data frame's columns as numpy arrays. Hope this helps:

%time
def myfunc(df, days_before=5):

     # Fill a column with -1s
     result = -np.ones_like(df.values[:, -1:])

     # Slice the first 3 columns and shift the dates 
     # to get the index that we are looking for
     idx = np.array((df['IndexID'].values,
                     df['IndexDateTime'] - pd.to_timedelta(days_before, 'd'),
                     df['IndexAttribute'].values)).T

     # Look for days matching in the first 3 columns
     _idx_comp = df.values[:, :3][np.newaxis, :] == np.array(idx)[:, np.newaxis]

     # Get the index where there is a match
     # between the row of the dataframe and the desired searched rows
     idx_found = np.where(np.all(_idx_comp, axis=-1))

     # Assign the corresponding rows to its required value
     result[idx_found[0]] = df['ColumnA'].values[idx_found[-1]]

     return result

df.assign(NewColumn=myfunc(df))

CPU times: user 2 µs, sys: 1e+03 ns, total: 3 µs
Wall time: 5.96 µs

   IndexID IndexDateTime  IndexAttribute ColumnA ColumnB NewColumn
0        1    2015-02-05               8       A       B        -1
1        1    2015-02-05               7       C       D        -1
2        1    2015-02-10               7       X       Y         C

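【Comments】:

I will test it when I have time and post the results. Thank you, really!
It works well, except that I get an error at result[idx_found[0]] = df['ColumnA'].values[idx_found[-1]], because you are trying to assign a row of shape (X,) into a column of shape (X, 1). I reshaped the row, but this is a nice way to do the search. Now it is time to test.
That is strange, I do not get any error. I used the same data frame, without the multi-index, and with the dates converted to datetime via df['IndexDateTime'] = df['IndexDateTime'].apply(pd.to_datetime). It may be version-related.
I am using pandas 0.23.4, numpy 1.15.3 and Python 3.6.2.
Pandas 0.23.4, NumPy 1.15.2, Python 3.6.0. It is strange, but anyway, just letting you know that this takes about 10 minutes. A great improvement!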
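
One way to sidestep the shape mismatch discussed above is to keep the result one-dimensional from the start. The following is only a sketch of that idea, not the answerer's exact code: the function name lookup_days_before is made up, the frame is assumed to be flat (no multi-index) with IndexDateTime already converted to datetime, and the n x n comparison matrix means it only suits frames small enough to fit in memory.

```python
import numpy as np
import pandas as pd

def lookup_days_before(df, days_before=5):
    # 1-D result pre-filled with -1; object dtype so it can also hold strings.
    result = np.full(len(df), -1, dtype=object)

    # On pandas older than 0.24, use .values instead of .to_numpy().
    ids = df["IndexID"].to_numpy()
    attrs = df["IndexAttribute"].to_numpy()
    dates = df["IndexDateTime"].to_numpy()
    wanted = dates - np.timedelta64(days_before, "D")

    # match[i, j] is True when row j holds the key that row i is looking for.
    match = (
        (ids[:, None] == ids[None, :])
        & (attrs[:, None] == attrs[None, :])
        & (wanted[:, None] == dates[None, :])
    )
    rows, hits = np.nonzero(match)

    # Both sides are 1-D here, so no reshape is needed.
    result[rows] = df["ColumnA"].to_numpy(dtype=object)[hits]
    return result

flat = pd.DataFrame(
    {
        "IndexID": [1, 1, 1],
        "IndexDateTime": pd.to_datetime(["2015-02-05", "2015-02-05", "2015-02-10"]),
        "IndexAttribute": [8, 7, 7],
        "ColumnA": ["A", "C", "X"],
        "ColumnB": ["B", "D", "Y"],
    }
)
flat = flat.assign(NewColumn=lookup_days_before(flat))
print(flat)
```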

【Solution 2】:

This is an O(m·n) solution, but it should be faster than the original one:

l = []
for _, y in df.groupby(level=[0, 2], sort=False):
    s = y.index.get_level_values(level=1).values
    l.append(((s - s[:, None]) / np.timedelta64(1, 'D') == -5).dot(y.ColumnA.values))

df['NewCOL'] = np.concatenate(l)
df

Out[48]: 
                                     ColumnA ColumnB NewCOL
IndexID IndexDateTime IndexAttribute                       
1       2015-02-05    8                    A       B       
                      7                    C       D       
        2015-02-10    7                    X       Y      C
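
To make the broadcasting step less opaque, here is a small sketch of what happens inside the loop for a single group, namely IndexID 1 with IndexAttribute 7 from the sample data. The arrays s and values are typed in by hand for illustration; in the answer they come from the group y.

```python
import numpy as np

# Dates and ColumnA values of one (IndexID, IndexAttribute) group.
s = np.array(["2015-02-05", "2015-02-10"], dtype="datetime64[ns]")
values = np.array(["C", "X"], dtype=object)

# Broadcasting builds all pairwise differences: diff_days[i, j] = s[j] - s[i], in days.
diff_days = (s - s[:, None]) / np.timedelta64(1, "D")

# mask[i, j] is True when row j lies exactly 5 days before row i.
mask = diff_days == -5

# The dot product multiplies each value by True/False and adds the pieces up,
# so every row keeps the matching value, or an empty string when nothing matches.
picked = mask.dot(values)
print(diff_days)
print(mask)
print(picked)
```

Rows without a match therefore end up as an empty string rather than -1, which is why the first two rows of NewCOL are blank in the output above.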

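【Comments】:

I sort of understand what you are doing there, but the append line confuses me. Could you explain it?
@Soutuyo The append is there to store the result of each pass of the for loop; the comparison itself relies on what is called numpy broadcasting.
I will check it carefully later and report the timing, but that division by np.timedelta64(1, 'D') followed by == -5 has me a bit lost.
OK, if I am not mistaken, you are checking every date within each group of IndexID and IndexAttribute. Where there is a 5-day difference, you do the multiplication to keep only that True value, right?
I have been testing this code, and when there are more IndexDateTime values than IndexAttribute values, the returned data is wrong in some cases. Each IndexDateTime can have 1 to x IndexAttributes, so you cannot append s like this.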
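
A possible alternative that does not depend on how many rows each group has, and that is not taken from either answer, is to shift a copy of the keys by the day offset and left-join it back onto the frame. This is a rough sketch under the assumption that every (IndexID, IndexDateTime, IndexAttribute) combination is unique, since duplicates would multiply rows in the join; the function name add_value_days_before and its parameters are invented for the example.

```python
import pandas as pd

def add_value_days_before(df, days_before=5, fill=-1):
    keys = ["IndexID", "IndexDateTime", "IndexAttribute"]
    flat = df.reset_index()

    # Move every row's date forward, so it lines up with the row that would look it up.
    lookup = (
        flat[keys + ["ColumnA"]]
        .rename(columns={"ColumnA": "NewColumn"})
        .assign(IndexDateTime=lambda d: d["IndexDateTime"] + pd.Timedelta(days=days_before))
    )

    # A left join keeps every original row, in its original order.
    merged = flat.merge(lookup, on=keys, how="left")

    # Rows with no partner come back as missing; replace them with the fill value.
    new_col = merged["NewColumn"].astype(object)
    merged["NewColumn"] = new_col.where(new_col.notna(), fill)
    return merged.set_index(keys)

df = pd.DataFrame(
    {
        "IndexID": [1, 1, 1],
        "IndexDateTime": pd.to_datetime(["2015-02-05", "2015-02-05", "2015-02-10"]),
        "IndexAttribute": [8, 7, 7],
        "ColumnA": ["A", "C", "X"],
        "ColumnB": ["B", "D", "Y"],
    }
).set_index(["IndexID", "IndexDateTime", "IndexAttribute"])

result = add_value_days_before(df)
print(result)
```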