Dropping rows fast from a multi-indexed data frame conditional on the level 1 index: answers

【Question title】: Dropping rows fast from a multi-indexed data frame conditional on the level 1 index
【Posted】: 2014-09-27 05:46:56
【Question】:

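I am having trouble quickly dropping rows from a multi-indexed pandas DataFrame where the drop criterion is based on the level 1 index. Or, equivalently, with building a multi-indexed DataFrame by appending single-indexed DataFrames or NumPy arrays as rows. In my concrete example, I have a DataFrame called "watch" that looks like this: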

                             userid   watchers
repositoryid         date       
     5910995   1348293168   1449180          1
     5911012   1348292421   2627657          1
     5911046   1367171000   1219404          1
               1368722792   1892225          2
               1383586883   2150178          3
     5911088   1348302179   1521780          1
         ...

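where repositoryid and date are levels 0 and 1 of the MultiIndex, and userid and watchers are the data columns. So for each repositoryid I essentially have a time series of events in which users started watching the repository. For each repositoryid I also know a specific creation date, stored elsewhere. Now I would like to drop all rows with date > creationdate+timewindow, where timewindow is some constant.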

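I tried the drop() function, but it is very slow. I think a boolean mask would be the best solution, but I could not get it to work with the MultiIndex. I also made several attempts to build a new DataFrame from scratch, the most recent being: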

watch_new = DataFrame(columns=['date', 'userid', 'watchers'])
for i,rid in enumerate(watch.index.get_level_values('repositoryid')):
    creationdate = repository.loc[rid].date.squeeze()
    thistimeseries = watch.loc[rid]
    thistimeseries = thistimeseries[thistimeseries.index <= creationdate+timewindow]
    thistimeseries.reset_index(inplace=True)
    if len(thistimeseries) != 0:
        watch_new.loc[rid] = thistimeseries.as_matrix()

Unfortunately, as soon as thistimeseries.as_matrix() has more than one row (10 rows in this case), I get an error message like this:

ValueError: could not broadcast input array from shape (10,3) into shape (3)

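So my questions are: 1a) how do I quickly drop rows from a multi-indexed DataFrame conditional on the level 1 index, or equivalently 1b) how do I insert a single-indexed DataFrame into a multi-indexed DataFrame, and 2) is this even the best (that is, fastest) way to solve my problem, or should I try a completely different approach?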

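(I also tried not using an index at all, but that was too slow. I played around with join, merge, groupby and so on, but unfortunately did not manage to make them solve my problem. I have spent the last 5 days working through the excellent book "Python for Data Analysis" and searching online for a solution, again without success. I am hoping that an advanced pandas user has an elegant solution to this seemingly simple problem. Many thanks in advance!)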

【Comments】:

Tags: python pandas

【Solution 1】:

Start with this setup:

import pandas as pd
df = pd.read_table('data', sep=r'\s+').set_index(['repositoryid', 'date'])
repository = pd.read_table('data2', sep=r'\s+').set_index(['repositoryid'])
timewindow = 100

Suppose we have df:

                          userid  watchers
repositoryid date                         
5910995      1348293168  1449180         1
5911012      1348292421  2627657         1
5911046      1367171000  1219404         1
             1368722792  1892225         2
             1383586883  2150178         3
5911088      1348302179  1521780         1

and repository:

                    date
repositoryid            
5910995       1348293200
5911012       1348292400
5911046       1368722800
5911088       1348303000

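At the time of this answer (2014), pandas did not support merging on only some levels of a MultiIndex. So df and repository must have the same index shape before they can be joined; here that means moving the date level of df into an ordinary column so that both frames are indexed by repositoryid alone: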

df.reset_index(level='date', inplace=True)
df = df.join(repository, rsuffix='_threshold')

which yields

                    date   userid  watchers  date_threshold
repositoryid                                               
5910995       1348293168  1449180         1      1348293200
5911012       1348292421  2627657         1      1348292400
5911046       1367171000  1219404         1      1368722800
5911046       1368722792  1892225         2      1368722800
5911046       1383586883  2150178         3      1368722800
5911088       1348302179  1521780         1      1348303000

Now you can add timewindow to date_threshold:

df['date_threshold'] += timewindow

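Then compare date with date_threshold. (The question asks to drop rows with date > creationdate+timewindow, so use <= instead of < if rows lying exactly on the threshold should be kept; for the sample data the result is the same.)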

mask = df['date'] < df['date_threshold']

which produces a boolean Series such as

In [207]: mask
Out[207]: 
repositoryid
5910995          True
5911012          True
5911046          True
5911046          True
5911046         False
5911088          True
dtype: bool

With the boolean Series, selecting the desired rows with df.loc is straightforward:

In [208]: df.loc[mask]
Out[208]: 
                    date   userid  watchers  date_threshold
repositoryid                                               
5910995       1348293168  1449180         1      1348293300
5911012       1348292421  2627657         1      1348292500
5911046       1367171000  1219404         1      1368722900
5911046       1368722792  1892225         2      1368722900
5911088       1348302179  1521780         1      1348303100

Alternatively, instead of mask and df.loc, you can use query:

In [213]: df.query('date < date_threshold')
Out[213]: 
                    date   userid  watchers  date_threshold
repositoryid                                               
5910995       1348293168  1449180         1      1348293300
5911012       1348292421  2627657         1      1348292500
5911046       1367171000  1219404         1      1368722900
5911046       1368722792  1892225         2      1368722900
5911088       1348302179  1521780         1      1348303100

【Comments】:

Thank you, this works great! I think I will try to avoid MultiIndexes from now on, since they seem somewhat limited.