Quickest way to count occurrences of values in a multi-index pandas DataFrame: answers

【Question title】: Quickest way to count occurrence of values in multi-index pandas dataframe
【Posted】: 2019-06-07 03:48:48
【Question】:

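I have two multi-index dataframes with many levels and columns. I am looking for the quickest way to iterate over the dataframes and count, for each row, how many cells in each dataframe are above a specific value, and then find the intersection of the rows of the two dataframes that score at least one count.

Right now I loop through the dataframes using a combination of for loops and groupby, but it takes too long to get the right answer (my real dataframes contain thousands of levels and hundreds of columns), so I need to find a different way to do it.

For example: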

import pandas as pd

idx = pd.MultiIndex.from_product([[0, 1], [0, 1, 2]],
                                 names=['index_1', 'index_2'])
col = ['column_1', 'column_2']

values_list_a = [[1, 2], [2, 2], [2, 1], [-8, 1], [2, 0], [2, 1]]
DFA = pd.DataFrame(values_list_a, idx, col)

DFA:
                   column_1  column_2
index_1 index_2
  0       0            1        2
          1            2        2
          2            2        1
  1       0            -8       1
          1            2        0
          2            2        1

values_list_b=[[2,2],[0,1],[2,2],[2,2],[1,0],[1,2]]
DFB = pd.DataFrame(values_list_b, idx, col)

DFB:
                   column_1  column_2
index_1 index_2
  0       0            2        2
          1            0        1
          2            2        2
  1       0            2        2
          1            1        0
          2            1        2

What I expect is:

Step 1, count the occurrences:

DFA:
                   column_1  column_2 counts
index_1 index_2
  0       0            1        2       1
          1            2        2       2
          2            2        1       1
  1       0            -8       1       0
          1            2        0       1
          2            2        1       1

DFB:
                   column_1  column_2 counts
index_1 index_2
  0       0            2        2        2
          1            0        1        0
          2            2        2        2
  1       0            2        2        2
          1            1        0        0
          2            1        2        1

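Step 2: the intersection of the two dataframes where counts > 0 should create a new dataframe like this (keeping the rows of both dataframes that score at least one count at the same index, and adding a new index_0 level). index_0 = 0 should refer to DFA and index_0 = 1 should refer to DFB:

DFC:
                            column_1  column_2 counts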
  index_0 index_1 index_2
     0       0       0            1        2       1
                     2            2        1       1
             1       2            2        1       1

     1       0       0            2        2       2
                     2            2        2       2
             1       2            1        2       1

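【Comments】:

Can you provide code to create your DataFrames? Multi-indexes are hard to work with.
"I am looking for the quickest way to iterate over the dataframes and count": so your specific value is 1?
It is any cell with a value >= 2 (or greater than 1).

Tags: python python-3.x pandas dataframe count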

【Solution 1】:

`pd.concat`, then `magic`

def f(d, thresh=1):
    c = d.gt(thresh).sum(1)
    mask = c.gt(0).groupby(level=[1, 2]).transform('all')
    return d.assign(counts=c)[mask]

pd.concat({'bar': DFA, 'foo': DFB}, names=['index_0']).pipe(f)

                         column_1  column_2  counts
index_0 index_1 index_2                            
bar     0       0               1         2       1
                2               2         1       1
        1       2               2         1       1
foo     0       0               2         2       2
                2               2         2       2
        1       2               1         2       1

With comments

def f(d, thresh=1):
    # count how many are greater than a threshold `thresh` per row
    c = d.gt(thresh).sum(1)

    # find where `counts` are > `0` for both dataframes
    # conveniently dropped into one dataframe so we can do
    # this nifty `groupby` trick
    mask = c.gt(0).groupby(level=[1, 2]).transform('all')
    #                                    \-------/
    #                         This is key to broadcasting over 
    #                         original index rather than collapsing
    #                         over the index levels we grouped by

    #     create a new column named `counts`
    #         /------------\ 
    return d.assign(counts=c)[mask]
    #                         \--/
    #                    filter with boolean mask

# Use concat to smash two dataframes together into one
pd.concat({'bar': DFA, 'foo': DFB}, names=['index_0']).pipe(f)
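For readers who want to run the approach end to end, the following is a minimal self-contained sketch that rebuilds the question's DFA and DFB and applies the same concat-then-mask idea. The function name `count_and_intersect` and the use of level names instead of level positions are illustrative choices, not part of the original answer; the keys 0 and 1 follow the labels requested in the question.

```python
import pandas as pd

# Rebuild the example frames from the question.
idx = pd.MultiIndex.from_product([[0, 1], [0, 1, 2]],
                                 names=['index_1', 'index_2'])
col = ['column_1', 'column_2']
DFA = pd.DataFrame([[1, 2], [2, 2], [2, 1], [-8, 1], [2, 0], [2, 1]],
                   index=idx, columns=col)
DFB = pd.DataFrame([[2, 2], [0, 1], [2, 2], [2, 2], [1, 0], [1, 2]],
                   index=idx, columns=col)


def count_and_intersect(d, thresh=1):
    # Hypothetical helper name; mirrors the answer's function `f`.
    # Per-row count of cells strictly greater than `thresh`.
    counts = d.gt(thresh).sum(axis=1)
    # A row survives only if every frame scores at this (index_1, index_2).
    keep = counts.gt(0).groupby(level=['index_1', 'index_2']).transform('all')
    return d.assign(counts=counts)[keep]


# Keys 0 and 1 become the new outer level `index_0` (0 = DFA, 1 = DFB).
DFC = pd.concat({0: DFA, 1: DFB}, names=['index_0']).pipe(count_and_intersect)
print(DFC)
```

Grouping by level name rather than by position keeps the mask correct even if the order of the index levels changes after the concat.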

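【Discussion】:

Looks like it works. Can you explain what is going on?
@GiovanniMariaStrampelli See if that helps.
What if I want to assign values at index_0, so that instead of the default [0, 1] it becomes ['bar', 'foo']? Also, can it work along the index (instead of along the columns)? I mean, right now we check whether each row has values >= 2. What if I want to determine whether each column has values >= 2 and do the same exercise?
@GiovanniMariaStrampelli Updated the post... that should do it.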
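The column-wise variant asked about in the comments is not shown in the thread. One possible reading, sketched here under that assumption, is to count hits per column with `axis=0` and keep only the columns that score in both frames; the helper name `column_counts` is hypothetical.

```python
import pandas as pd

idx = pd.MultiIndex.from_product([[0, 1], [0, 1, 2]],
                                 names=['index_1', 'index_2'])
col = ['column_1', 'column_2']
DFA = pd.DataFrame([[1, 2], [2, 2], [2, 1], [-8, 1], [2, 0], [2, 1]],
                   index=idx, columns=col)
DFB = pd.DataFrame([[2, 2], [0, 1], [2, 2], [2, 2], [1, 0], [1, 2]],
                   index=idx, columns=col)


def column_counts(d, thresh=1):
    # Hypothetical helper: per-column count of cells greater than `thresh`.
    return d.gt(thresh).sum(axis=0)


counts_a = column_counts(DFA)
counts_b = column_counts(DFB)

# Columns that score at least once in both frames.
shared = counts_a.gt(0) & counts_b.gt(0)

# Custom outer labels come from the dict keys passed to concat.
stacked = pd.concat({'bar': DFA.loc[:, shared], 'foo': DFB.loc[:, shared]},
                    names=['index_0'])
print(counts_a, counts_b, stacked, sep='\n\n')
```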

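【Solution 2】:

g = df.groupby(['index_0', 'index_1', 'index_2'])

Now you want the SQL-equivalent operation on the groups, i.e.

g.filter(lambda x: len(x.column_1) > 2)
g.count()

This is only a concept; I don't understand what you want to filter on. Note that x is a group, so you need to operate on it (len, set, values, etc.).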

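【Discussion】:

In the first step, I want to find, for each given row, how many cells are equal to or above a given value (2 in the example). So if we take just the first row of the first DF, the answer is 1, because only column 2 at index=(0,0) is >= 2. If we take index=(0,1), the answer is 2, and so on, as shown in the example.
df.filter(lambda x: min(x.columns) >= 2) This forces the minimum value of the columns in the group to be at least 2; otherwise the group is filtered out.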
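The group-filter idea discussed above can be made concrete, with the caveat that this is one interpretation and not the answerer's code. The sketch assumes the two frames are first concatenated under a new `index_0` level, then grouped by the shared `(index_1, index_2)` position, so each group holds one row from each frame.

```python
import pandas as pd

idx = pd.MultiIndex.from_product([[0, 1], [0, 1, 2]],
                                 names=['index_1', 'index_2'])
col = ['column_1', 'column_2']
DFA = pd.DataFrame([[1, 2], [2, 2], [2, 1], [-8, 1], [2, 0], [2, 1]],
                   index=idx, columns=col)
DFB = pd.DataFrame([[2, 2], [0, 1], [2, 2], [2, 2], [1, 0], [1, 2]],
                   index=idx, columns=col)

df = pd.concat({0: DFA, 1: DFB}, names=['index_0'])

# Each group contains the DFA row and the DFB row at the same position.
g = df.groupby(level=['index_1', 'index_2'])

# Keep a group only if every row in it has at least one cell above 1.
kept = g.filter(lambda x: x.gt(1).any(axis=1).all())
print(kept)
```

Because `filter` calls a Python function once per group, it is likely to be slower than a vectorized `transform('all')` mask on frames with thousands of groups.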

【Solution 3】:

Use a filter, .any() and pd.merge()

Recreate the dataframes:

import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product([[0,1],[0,1,2]], names=['one', 'two'])
columns = ['columns_1', 'columns_2']

DFA = pd.DataFrame(np.random.randint(-1,3, size=[6,2]), idx, columns)
DFB = pd.DataFrame(np.random.randint(-1,3, size=[6,2]), idx, columns)

print(DFA)

             columns_1  columns_2
one two                      
0   0           -1          2
    1            2         -1
    2           -1          0
1   0            1          2
    1            0          0
    2           -1         -1



print(DFB)

             columns_1  columns_2
one two                      
0   0            2         -1
    1            1          2
    2            2          1
1   0            0          0
    1           -1          2
    2            1         -1

Filter the dataframes, in this case for rows with any value > 1.

DFA = DFA.loc[(DFA>1).any(bool_only=True, axis=1),:]
DFB = DFB.loc[(DFB>1).any(bool_only=True, axis=1),:]

print(DFA)

             columns_1  columns_2
one two                      
0   0           -1          2
    1            2         -1
1   0            1          2

print(DFB)

             columns_1  columns_2
one two                      
0   0            2         -1
    1            1          2
    2            2          1
1   1           -1          2

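Merge the two. Using an outer join gets you close. I'm not sure about the resulting index order, but the first run of level-0 values [0, 1] comes from DFA.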

         columns_1_x  columns_2_x  columns_1_y  columns_2_y
one two                                                    
0   0           -1.0          2.0          2.0         -1.0
    1            2.0         -1.0          1.0          2.0
1   0            1.0          2.0          NaN          NaN
0   2            NaN          NaN          2.0          1.0
1   1            NaN          NaN         -1.0          2.0
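The merge call itself is not shown in the answer. A plausible reconstruction, offered only as an assumption, is an index-on-index `pd.merge` with `how='outer'`, which yields the `_x`/`_y` suffixed columns seen above. The sketch hard-codes the filtered frames printed earlier (the names `DFA_f` and `DFB_f` are illustrative) and also shows an inner join, which keeps only the positions present in both frames.

```python
import pandas as pd

names = ['one', 'two']
columns = ['columns_1', 'columns_2']

# Filtered frames copied from the printed output above (illustrative names).
DFA_f = pd.DataFrame(
    [[-1, 2], [2, -1], [1, 2]],
    index=pd.MultiIndex.from_tuples([(0, 0), (0, 1), (1, 0)], names=names),
    columns=columns)
DFB_f = pd.DataFrame(
    [[2, -1], [1, 2], [2, 1], [-1, 2]],
    index=pd.MultiIndex.from_tuples([(0, 0), (0, 1), (0, 2), (1, 1)], names=names),
    columns=columns)

# Outer join: union of index positions, NaN where one side is missing.
outer = pd.merge(DFA_f, DFB_f, left_index=True, right_index=True, how='outer')

# Inner join: only positions where both frames passed the filter.
inner = pd.merge(DFA_f, DFB_f, left_index=True, right_index=True, how='inner')

print(outer)
print(inner)
```

The row order of the outer result can differ between pandas versions, so it may not match the listing above exactly.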

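【Discussion】:

So this is a very good idea. What if I had a third column, say column_3, and I wanted the rows with >= 2 in at least 2 columns? The second point is not what I need; I think I have to concatenate DFA and DFB and keep only the rows that score.
I changed it to an outer join, but the index is still stacked on level 0.
For three or more columns, replace the .any() filter with this: (DFA>1).sum(axis=1)>=2 (forget astype, it isn't needed.)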
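The suggestion in the last comment can be tried on a small three-column frame. The data below is made up purely for illustration; only the mask expression comes from the comment.

```python
import pandas as pd

idx = pd.MultiIndex.from_product([[0, 1], [0, 1, 2]], names=['one', 'two'])

# Made-up example values, not taken from the thread.
df = pd.DataFrame(
    [[2, 2, 0], [2, 1, 1], [3, 0, 5], [0, 0, 2], [2, 2, 2], [1, 1, 1]],
    index=idx, columns=['column_1', 'column_2', 'column_3'])

# True where at least two cells in the row are greater than 1.
mask = (df > 1).sum(axis=1) >= 2

filtered = df.loc[mask]
print(filtered)
```

Summing the boolean frame counts the `True` cells directly, which is why no `astype` conversion is needed.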
