【Question Title】: Get Pandas Duplicate Row Count with Original Index
【Posted】: 2017-05-02 01:54:13
【Question】:

I need to find duplicate rows in a Pandas DataFrame and then add an extra column with the count. Say we have a DataFrame:

>>> print(df)

+----+-----+-----+-----+-----+-----+-----+-----+-----+
|    |   2 |   3 |   4 |   5 |   6 |   7 |   8 |   9 |
|----+-----+-----+-----+-----+-----+-----+-----+-----|
|  0 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |
|  1 |   2 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |
|  2 |   2 |   4 |   3 |   4 |   1 |   1 |   4 |   4 |
|  3 |   4 |   3 |   4 |   0 |   0 |   0 |   0 |   0 |
|  4 |   2 |   3 |   4 |   3 |   4 |   0 |   0 |   0 |
|  5 |   5 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |
|  6 |   4 |   5 |   0 |   0 |   0 |   0 |   0 |   0 |
|  7 |   1 |   1 |   4 |   0 |   0 |   0 |   0 |   0 |
|  8 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |
|  9 |   4 |   3 |   4 |   0 |   0 |   0 |   0 |   0 |
| 10 |   3 |   3 |   4 |   3 |   5 |   5 |   5 |   0 |
| 11 |   5 |   4 |   0 |   0 |   0 |   0 |   0 |   0 |
| 12 |   5 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |
| 13 |   0 |   4 |   0 |   0 |   0 |   0 |   0 |   0 |
| 14 |   2 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |
| 15 |   1 |   3 |   5 |   0 |   0 |   0 |   0 |   0 |
| 16 |   4 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |
| 17 |   3 |   3 |   4 |   4 |   0 |   0 |   0 |   0 |
| 18 |   5 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |
+----+-----+-----+-----+-----+-----+-----+-----+-----+

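The frame above should then become the frame below, with an additional column holding the count. Note that the original index is preserved: each group of duplicate rows keeps the index of its first occurrence.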

+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
|    |   2 |   3 |   4 |   5 |   6 |   7 |   8 |   9 |  10 |
|----+-----+-----+-----+-----+-----+-----+-----+-----+-----|
|  0 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |   2 |
|  1 |   2 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |   2 |
|  2 |   2 |   4 |   3 |   4 |   1 |   1 |   4 |   4 |   1 |
|  3 |   4 |   3 |   4 |   0 |   0 |   0 |   0 |   0 |   2 |
|  4 |   2 |   3 |   4 |   3 |   4 |   0 |   0 |   0 |   1 |
|  5 |   5 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |   3 |
|  6 |   4 |   5 |   0 |   0 |   0 |   0 |   0 |   0 |   1 |
|  7 |   1 |   1 |   4 |   0 |   0 |   0 |   0 |   0 |   1 |
| 10 |   3 |   3 |   4 |   3 |   5 |   5 |   5 |   0 |   1 |
| 11 |   5 |   4 |   0 |   0 |   0 |   0 |   0 |   0 |   1 |
| 13 |   0 |   4 |   0 |   0 |   0 |   0 |   0 |   0 |   1 |
| 15 |   1 |   3 |   5 |   0 |   0 |   0 |   0 |   0 |   1 |
| 16 |   4 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |   1 |
| 17 |   3 |   3 |   4 |   4 |   0 |   0 |   0 |   0 |   1 |
+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+

I have seen other solutions, such as:

 df.groupby(list(df.columns.values)).size()

But this returns a matrix with gaps and without the original index.
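To make the problem concrete, here is a minimal sketch on a small hypothetical frame (the data and the name `counts` are illustrative, not taken from the question). It shows that `groupby(...).size()` returns a Series indexed by the grouped values rather than by the original row labels; the "gaps" are presumably the sparsified display of that MultiIndex.

```python
import pandas as pd

# Hypothetical 3-column frame; rows 0 and 2 are duplicates of each other.
df = pd.DataFrame(
    [[0, 0, 0], [2, 0, 0], [0, 0, 0], [4, 3, 4]],
    columns=["2", "3", "4"],
)

counts = df.groupby(list(df.columns)).size()

# The index is now a MultiIndex built from the column values,
# so the original row labels 0..3 are no longer available.
print(type(counts.index).__name__)
print(list(counts.index.names))
print(counts)
```

The counts themselves are correct; what is missing is a way to map each group back to the label of its first row.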

【Question Comments】:

    Tags: python pandas group-by aggregate multiple-columns


    【Solution 1】:

    You can first use reset_index to convert the index to a column, then aggregate with first and size.

    Also, if you need to group by all columns, exclude the index column with difference:

    print (df.columns.difference(['index']))
    Index(['2', '3', '4', '5', '6', '7', '8', '9'], dtype='object')
    
    print (df.reset_index()
             .groupby(df.columns.difference(['index']).tolist())['index']
             .agg(['first', 'size'])
             .reset_index()
             .set_index(['first'])
             .sort_index()
             .rename_axis(None))
    
        2  3  4  5  6  7  8  9  size
    0   0  0  0  0  0  0  0  0     2
    1   2  0  0  0  0  0  0  0     2
    2   2  4  3  4  1  1  4  4     1
    3   4  3  4  0  0  0  0  0     2
    4   2  3  4  3  4  0  0  0     1
    5   5  0  0  0  0  0  0  0     3
    6   4  5  0  0  0  0  0  0     1
    7   1  1  4  0  0  0  0  0     1
    10  3  3  4  3  5  5  5  0     1
    11  5  4  0  0  0  0  0  0     1
    13  0  4  0  0  0  0  0  0     1
    15  1  3  5  0  0  0  0  0     1
    16  4  0  0  0  0  0  0  0     1
    17  3  3  4  4  0  0  0  0     1
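
For anyone who wants to run the chain above end to end, here is a self-contained sketch. The frame is rebuilt from the table in the question, assuming string column labels '2' to '9' as the Index output above suggests; the names `rows`, `cols` and `out` are only illustrative.

```python
import pandas as pd

# Leading values of each row from the question's table; the rest is zero-padded.
rows = [
    [0], [2], [2, 4, 3, 4, 1, 1, 4, 4], [4, 3, 4], [2, 3, 4, 3, 4],
    [5], [4, 5], [1, 1, 4], [0], [4, 3, 4],
    [3, 3, 4, 3, 5, 5, 5], [5, 4], [5], [0, 4], [2],
    [1, 3, 5], [4], [3, 3, 4, 4], [5],
]
cols = [str(c) for c in range(2, 10)]
df = pd.DataFrame([r + [0] * (len(cols) - len(r)) for r in rows], columns=cols)

out = (df.reset_index()                 # original index becomes column 'index'
         .groupby(cols)['index']        # group identical rows together
         .agg(['first', 'size'])        # first original label and group size
         .reset_index()                 # group keys back to ordinary columns
         .set_index('first')            # restore the first label as the index
         .sort_index()
         .rename_axis(None))

print(out)
```

Grouping sorts the result by the key values, which is why the sort_index step is needed to get the rows back into their original order.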
    

    If the count column should be named as the next column, 10, use rename:

    #if necessary convert to str
    last_col = str(df.columns.astype(int).max() + 1)
    print (last_col)
    10
    
    print (df.reset_index()
            .groupby(df.columns.difference(['index']).tolist())['index']
            .agg(['first', 'size'])
            .reset_index()
            .set_index(['first'])
            .sort_index()
            .rename_axis(None)
            .rename(columns={'size':last_col}))
    
        2  3  4  5  6  7  8  9  10
    0   0  0  0  0  0  0  0  0   2
    1   2  0  0  0  0  0  0  0   2
    2   2  4  3  4  1  1  4  4   1
    3   4  3  4  0  0  0  0  0   2
    4   2  3  4  3  4  0  0  0   1
    5   5  0  0  0  0  0  0  0   3
    6   4  5  0  0  0  0  0  0   1
    7   1  1  4  0  0  0  0  0   1
    10  3  3  4  3  5  5  5  0   1
    11  5  4  0  0  0  0  0  0   1
    13  0  4  0  0  0  0  0  0   1
    15  1  3  5  0  0  0  0  0   1
    16  4  0  0  0  0  0  0  0   1
    17  3  3  4  4  0  0  0  0   1
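
A possible alternative, which is not part of the answer above: compute the group size per row with transform, then drop the repeated rows. Because transform returns a result aligned to the original index, no reset_index or set_index round trip is needed. This is a sketch on a small hypothetical frame; `key_cols`, `counts` and `out` are illustrative names, and it assumes the key columns contain no NaN (groupby drops NaN keys by default).

```python
import pandas as pd

# Hypothetical frame: rows 0/3 and rows 1/4 are duplicates.
df = pd.DataFrame(
    [[0, 0], [2, 0], [4, 3], [0, 0], [2, 0], [5, 1]],
    columns=["2", "3"],
)
key_cols = list(df.columns)

# Name of the new column: one past the largest existing label, as in the answer.
last_col = str(df.columns.astype(int).max() + 1)

# Size of the group each row belongs to, aligned to the original index.
counts = df.groupby(key_cols)[key_cols[0]].transform("size")

# Attach the counts, then keep only the first occurrence of each row.
out = (df.assign(**{last_col: counts})
         .drop_duplicates(subset=key_cols, keep="first"))

print(out)
```

Since drop_duplicates keeps rows in their original order, the result does not need to be sorted afterwards.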
    

    【Comments】:

    • Glad I could help!