Removing duplicates in a dataframe by creating a list of their indices (pandas): answers

【Question title】: Removing duplicates in dataframe via creating a list of their indices pandas
【Posted】: 2021-04-25 22:14:37
【Question】:

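I have a dataframe (=used_dataframe) that contains duplicates. I need to create a list containing the indices of those duplicates. For this, I used the function I found here: Find indices of duplicate rows in pandas DataFrame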

def duplicates(x):

    #dataframe = pd.read_csv(x)
    #df = dataframe.iloc[: , 1:]
    df = x

    duplicateRowsDF = df[df.duplicated()]

    df = df[df.duplicated(keep=False)]
    tuppl = df.groupby(list(df)).apply(lambda x: tuple(x.index)).tolist() #this is the function!

    n = 1 # N. . .
    indicees = [x[n] for x in tuppl]
    
    return indicees

duplicates(used_df)

The next function I need is one that removes the duplicates from the dataset, which I wrote like this:


def handling_duplicate_entries(mn):
    x = tidy(mn)

    indices = duplicates(tidy(mn))

    used_df = x
    used_df['indexcol'] = range(0, len(tidy(mn)))
    
    dropped = used_df[~used_df['indexcol'].isin(indices)]

    finito = dropped.drop(columns=['indexcol'])
    
    return finito

handling_duplicate_entries(used_df)

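It works, but when I try to verify my solution (that is, confirm that all duplicates were removed) by calling duplicates(handling_duplicate_entries(used_df)), which should return an empty result to show that no duplicates remain, it raises the error 'DataFrame' object has no attribute 'tolist'. This was also raised as a comment on the question linked above, but it was not resolved there. Frankly, I would love to find a different solution for the duplicates function, since I don't really understand it, but so far I haven't found one.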


Tags: python pandas

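【Solution 1】:

OK, I'll do my best.

If you are trying to find the indices of the duplicates and want to store those values in a list, you can use the following code. I have also included a small example that builds a dataframe with the duplicated rows (taken from the original) and another with the rows that have no duplicates.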

import pandas as pd

# Toy dataset
data = {
    'A': [0, 0, 3, 0, 3, 0],
    'B': [0, 1, 3, 2, 3, 0],
    'C': [0, 1, 3, 2, 3, 0]
}

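df = pd.DataFrame(data)
# Count how often each distinct (A, B, C) row occurs
group = df.groupby(list(df.columns)).size()
# Keep only the rows that occur more than once
group = group[group>1].reset_index(name = 'count')
# Replace the count with a running id for each duplicated row
group = group.drop(columns=['count']).reset_index().rename(columns={'index':'count'})
# Merge back on A, B, C to recover the original index labels
idxs = df.reset_index().merge(group, how = 'right')['index'].values
duplicates = df.loc[idxs]                      # every copy of a duplicated row
no_duplicates = df.loc[~df.index.isin(idxs)]   # rows that occur exactly once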


duplicates
   A  B  C
0  0  0  0
5  0  0  0
2  3  3  3
4  3  3  3

no_duplicates
   A  B  C
1  0  1  1
3  0  2  2
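
Note that the code above drops every copy of a duplicated row, whereas the function in the question keeps the first copy and removes the later ones. If that is what you want, a shorter route may be `DataFrame.duplicated`, which avoids the groupby entirely. This is only a minimal sketch on the same toy data; `duplicate_indices` is a hypothetical helper name, not something from pandas or from the question.

```python
import pandas as pd

# Same toy data as in the example above
df = pd.DataFrame({
    'A': [0, 0, 3, 0, 3, 0],
    'B': [0, 1, 3, 2, 3, 0],
    'C': [0, 1, 3, 2, 3, 0],
})

def duplicate_indices(frame, keep='first'):
    """Hypothetical helper: index labels of the rows flagged by DataFrame.duplicated."""
    mask = frame.duplicated(keep=keep)
    return frame.index[mask].tolist()

repeat_idx = duplicate_indices(df)            # later copies only
all_idx = duplicate_indices(df, keep=False)   # every row that has a twin

deduped = df.drop(index=repeat_idx)           # keeps the first copy of each row

# Running the check on the cleaned frame gives an empty list instead of an error
remaining = duplicate_indices(deduped)
```

Because the helper only builds a boolean mask, it also works on a frame without duplicates, so the check from the question simply returns an empty list. If you do not need the index list at all, `df.drop_duplicates()` should give the same frame as `deduped`.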
