Handling duplicate data with pandas

【Question title】: Handling duplicate data with pandas
【Posted】: 2019-11-18 21:49:42
【Question】:

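Hi everyone, I am having some trouble with the pandas Python library. Basically, I am reading a csv file with pandas and want to remove the duplicates. I have tried everything I could find and the problem persists.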

import sqlite3
import pandas as pd
import numpy
connection = sqlite3.connect("test.db")

## pandas dataframe
dataframe = pd.read_csv('Countries.csv')
##dataframe.head(3)

countries = dataframe.loc[:, ['Retailer country', 'Continent']] 

countries.head(6)

The output is:

 Retailer country Continent
-----------------------------
0 United States    North America
1 Canada           North America
2 Japan                    Asia
3 Italy                   Europe
4 Canada           North America
5 United States    North America
6 France                  Europe

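I want to remove the duplicate values based on the columns of the dataframe above, so that I end up with the unique combinations of country and continent. The desired output would be: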

 Retailer country Continent
-----------------------------
0 United States    North America
1 Canada           North America
2 Japan                    Asia
3 Italy                   Europe
4 France                  Europe

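I tried some of the approaches mentioned in "Using pandas for duplicate values", looked around the web, and realized I could use the df.drop_duplicates() function. But when I run the code below and call df.head(3), it shows only one row. What can I do to get these unique rows and eventually iterate over them?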

countries.head(4)
country = countries['Retailer country']
continent = countries['Continent']
df = pd.DataFrame({'a':[country], 'b':[continent]})
df.head(3)

【Comments】:

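It is not clear what the last code snippet is trying to do. It does not contain the drop_duplicates function you mention (which seems to be the answer to this question), and its only effect appears to be creating a new DataFrame that renames the Retailer country / Continent columns to a/b and packs all the values into a single row per column...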
Have you looked at the documentation? drop_duplicates takes a subset parameter: df.drop_duplicates(subset='Retailer country')
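@Datanovice is right. Just don't forget to assign the result back to df.
Thanks for the suggestions, everyone. I know I can use a function like drop_duplicates to solve my problem, but the trouble is with this line: df = pd.DataFrame({'a':[country], 'b':[continent]}). When I printed it with df.head it did not return any rows, and now it throws the error "module 'pandas' has no attribute 'Dataframe'".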
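
The points raised in these comments can be put together in a short sketch. It assumes sample data that mirrors the output shown in the question (the real Countries.csv is not available here), and the names wrapped and unique_countries are made up for illustration. Wrapping a whole Series in a list makes it a single cell, which is why only one row appeared; drop_duplicates returns a new DataFrame, so its result has to be assigned. Note also that the class is spelled pd.DataFrame with a capital F; writing pd.Dataframe produces the AttributeError quoted above.

```python
import pandas as pd

# Sample data standing in for Countries.csv (an assumption for this sketch).
countries = pd.DataFrame({
    'Retailer country': ['United States', 'Canada', 'Japan', 'Italy',
                         'Canada', 'United States', 'France'],
    'Continent': ['North America', 'North America', 'Asia', 'Europe',
                  'North America', 'North America', 'Europe'],
})

# [series] is a one-element list, so each column holds a single cell
# containing the whole Series: the frame has one row.
wrapped = pd.DataFrame({'a': [countries['Retailer country']],
                        'b': [countries['Continent']]})
print(wrapped.shape)

# drop_duplicates keeps the first occurrence of each pair and returns a
# new DataFrame; assign it, then renumber the index.
unique_countries = (
    countries
    .drop_duplicates(subset=['Retailer country', 'Continent'])
    .reset_index(drop=True)
)
print(unique_countries)

# Iterate over the unique rows as plain tuples.
for country, continent in unique_countries.itertuples(index=False, name=None):
    print(country, continent)
```

Unlike a groupby, drop_duplicates preserves the order in which each pair first appears.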

Tags: python-3.x pandas dataframe unique

【Solution 1】:

It looks like a simple groupby would solve your problem.

import pandas as pd

# Shorthand for the continent names
na = 'North America'
a = 'Asia'
e = 'Europe'
df = pd.DataFrame({'Retailer': [0, 1, 2, 3, 4, 5, 6],
                   'country': ['United States', 'Canada', 'Japan', 'Italy', 'Canada', 'United States', 'France'],
                   'continent': [na, na, a, e, na, na, e]})

# Group by each (country, continent) pair and count the rows in each group
df.groupby(['country', 'continent']).agg('count').reset_index()

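The Retailer column now shows how many times each country/continent combination occurs. If you assign the grouped result back to df, you can drop that column with `df = df[['country', 'continent']]`.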
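
As a minimal sketch of that last step, the grouped result can be stored and then reduced to the two key columns. The variable names counts and unique_pairs are hypothetical, and the data repeats the sample frame from this answer.

```python
import pandas as pd

df = pd.DataFrame({
    'Retailer': [0, 1, 2, 3, 4, 5, 6],
    'country': ['United States', 'Canada', 'Japan', 'Italy',
                'Canada', 'United States', 'France'],
    'continent': ['North America', 'North America', 'Asia', 'Europe',
                  'North America', 'North America', 'Europe'],
})

# Count rows per (country, continent) pair; groupby sorts by the keys.
counts = df.groupby(['country', 'continent']).agg('count').reset_index()

# Keep only the key columns to get the unique pairs without the counts.
unique_pairs = counts[['country', 'continent']]
print(unique_pairs)
```

Because groupby sorts by its keys by default, the pairs come back ordered by country rather than in their original order.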

【Comments】:

Ty, this solved my problem. I now get the unique values sorted by country :)