Pandas: How to filter repeated values of an axis? Answers

【Question title】: Pandas: How to filter repeated values of an axis?
【Posted】: 2021-12-27 21:46:00
【Question description】:

Suppose we have a data frame made up of the following variables:

Institution (name of the university)
Country (name of the country of the institution)
Year (integer, year in which that university was scored)
World_rank (integer, position in the world rank)
Alumni_employment (integer, number of alumni placements)

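We want to filter all American universities that were ranked in 2015 …

While the first 3 requirements are easy to meet, I am stuck on the last one.

Here is my attempt: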

import pandas as pd
import numpy as np
data = pd.read_csv("data/cwurData.csv")

americanuniv = data[(data.country == 'USA') & (data.year == 2015) & (data.world_rank <= 500)]
for x in data.alumni_employment:
    for y in data.alumni_employment:
        if x == y:   
            print(americanuniv['institution'])

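Of course, it doesn't work. Honestly, I don't know how to proceed with the last requirement. Do you have any ideas?

Thank you very much!

【Question discussion】: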

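The last requirement is a bit ambiguous. You could have uni A and uni B with the same alumni_employment (say 10), while uni C and uni D share another alumni_employment value (say 20) that differs from A and B's. Do you want to list A, B, C and D in that case?
Hi @Burrito, I'm afraid I need to "group" A, B, C and D, provided they all have the same alumni employment. I think iterating would be a good option, but that is exactly where I'm stuck. Thanks for the quick reply!
You can use groupby('alumni_employment') to get the groups you want; what comes next depends on what you want to do with them. Just print the rows in their groups?
Hmm... I tried doing it the other way round: grouping by alumni first and then applying the filters, but it didn't work. What I need to print are the rows, correctly filtered and grouped, so that only rows meeting all the conditions are shown. The key question is how to express "only those that have the same alumni value".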

Tags: python pandas dataframe filter

【Solution 1】:

For simplicity, let's use the following data frame:

df = pd.DataFrame({'institution': ['A', 'B', 'C', 'D', 'E', 'F', 'G'], 
  'alumni_employment': [10, 20, 10, 30, 20, 5, 20]})

To get the institutions that share the same 'alumni_employment', use groupby. Then filter to drop the groups of size 1.

g = df.groupby('alumni_employment')
final = g.filter(lambda x: len(x) > 1)

The result is:


    institution alumni_employment
0   A           10
1   B           20
2   C           10
4   E           20
6   G           20
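
As a cross-check on the groupby/filter result above, the same rows can also be picked with a boolean mask. This is a minimal sketch of a different technique than the one the answer uses: it relies on `Series.duplicated` with `keep=False`, which flags every occurrence of a repeated value. The variable names `mask`, `shared` and `via_groupby` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "institution": ["A", "B", "C", "D", "E", "F", "G"],
    "alumni_employment": [10, 20, 10, 30, 20, 5, 20],
})

# keep=False marks every row whose value occurs more than once,
# not only the second and later occurrences.
mask = df["alumni_employment"].duplicated(keep=False)
shared = df[mask]

# Reference result from the groupby/filter approach, for comparison.
via_groupby = df.groupby("alumni_employment").filter(lambda x: len(x) > 1)

print(shared)
```

Both approaches keep the original row order and index, so they should select identical rows for this data; the mask avoids calling a Python lambda once per group.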

If you want those with the same 'alumni_employment' printed together, you can do this:

final = final.sort_values('alumni_employment')
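
If the goal is to see each group separately rather than one sorted table, one possible sketch is to collect the institutions per shared value. It assumes the same toy data frame as above; the name `groups` is illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    "institution": ["A", "B", "C", "D", "E", "F", "G"],
    "alumni_employment": [10, 20, 10, 30, 20, 5, 20],
})

# Keep only the rows whose alumni_employment value is shared.
final = df.groupby("alumni_employment").filter(lambda x: len(x) > 1)

# Map each shared alumni_employment value to the institutions that have it.
groups = final.groupby("alumni_employment")["institution"].apply(list).to_dict()

for value, names in groups.items():
    print(value, names)
```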

【Discussion】:

Thank you very much, Burrito! Now you've given me food for thought :)

【Solution 2】:

    americanuniv = data.loc[(data["country"] == 'USA') & (data["year"] == 2015) & (data["world_rank"] <= 500)]
    americanuniv.groupby(by = "alumni_employment")["institution"]
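
A grouped object by itself does not select any rows, so a filtering step still has to follow. The sketch below chains the country/year/rank filter with the groupby/filter idea from Solution 1. It assumes the lowercase column names used in the question's code, and the small `data` frame is a made-up stand-in for `data/cwurData.csv`, not real ranking data.

```python
import pandas as pd

# Hypothetical stand-in for data/cwurData.csv; all values are invented.
data = pd.DataFrame({
    "institution": ["U1", "U2", "U3", "U4", "U5", "U6"],
    "country": ["USA", "USA", "USA", "USA", "Canada", "USA"],
    "year": [2015, 2015, 2015, 2014, 2015, 2015],
    "world_rank": [1, 40, 300, 2, 25, 700],
    "alumni_employment": [10, 10, 55, 10, 10, 10],
})

# Step 1: the three simple conditions from the question.
americanuniv = data.loc[
    (data["country"] == "USA")
    & (data["year"] == 2015)
    & (data["world_rank"] <= 500)
]

# Step 2: among the remaining rows, keep only alumni_employment values
# that occur more than once, then sort so equal values sit together.
result = (
    americanuniv.groupby("alumni_employment")
    .filter(lambda g: len(g) > 1)
    .sort_values("alumni_employment", kind="stable")
)

print(result[["institution", "alumni_employment"]])
```

The order of the two steps matters: grouping after the country/year/rank filter means a value only counts as shared if at least two of the remaining universities have it.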

【Discussion】:

It does the same thing as my code without the 'for' loop, but thanks for trying!