Keep first occurrence row by id and first occurrence when value in column changes

【Question title】: Keep first occurrence row by id and first occurrence when value in column changes
【Posted】: 2020-09-14 00:19:05
【Question】:

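In the example df below, what is the best way to keep:

the first row in which a Score appears for each id
then, for each id, the first row each time the value in Score changes, dropping the repeated rows until it changes again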

Example df

      date      id   Score
0   2001-09-06  1       3
1   2001-09-07  1       3
2   2001-09-08  1       4
3   2001-09-09  2       6
4   2001-09-10  2       6
5   2001-09-11  1       4
6   2001-09-12  2       5
7   2001-09-13  2       5
8   2001-09-14  1       3

Desired df

      date      id   Score
0   2001-09-06  1       3
1   2001-09-08  1       4
2   2001-09-09  2       6
3   2001-09-12  2       5
4   2001-09-14  1       3

【Comments】:

Tags: python pandas dataframe duplicates unique

【Solution 1】:

Use groupby together with diff:

print(df[df.groupby("id")["Score"].diff() != 0])

         date  id  Score
0  2001-09-06   1      3
2  2001-09-08   1      4
3  2001-09-09   2      6
6  2001-09-12   2      5
8  2001-09-14   1      3

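The first occurrence within each id always produces NaN from diff, and NaN != 0 evaluates to True, so that row is kept.

【Discussion】:

【Solution 2】:

Following your logic: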
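To make the role of NaN concrete, here is a minimal, self-contained sketch that rebuilds the sample data from the question and applies the same groupby/diff mask. The names `step`, `keep`, and `result` are illustrative only, and the final `reset_index(drop=True)` is an added assumption to match the 0..4 index shown in the desired df.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "date": ["2001-09-06", "2001-09-07", "2001-09-08", "2001-09-09",
             "2001-09-10", "2001-09-11", "2001-09-12", "2001-09-13",
             "2001-09-14"],
    "id": [1, 1, 1, 2, 2, 1, 2, 2, 1],
    "Score": [3, 3, 4, 6, 6, 4, 5, 5, 3],
})

# Difference to the previous Score within the same id.
# The first row of each id has no predecessor, so its diff is NaN.
step = df.groupby("id")["Score"].diff()

# NaN != 0 is True, so first rows are kept along with every change.
keep = step != 0

# Renumber the rows to match the desired output's index.
result = df[keep].reset_index(drop=True)
print(result)
```

Note that diff requires a numeric column; for other dtypes a shift-based comparison is the usual alternative.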

# shift Score within id
# shifted score at each group start is `NaN`
shifted_scores = df['Score'].groupby(df['id']).shift()

# change of Score within each id
# since first shifted score in each group is `NaN`
# mask is also True at first line of each group
mask = df['Score'].ne(shifted_scores)

# output
df[mask]

Output:

         date  id  Score
0  2001-09-06   1      3
2  2001-09-08   1      4
3  2001-09-09   2      6
6  2001-09-12   2      5
8  2001-09-14   1      3
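One practical difference from a diff-based mask is that comparing against the shifted column does not need numeric data. The sketch below illustrates this with a hypothetical `Grade` column of strings; the frame and the names `previous`, `changed`, and `kept` are invented for the example and are not part of the original question.

```python
import pandas as pd

# Hypothetical data: a string-valued column instead of a numeric Score.
grades = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 1],
    "Grade": ["a", "a", "b", "c", "c", "b"],
})

# Previous Grade within the same id; NaN at the first row of each id.
previous = grades.groupby("id")["Grade"].shift()

# True where the Grade differs from the previous one in the same id,
# including the first row of each id (a value never equals NaN).
changed = grades["Grade"].ne(previous)

kept = grades[changed]
print(kept)
```

One caveat for both approaches: if the compared column itself contains missing values, consecutive NaN rows are each treated as a change, because NaN never compares equal to NaN.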

【Discussion】:

【Solution 3】:

df.groupby(['id', 'Score']).first()

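【Discussion】:

Does this match the OP's expected output?
This will miss the last row of the sample data, because it duplicates an earlier row on both columns.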
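The objection can be checked directly on the sample data. The following sketch groups on both columns (using the column name `Score` as it appears in the question) and shows that only four rows survive; `grouped` is an illustrative name, and `reset_index()` is added here only to make the result easier to inspect.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "date": ["2001-09-06", "2001-09-07", "2001-09-08", "2001-09-09",
             "2001-09-10", "2001-09-11", "2001-09-12", "2001-09-13",
             "2001-09-14"],
    "id": [1, 1, 1, 2, 2, 1, 2, 2, 1],
    "Score": [3, 3, 4, 6, 6, 4, 5, 5, 3],
})

# One row per distinct (id, Score) pair, regardless of where it occurs.
grouped = df.groupby(["id", "Score"]).first().reset_index()
print(grouped)

# id 1 returns to Score 3 on 2001-09-14, but (1, 3) was already seen
# on 2001-09-06, so the later row is folded into the same group.
print("2001-09-14" in grouped["date"].tolist())
```

The result is also ordered by the group keys rather than by the original row order, so it differs from the desired df in ordering as well as in row count.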