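[Posted]: 2021-05-23 00:00:52
[Question]:
I'm trying to create a new column in a pandas DataFrame. For each value of groupcol, I want the new column to be "yes" on the row whose datecol is the earliest date in that group. However, some groups have duplicate datecol values, so when the earliest datecol is duplicated I want to break the tie using the earliest date in publishcol. In the end, each groupcol group should have exactly one row where the new column is "yes".
This is what I have: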
| groupcol | namecol | datecol | publishcol |
| -------- | ------- | ---------- | ---------- |
| A | Bob | 2020-01-01 | 2020-01-01 |
| A | Ralph | 2020-01-01 | 2020-01-04 |
| B | Carl | 2020-04-04 | 2020-04-04 |
| B | Joe | 2020-04-04 | 2020-05-05 |
| B | Fred | 2020-03-04 | 2020-07-21 |
This is what I want:
| groupcol | namecol | datecol | publishcol | keep |
| -------- | ------- | ---------- | ---------- | ---- |
| A | Bob | 2020-01-01 | 2020-01-01 | yes |
| A | Ralph | 2020-01-01 | 2020-01-04 | no |
| B | Carl | 2020-04-04 | 2020-04-04 | no |
| B | Joe | 2020-04-04 | 2020-05-05 | no |
| B | Fred | 2020-03-04 | 2020-07-21 | yes |
This is what I'm doing right now:
import numpy as np
import pandas as pd

test = pd.DataFrame({
    "groupcol": ["A", "A", "B", "B", "B"],
    "namecol": ["Bob", "Ralph", "Carl", "Joe", "Fred"],
    "datecol": ["2020-01-01", "2020-01-01", "2020-04-04", "2020-04-04", "2020-03-04"],
    "publishcol": ["2020-01-01", "2020-01-04", "2020-04-04", "2020-05-05", "2020-07-21"],
})

# Flag rows whose datecol equals the earliest datecol within their groupcol
test["checkone"] = np.where(
    test.datecol == test.groupby("groupcol")["datecol"].transform("min"),
    "want", "drop")

# Flag rows whose publishcol is the earliest within each (groupcol, datecol) pair
test["checktwo"] = np.where(
    test.publishcol == test.groupby(["groupcol", "datecol"])["publishcol"].transform("min"),
    "want", "drop")

# Final column: "yes" only where both checks passed
test["keep"] = np.where(
    (test.checkone == "want") & (test.checktwo == "want"),
    "yes", "no")
This gets me what I want, but it seems like a tedious way to do it. Is there a better way?
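For reference, here is a rough sketch of the kind of shorter approach I have in mind, not a tested solution. It assumes the dates are ISO-formatted strings (so they sort chronologically as text) and that the DataFrame index is unique. The names `ordered` and `first_idx` are just illustrative.

```python
import numpy as np
import pandas as pd

test = pd.DataFrame({
    "groupcol": ["A", "A", "B", "B", "B"],
    "namecol": ["Bob", "Ralph", "Carl", "Joe", "Fred"],
    "datecol": ["2020-01-01", "2020-01-01", "2020-04-04", "2020-04-04", "2020-03-04"],
    "publishcol": ["2020-01-01", "2020-01-04", "2020-04-04", "2020-05-05", "2020-07-21"],
})

# Sort so that, within each group, the earliest datecol comes first,
# with ties on datecol broken by the earliest publishcol.
ordered = test.sort_values(["groupcol", "datecol", "publishcol"])

# drop_duplicates keeps the first row per groupcol and preserves the
# original index labels, so these labels identify the rows to keep.
first_idx = ordered.drop_duplicates("groupcol").index

# Mark those rows "yes" and everything else "no", in the original row order.
test["keep"] = np.where(test.index.isin(first_idx), "yes", "no")
```

One difference from my current code: if two rows in a group tie on both datecol and publishcol, this sketch marks only one of them "yes", whereas the two-check version would mark both.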
[Comments]:
Tags: python pandas dataframe numpy pandas-groupby