pandas: group by a column, find the min value of multiple columns, and create a new column for the min row in each group

【Question title】: pandas group by a column, find min value of multiple columns, and create new column for the min row in group
【Posted】: 2021-05-23 00:00:52
【Question description】:

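I am trying to create a new column in a pandas DataFrame. For each groupcol value, the column should be "yes" on the row whose datecol is the earliest date in that group. Some groups have duplicate datecol values, so when the earliest datecol is tied, I want to break the tie with the earliest date in publishcol. In the end, each group defined by groupcol should have exactly one row where the new column is "yes".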

This is what I have:

| groupcol | namecol | datecol    | publishcol |
| -------- | ------- | ---------- | ---------- |
|  A       | Bob     | 2020-01-01 | 2020-01-01 |
|  A       | Ralph   | 2020-01-01 | 2020-01-04 |
|  B       | Carl    | 2020-04-04 | 2020-04-04 |
|  B       | Joe     | 2020-04-04 | 2020-05-05 |
|  B       | Fred    | 2020-03-04 | 2020-07-21 |

This is what I want:

| groupcol | namecol | datecol    | publishcol | keep |
| -------- | ------- | ---------- | ---------- | ---- |
|  A       | Bob     | 2020-01-01 | 2020-01-01 | yes  |
|  A       | Ralph   | 2020-01-01 | 2020-01-04 | no   |
|  B       | Carl    | 2020-04-04 | 2020-04-04 | no   |
|  B       | Joe     | 2020-04-04 | 2020-05-05 | no   |
|  B       | Fred    | 2020-03-04 | 2020-07-21 | yes  |

This is what I am doing now:

import numpy as np
import pandas as pd

test = pd.DataFrame({
    "groupcol": ["A", "A", "B", "B", "B"],
    "namecol": ["Bob", "Ralph", "Carl", "Joe", "Fred"],
    "datecol": ["2020-01-01", "2020-01-01", "2020-04-04", "2020-04-04", "2020-03-04"],
    "publishcol": ["2020-01-01", "2020-01-04", "2020-04-04", "2020-05-05", "2020-07-21"],
})

# flag rows whose datecol is the earliest within their group
test['checkone'] = np.where(
    test.datecol == test.groupby('groupcol')['datecol'].transform('min'),
    'want', 'drop')

# flag rows whose publishcol is the earliest within their (groupcol, datecol) pair
test['checktwo'] = np.where(
    test.publishcol == test.groupby(['groupcol', 'datecol'])['publishcol'].transform('min'),
    'want', 'drop')

# a row is kept only if it passes both checks
test['keep'] = np.where(
    (test.checkone == "want") & (test.checktwo == "want"),
    "yes", "no")

This gets me what I want, but it seems like a tedious way to do it. Is there a better way?

Tags: python pandas dataframe numpy pandas-groupby

【Solution 1】:

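You can use sort_values on groupcol and the two date columns. Then, on the sorted groupcol, check where each value is not equal (ne) to its own shift. This gives True on the first row of each group, which after sorting is the row with the earliest datecol (and, on ties, the earliest publishcol).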

s = test.sort_values(['groupcol', 'datecol', 'publishcol'])['groupcol']
test['keep'] = s.ne(s.shift())
print(test)
  groupcol namecol     datecol  publishcol   keep
0        A     Bob  2020-01-01  2020-01-01   True
1        A   Ralph  2020-01-01  2020-01-04  False
2        B    Carl  2020-04-04  2020-04-04  False
3        B     Joe  2020-04-04  2020-05-05  False
4        B    Fred  2020-03-04  2020-07-21   True
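
To see why the shift comparison lands on the right rows, it may help to print the intermediate values. The following is only a sketch that rebuilds the question's frame; the names `shifted` and `is_first` are illustrative and not part of the answer.

```python
import pandas as pd

test = pd.DataFrame({
    "groupcol": ["A", "A", "B", "B", "B"],
    "namecol": ["Bob", "Ralph", "Carl", "Joe", "Fred"],
    "datecol": ["2020-01-01", "2020-01-01", "2020-04-04", "2020-04-04", "2020-03-04"],
    "publishcol": ["2020-01-01", "2020-01-04", "2020-04-04", "2020-05-05", "2020-07-21"],
})

# Sort so the earliest (datecol, publishcol) row comes first within each group.
# The result keeps the original index labels, only in a different order.
s = test.sort_values(["groupcol", "datecol", "publishcol"])["groupcol"]

shifted = s.shift()       # group label of the previous sorted row (NaN for the first row)
is_first = s.ne(shifted)  # True wherever the group label changes

print(pd.DataFrame({"group": s, "previous": shifted, "is_first": is_first}))

# Column assignment aligns on index labels, so each flag returns to its original row.
test["keep"] = is_first
```

Because the assignment aligns on the index, there is no need to sort `test` itself or to restore the original row order afterwards.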

Note that I would keep a boolean column rather than "yes" and "no". But if you really want those values, you can still use test['keep'] = s.ne(s.shift()).map({True: 'yes', False: 'no'})
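
One caveat worth noting: the sort compares the date columns as strings. That works here because ISO `YYYY-MM-DD` strings sort chronologically, but it would not hold for other date formats. Below is a hedged variant that parses the dates first and builds the "yes"/"no" labels with `np.where`, as the question does. The names `dates` and `first_in_group` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

test = pd.DataFrame({
    "groupcol": ["A", "A", "B", "B", "B"],
    "namecol": ["Bob", "Ralph", "Carl", "Joe", "Fred"],
    "datecol": ["2020-01-01", "2020-01-01", "2020-04-04", "2020-04-04", "2020-03-04"],
    "publishcol": ["2020-01-01", "2020-01-04", "2020-04-04", "2020-05-05", "2020-07-21"],
})

# Work on a copy with parsed dates so the sort is chronological
# regardless of how the date strings are formatted.
dates = test.assign(
    datecol=pd.to_datetime(test["datecol"]),
    publishcol=pd.to_datetime(test["publishcol"]),
)

s = dates.sort_values(["groupcol", "datecol", "publishcol"])["groupcol"]
first_in_group = s.ne(s.shift())

# np.where ignores index labels, so restore the original row order first.
test["keep"] = np.where(first_in_group.reindex(test.index), "yes", "no")
print(test)
```

The `reindex` step matters only because `np.where` returns a plain array. With `.map`, as in the answer, index alignment handles the ordering.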
