pandas: group by and find the first non-null value for all columns (answers)

【Question title】: pandas group by and find first non null value for all columns
【Posted】: 2020-03-21 17:12:59
【Question】:

I have a pandas DataFrame like this:

id  age   gender  country  sales_year
1   None   M       India    2016
2   23     F       India    2016
1   20     M       India    2015
2   25     F       India    2015
3   30     M       India    2019
4   36     None    India    2019

I want to group by id and, for each id, take the latest value by sales_year in every column, skipping nulls.

Expected output:

id  age   gender  country  sales_year
1   20     M       India    2016
2   23     F       India    2016
3   30     M       India    2019
4   36     None    India    2019

In PySpark I would do:

df = df.withColumn('age', f.first('age', True).over(Window.partitionBy("id").orderBy(df.sales_year.desc())))

But I need the same solution in pandas.

Edit: this should apply to all columns, not just age. I need the latest non-null data for every id that exists.

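【Comments on the question】:

Your output still contains None values, unless I'm missing something.
None is fine when no row for that id has a valid value. But when one is available it should be picked up, as for id 1 in the example, where age comes from the second-latest row instead of the row for the latest year.

Tags: python pandas group-by pyspark window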

【Solution 1】:

Use GroupBy.first:

df1 = df.groupby('id', as_index=False).first()
print (df1)
   id   age gender country  sales_year
0   1  20.0      M   India        2016
1   2  23.0      F   India        2016
2   3  30.0      M   India        2019
3   4  36.0    NaN   India        2019

If the rows are not already sorted by sales_year in descending order:

df2 = df.sort_values('sales_year', ascending=False).groupby('id', as_index=False).first()
print (df2)
   id   age gender country  sales_year
0   1  20.0      M   India        2016
1   2  23.0      F   India        2016
2   3  30.0      M   India        2019
3   4  36.0    NaN   India        2019

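【Discussion】:

How does this order the data by sales_year? I need the row for the most recent year first.
Is there a way to do this for all columns? I have 20 more columns like this and would like to handle them in one go, since the sort is always on the same column, sales_year, and the grouping is always on id.
@j' - It looks like you want the sorted solution rather than the first one? first returns the first value in each column that is not None or NaN, so the first snippet only works because the 2016 rows come first.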
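
The behaviour described in the last comment can be checked with a minimal, self-contained sketch. It rebuilds the sample data from the question, assuming the missing entries are real missing values (NaN / None) and not the string 'None'; the variable name `latest` is only illustrative.

```python
import numpy as np
import pandas as pd

# Sample data from the question; missing entries are assumed to be real nulls.
df = pd.DataFrame({
    "id": [1, 2, 1, 2, 3, 4],
    "age": [np.nan, 23, 20, 25, 30, 36],
    "gender": ["M", "F", "M", "F", "M", None],
    "country": ["India"] * 6,
    "sales_year": [2016, 2016, 2015, 2015, 2019, 2019],
})

# Sort so the most recent year comes first, then take the first
# non-null value of every column within each id.
latest = (
    df.sort_values("sales_year", ascending=False)
      .groupby("id", as_index=False)
      .first()
)
print(latest)
```

Because first works column by column, no column names other than id and sales_year have to be listed, so the same call should cover a frame with 20 more columns.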

【Solution 2】:

Use -

df.dropna(subset=['gender']).sort_values('sales_year', ascending=False).groupby('id')['age'].first()

Output

id
1    20
2    23
3    30
4    36
Name: age, dtype: object

Remove ['age'] to get the full rows -

df.dropna().sort_values('sales_year', ascending=False).groupby('id').first()

Output

   age gender country  sales_year
id                               
1   20      M   India        2015
2   23      F   India        2016
3   30      M   India        2019
4   36   None   India        2019

You can turn id back into a column with reset_index() -

df.dropna().sort_values('sales_year', ascending=False).groupby('id').first().reset_index()

Output

   id age gender country  sales_year
0   1  20      M   India        2015
1   2  23      F   India        2016
2   3  30      M   India        2019
3   4  36   None   India        2019

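【Discussion】:

id is a column here, not the index. I can't change that.
Added reset_index() @j'
Also, df.dropna() drops every row that has at least one None value. I don't want that. I have 20 more columns in this DF. The given solution does not work.
You can restrict dropna() to a subset of columns with the subset parameter. Updated, please check.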
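
The objection about dropna() can be illustrated with a small sketch. It assumes the missing entries in the sample data are real nulls; the names `row_wise` and `per_column` are made up for this example. Dropping whole rows before grouping loses id 4 entirely and pushes id 1 back to its 2015 row, while first on its own skips nulls one column at a time.

```python
import numpy as np
import pandas as pd

# Sample data from the question; missing entries are assumed to be real nulls.
df = pd.DataFrame({
    "id": [1, 2, 1, 2, 3, 4],
    "age": [np.nan, 23, 20, 25, 30, 36],
    "gender": ["M", "F", "M", "F", "M", None],
    "country": ["India"] * 6,
    "sales_year": [2016, 2016, 2015, 2015, 2019, 2019],
})

# Row-wise filtering: any row containing a null is removed before grouping.
row_wise = (
    df.dropna()
      .sort_values("sales_year", ascending=False)
      .groupby("id")
      .first()
)

# Column-wise skipping: first() ignores nulls independently in each column.
per_column = (
    df.sort_values("sales_year", ascending=False)
      .groupby("id")
      .first()
)

print(row_wise)    # id 4 is missing; id 1 reports sales_year 2015
print(per_column)  # all four ids; id 1 keeps sales_year 2016 with age 20
```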

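【Solution 3】:

import numpy as np
print(df.replace('None', np.nan).groupby('id').first())

First, replace the string 'None' with NaN.
Next, group by 'id' with groupby().
Finally, first() takes the first non-null value of each column within each group.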
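
The three steps above can be sketched end to end as follows. This assumes the data was loaded with the literal string 'None' as its placeholder, which is the only case where the replace step matters. The sort by sales_year is an addition that is not part of this answer; it is included so that "first" means "latest". The names `raw`, `cleaned` and `result` are illustrative.

```python
import numpy as np
import pandas as pd

# Assumption: missing entries arrived as the string "None", not as real nulls.
raw = pd.DataFrame({
    "id": [1, 2, 1, 2, 3, 4],
    "age": ["None", 23, 20, 25, 30, 36],
    "gender": ["M", "F", "M", "F", "M", "None"],
    "country": ["India"] * 6,
    "sales_year": [2016, 2016, 2015, 2015, 2019, 2019],
})

# Step 1: turn the placeholder strings into real missing values.
cleaned = raw.replace("None", np.nan)

# Steps 2 and 3: group by id and take the first non-null value per column,
# after sorting so that the latest year comes first (added here, see above).
result = (
    cleaned.sort_values("sales_year", ascending=False)
           .groupby("id")
           .first()
)
print(result)
```

Without the replace step, the string 'None' counts as a valid value, so first would return it for the age of id 1.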

【Discussion】: