Derive Date from first non NaN entry in Pandas Dataframe: Answers

【Question title】: Derive Date from first non NaN entry in Pandas Dataframe
【Posted】: 2021-12-13 08:34:36
【Question description】:

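I am quite new to pandas and Python and am now facing a big problem. I have a dataframe that contains the payments made for certain customers in a specific month of a specific year.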

Customer	A	A	B	C
Year of payment	2020	2021	2021	2020
january	NaN	14	NaN	NaN
february	NaN	20	30	NaN
march	20	NaN	30	NaN
etc	NaN	5	30	NaN

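Sometimes several years are listed per customer (as for customer A), sometimes not. Sometimes a given year contains only NaN values.

I need to find out when the first payment was made for each customer. The result should look like this.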

Customer	A	B	C
first payment	march 2020	february 2021	-

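I spent several hours yesterday trying to solve this, but did not even get close to a solution. It would be great if someone could point me in the right direction :)

Edit: here are the details of the dataframe: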

Index(['January__c', 'February__c', 'March__c', 'April__c', 'May__c', 'June__c', 'July__c', 'August__c', 'September__c', 'October__c', 'November__c', 'December__c'], dtype='object')
MultiIndex: 7369 entries, ('a1k06000004DjDdAAK', '2021') to ('a1k1o000006NRP4AAO', '2021.0')

Data columns (total 12 columns):

#	Column	Non-Null Count	Dtype
0	January__c	1810 non-null	float64
1	February__c	2207 non-null	float64
2	March__c	2614 non-null	float64
3	April__c	2991 non-null	float64
4	May__c	3328 non-null	float64
5	June__c	3789 non-null	float64
6	July__c	4208 non-null	float64
7	August__c	4583 non-null	float64
8	September__c	4757 non-null	float64
9	October__c	2515 non-null	float64
10	November__c	1345 non-null	float64
11	December__c	2193 non-null	float64

dtypes: float64(12)
memory usage: 879.9+ KB
None

【Question discussion】:

What is print (df.columns)?
What is print (df.info())?
You will find data analysis easier when your data is in a tidy data format: jeannicholashould.com/tidy-data-in-python.html

Tags: python pandas dataframe

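【Solution 1】:

Use DataFrame.stack to reshape, convert the months together with the years to a DatetimeIndex, reshape again to remove the NaN rows, and get the minimum datetime per Customer: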

import numpy as np
import pandas as pd

# Sample data: columns are (Customer, Year of payment) pairs, rows are months
d = {('A', '2020'): {'january__c': np.nan, 'february__c': np.nan, 'march__c': 20.0},
     ('A', '2021'): {'january__c': 14.0, 'february__c': 20.0, 'march__c': np.nan},
     ('B', '2021'): {'january__c': np.nan, 'february__c': 30.0, 'march__c': 30.0},
     ('C', '2020'): {'january__c': np.nan, 'february__c': np.nan, 'march__c': np.nan}}
df = pd.DataFrame(d).rename_axis(['Customer','Year of payment'], axis=1)

print (df)
Customer            A           B    C
Year of payment  2020  2021  2021 2020
january__c        NaN  14.0   NaN  NaN
february__c       NaN  20.0  30.0  NaN
march__c         20.0   NaN  30.0  NaN

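# Move the year level from the columns into the index: rows become (month, year) pairs
df = df.stack()
# Parse labels such as 'march__c 2020' into a DatetimeIndex
df.index = pd.to_datetime(df.index.map(lambda x: f'{x[0]} {x[1]}'), format='%B__c %Y')

# Stack the customers, drop months without a payment, keep the earliest date per customer
# (the explicit dropna() is a no-op where stack() already drops NaN values)
s = (df.stack()
       .dropna()
       .reset_index()
       .groupby('Customer')['level_0'].min()
       .dt.strftime('%B %Y')
       .reindex(df.columns.unique()))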

df = s.rename('first payment').to_frame().T
print (df)
Customer                A              B    C
first payment  March 2020  February 2021  NaN
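
As a cross-check, the same result can probably be reached with a single reshape into a long ("tidy") table, as suggested in the question comments. This is only a sketch on the sample data above, not part of the original answer; the names long_df, first and the column labels 'Year', 'Month' and 'amount' are made up for illustration, and English month names (locale) are assumed.

```python
import numpy as np
import pandas as pd

# Same sample layout as in the answer: columns are (Customer, Year), rows are months.
d = {('A', '2020'): {'january__c': np.nan, 'february__c': np.nan, 'march__c': 20.0},
     ('A', '2021'): {'january__c': 14.0, 'february__c': 20.0, 'march__c': np.nan},
     ('B', '2021'): {'january__c': np.nan, 'february__c': 30.0, 'march__c': 30.0},
     ('C', '2020'): {'january__c': np.nan, 'february__c': np.nan, 'march__c': np.nan}}
df = pd.DataFrame(d).rename_axis(['Customer', 'Year of payment'], axis=1)

# One row per (customer, year, month) cell; cells without a payment are dropped.
long_df = df.unstack().dropna().rename('amount').reset_index()
long_df.columns = ['Customer', 'Year', 'Month', 'amount']  # illustrative names

# 'march__c' -> 'March', then combine with the year and parse as a date.
month = long_df['Month'].str.replace('__c', '', regex=False).str.title()
long_df['date'] = pd.to_datetime(month + ' ' + long_df['Year'], format='%B %Y')

# Earliest payment per customer; customers with no payment at all get '-'.
customers = df.columns.get_level_values('Customer').unique()
first = (long_df.groupby('Customer')['date'].min()
                .dt.strftime('%B %Y')
                .reindex(customers)
                .fillna('-'))
print(first)
```

The fillna('-') step is there only to match the placeholder used in the question's desired output; dropping it leaves NaN for customers without any payment, as in the answer above.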

【Discussion】:

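Thanks for your answer! Unfortunately I could not get it to work yesterday. It returns the error 'ValueError: time data 'Customer 0' does not match format '%B %Y' (match)'.
@sonnen_flo - can you test format='%B__c %Y')?
No, it doesn't work. I think the problem might be that the lambda function is not pointing at the right strings. df.stack() returns the following: Customer 0 A 1 B 2 C 3 D ... december 6993 353.03 6995 23.47 7001 119.07
@sonnen_flo - I added the sample data from the question, and it works fine for me. If you get the error 'ValueError: time data 'Customer 0' does not match format '%B %Y' (match)', it means the month format is different, e.g. january__c, february__c ...
Yes, your example works, thanks. Unfortunately it does not work with my data. I think it has to do with the format of the dataframe. I found that I had not assigned the index correctly, which I have now done. df.columns and df.index return exactly the same results as in the example. But now, when I use stack(), I get the error "cannot reindex from a duplicate axis".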
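
The thread ends without resolving that last error. The df.info() output in the question suggests the real frame has (customer id, year) pairs in the row index and the months as columns, with year labels written inconsistently ('2021' and '2021.0'). The sketch below assumes that layout and avoids stack() by melting into a long table, so it does not depend on the index labels being unique. Whether duplicate labels are the actual cause of the error is an assumption; the customer ids, the level names 'Customer' and 'Year', and the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the layout described in the question:
# rows are (customer id, year) pairs, columns are months with a '__c' suffix.
# The year labels are deliberately inconsistent ('2021' vs '2021.0').
index = pd.MultiIndex.from_tuples(
    [('cust_a', '2020'), ('cust_a', '2021'), ('cust_a', '2021.0'),
     ('cust_b', '2021'), ('cust_c', '2020')],
    names=['Customer', 'Year'])
df = pd.DataFrame(
    [[np.nan, np.nan, 20.0],
     [14.0, 20.0, np.nan],
     [np.nan, 5.0, np.nan],
     [np.nan, 30.0, 30.0],
     [np.nan, np.nan, np.nan]],
    index=index, columns=['January__c', 'February__c', 'March__c'])

# Turn the index levels into ordinary columns, so repeated labels are harmless.
flat = df.reset_index()
# Normalise the year labels: '2021' and '2021.0' both become the integer 2021.
flat['Year'] = pd.to_numeric(flat['Year']).astype(int)

# Long format: one row per (customer, year, month) that actually has a payment.
long_df = (flat.melt(id_vars=['Customer', 'Year'],
                     var_name='Month', value_name='amount')
               .dropna(subset=['amount']))
long_df['Month'] = long_df['Month'].str.replace('__c', '', regex=False)
long_df['date'] = pd.to_datetime(
    long_df['Month'] + ' ' + long_df['Year'].astype(str), format='%B %Y')

# Earliest payment per customer; '-' marks customers without any payment.
first = (long_df.groupby('Customer')['date'].min()
                .dt.strftime('%B %Y')
                .reindex(flat['Customer'].unique())
                .fillna('-'))
print(first)
```

On the real data, checking df.index.duplicated().any() before reshaping would show whether repeated (customer, year) pairs are present at all.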