Locate first and last non-NaN values in a Pandas DataFrame: answers

【Question title】: Locate first and last non NaN values in a Pandas DataFrame
【Posted】: 2014-04-19 15:41:40
【Question】:

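I have a Pandas DataFrame indexed by date. It has a number of columns, but many of them are populated for only part of the time series. I'd like to find where the first and last non-NaN values are located, so that I can extract the dates and see how long the time series is for a particular column.

Could somebody point me in the right direction as to how I could go about doing something like this? Thanks in advance.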

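【Comments】:

first_valid_index and last_valid_index
Thanks @behzad.nouri, this is exactly what I wanted!
Is there a solution for when the missing values might be "0" (i.e. finding the first non-zero value, per group/time series)?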
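
For the zero-as-missing case raised in the last comment, one possible approach is to turn the zeros into NaN first and then reuse the same two methods. This is only a sketch: it assumes that a value of exactly 0 means "no data", and the sample series is made up for illustration.

```python
import pandas as pd

# Made-up sample: zeros stand in for missing observations.
s = pd.Series([0, 0, 2, 5, 0], index=list("abcde"))

# Assumption: 0 means "no data". mask() replaces matching entries with NaN.
nonzero = s.mask(s == 0)

first = nonzero.first_valid_index()  # label of the first non-zero value
last = nonzero.last_valid_index()    # label of the last non-zero value
print(first, last)
```

For several columns at once, the same idea should carry over to `df.mask(df == 0).apply(pd.Series.first_valid_index)`.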

Tags: python datetime pandas

【Solution 1】:

@behzad.nouri's solution worked perfectly: Series.first_valid_index and Series.last_valid_index return the index labels of the first and last non-NaN values, respectively.
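
As a minimal sketch of how this answers the original question, the snippet below applies both methods to one column of a date-indexed DataFrame and subtracts the two labels to get the length of the populated span. The column name `price` and the sample data are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical date-indexed frame; "price" is only partly populated.
idx = pd.date_range("2014-01-01", periods=6, freq="D")
df = pd.DataFrame({"price": [np.nan, 1.0, np.nan, 3.0, 4.0, np.nan]}, index=idx)

first = df["price"].first_valid_index()  # date of the first non-NaN value
last = df["price"].last_valid_index()    # date of the last non-NaN value

# With a DatetimeIndex both labels are Timestamps, so the difference is a Timedelta.
span = last - first
print(first, last, span)
```

Note that NaN gaps inside the span (here 2014-01-03) do not affect the result; only the outermost valid values matter.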

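【Discussion】:

@KorayTugay A pandas Series is one-dimensional (i.e. a single column). If you want to check more columns in your df, you can iterate over the df. E.g. for col_name, data in df.items(): print("First valid index for column {} is at {}".format(col_name, data.first_valid_index()))
Instead of iterating over the DataFrame's columns, you can use df.apply(pd.Series.first_valid_index).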

【Solution 2】:

Here are some helpful examples.

Series

# np.nan is used because the np.NaN alias was removed in NumPy 2.0
s = pd.Series([np.nan, 1, np.nan, 3, np.nan], index=list('abcde'))
s

a    NaN
b    1.0
c    NaN
d    3.0
e    NaN
dtype: float64

# first valid index
s.first_valid_index()
# 'b'

# first valid position
s.index.get_loc(s.first_valid_index())
# 1

# last valid index
s.last_valid_index()
# 'd'

# last valid position
s.index.get_loc(s.last_valid_index())
# 3
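
One edge case worth keeping in mind: on a Series with no valid values, first_valid_index returns None, and passing None to get_loc would raise a KeyError. A small guard like the one below avoids that. The helper name `first_valid_position` is hypothetical, and the sketch assumes a unique index (with duplicate labels, get_loc may return a slice or a boolean mask instead of an integer).

```python
import numpy as np
import pandas as pd

def first_valid_position(series):
    """Return the integer position of the first non-NaN value, or None if there is none."""
    label = series.first_valid_index()
    if label is None:  # the series is empty or entirely NaN
        return None
    return series.index.get_loc(label)

s = pd.Series([np.nan, 1, np.nan, 3, np.nan], index=list("abcde"))
empty = pd.Series([np.nan, np.nan], index=list("xy"))

print(first_valid_position(s))
print(first_valid_position(empty))
```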

An alternative solution using notna and idxmax:

# first valid index
s.notna().idxmax()
# 'b'

# last valid index
s.notna()[::-1].idxmax()
# 'd'

DataFrame

df = pd.DataFrame({
    'A': [np.nan, 1, np.nan, 3, np.nan],
    'B': [1, np.nan, np.nan, np.nan, np.nan]
})
df

     A    B
0  NaN  1.0
1  1.0  NaN
2  NaN  NaN
3  3.0  NaN
4  NaN  NaN

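Called directly on a DataFrame, (first|last)_valid_index works row-wise (it returns the first or last row label where at least one column is non-NaN), so to get a result per column, use apply to run them on each column.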

# first valid index for each column
df.apply(pd.Series.first_valid_index)

A    1
B    0
dtype: int64

# last valid index for each column
df.apply(pd.Series.last_valid_index)

A    3
B    0
dtype: int64

As before, you can also use notna and idxmax. This is arguably the more natural syntax.

# first valid index
df.notna().idxmax()

A    1
B    0
dtype: int64

# last valid index
df.notna()[::-1].idxmax()

A    3
B    0
dtype: int64

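【Discussion】:

The problem with idxmax() is that it returns 0 (the first index label here) for a column that is entirely NaN. In that case I'd want NaN, so I'd rather always use .apply(pd.Series.first_valid_index).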
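
The difference described in this comment can be reproduced with a small frame in which one column is entirely NaN. The snippet is a sketch on made-up data; the last line shows one possible way to keep the idxmax syntax while blanking out columns that contain no valid values.

```python
import numpy as np
import pandas as pd

# Made-up frame: column B contains no valid values at all.
df = pd.DataFrame({"A": [np.nan, 1.0, np.nan], "B": [np.nan, np.nan, np.nan]})

has_data = df.notna()

# idxmax on an all-False column falls back to the first index label.
via_idxmax = has_data.idxmax()

# first_valid_index reports "nothing found" for column B instead.
via_apply = df.apply(pd.Series.first_valid_index)

# Possible workaround: keep idxmax, but mask columns without any valid value.
masked = has_data.idxmax().where(has_data.any())

print(via_idxmax, via_apply, masked, sep="\n\n")
```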

【Solution 3】:

A convenience function based on behzad.nouri's comment and cs95's earlier answer. Any errors or misinterpretations are mine.

import pandas as pd
import numpy as np

df = pd.DataFrame([["2022-01-01", np.nan, np.nan, 1], ["2022-01-02", 2, np.nan, 2], ["2022-01-03", 3, 3, 3], ["2022-01-04", 4, 4, 4], ["2022-01-05", np.nan, 5, 5]], columns=['date', 'A', 'B', 'C'])
df['date'] = pd.to_datetime(df['date'])

df
#        date    A    B  C
#0 2022-01-01  NaN  NaN  1
#1 2022-01-02  2.0  NaN  2
#2 2022-01-03  3.0  3.0  3
#3 2022-01-04  4.0  4.0  4
#4 2022-01-05  NaN  5.0  5

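We want to start at the earliest date that A and B have in common and end at the latest date that A and B have in common (for whatever reason, we do not filter by column C).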

# filter data to minimum/maximum common available dates
def get_date_range(df, cols):
    """return a tuple of the earliest and latest valid data for all columns in the list"""
    a,b = df[cols].apply(pd.Series.first_valid_index).max(), df[cols].apply(pd.Series.last_valid_index).min()
    return (df.loc[a, 'date'], df.loc[b, 'date'])

a,b = get_date_range(df, cols=['A', 'B'])
a
#Timestamp('2022-01-03 00:00:00')
b
#Timestamp('2022-01-04 00:00:00')

Now filter the data:

df.loc[(df.date >= a) & (df.date <= b)]
#        date    A    B  C
#2 2022-01-03  3.0  3.0  3
#3 2022-01-04  4.0  4.0  4
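
Since the original question describes a DataFrame indexed by date, the same idea can be sketched without a separate date column. The variant below is an illustration, not part of the answer above: the function name `get_common_range` is hypothetical, and it assumes a sorted DatetimeIndex. It also returns None when a column has no valid values or when the columns do not overlap, cases the function above does not handle.

```python
import numpy as np
import pandas as pd

# Same data as above, but with the dates as the index (an assumption of this sketch).
df = pd.DataFrame(
    {"A": [np.nan, 2, 3, 4, np.nan], "B": [np.nan, np.nan, 3, 4, 5], "C": [1, 2, 3, 4, 5]},
    index=pd.date_range("2022-01-01", periods=5, freq="D"),
)

def get_common_range(df, cols):
    """Return (start, end) index labels where all cols overlap, or None if they never do."""
    firsts = [df[c].first_valid_index() for c in cols]
    lasts = [df[c].last_valid_index() for c in cols]
    if any(label is None for label in firsts):  # some column is entirely NaN
        return None
    start, end = max(firsts), min(lasts)
    return (start, end) if start <= end else None

common = get_common_range(df, ["A", "B"])
# Label-based slicing on a DatetimeIndex includes both endpoints.
window = df.loc[common[0]:common[1]]
print(window)
```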

【Discussion】: