What is the most efficient way to fill missing values in this data frame? Answers

【Question title】: What is the most efficient way to fill missing values in this data frame?
【Posted】: 2020-10-30 06:02:48
【Question description】:

I have the following pandas DataFrame:

import pandas as pd

df = pd.DataFrame([
    ['A', 2017, 1],
    ['A', 2019, 1],
    ['B', 2017, 1],
    ['B', 2018, 1],
    ['C', 2016, 1],
    ['C', 2019, 1],
], columns=['ID', 'year', 'number'])

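and I am looking for the most efficient way to fill in the missing years, with a default value of 0 for the number column.

The expected output is: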

  ID  year  number
0  A  2017       1
1  A  2018       0
2  A  2019       1
3  B  2017       1
4  B  2018       1
5  C  2016       1
6  C  2017       0
7  C  2018       0
8  C  2019       1

The DataFrame I have is relatively large, so I am looking for an efficient solution.

Edit:

This is my current code:

# Per-ID first and last year; string names avoid passing the Python builtins min/max to agg
min_max_dict = df[['ID', 'year']].groupby('ID').agg(['min', 'max']).to_dict('index')

new_ix = [[], []]
for id_ in df['ID'].unique():
    for year in range(min_max_dict[id_][('year', 'min')], min_max_dict[id_][('year', 'max')]+1): 
        new_ix[0].append(id_)
        new_ix[1].append(year)


df.set_index(['ID', 'year'], inplace=True)
df = df.reindex(new_ix, fill_value=0).reset_index()

Result:

  ID  year  number
0  A  2017       1
1  A  2018       0
2  A  2019       1
3  B  2017       1
4  B  2018       1
5  C  2016       1
6  C  2017       0
7  C  2018       0
8  C  2019       1
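
The loop above builds the (ID, year) index one pair at a time. As a rough sketch of how the same index could be built with NumPy array operations instead, the following assumes that each (ID, year) pair occurs at most once (reindex requires a unique index). The function name fill_year_gaps and its local variable names are illustrative and not part of the original post.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([
    ['A', 2017, 1],
    ['A', 2019, 1],
    ['B', 2017, 1],
    ['B', 2018, 1],
    ['C', 2016, 1],
    ['C', 2019, 1],
], columns=['ID', 'year', 'number'])

def fill_year_gaps(frame):
    # Hypothetical helper: assumes unique (ID, year) pairs.
    # Per-ID first and last year.
    bounds = frame.groupby('ID')['year'].agg(['min', 'max'])
    lengths = (bounds['max'] - bounds['min'] + 1).to_numpy()
    # Repeat each ID once per year in its own range.
    ids = np.repeat(bounds.index.to_numpy(), lengths)
    # Offsets 0..length-1 within each ID, built without a Python loop.
    starts = np.repeat(np.cumsum(lengths) - lengths, lengths)
    offsets = np.arange(lengths.sum()) - starts
    years = np.repeat(bounds['min'].to_numpy(), lengths) + offsets
    full_index = pd.MultiIndex.from_arrays([ids, years], names=['ID', 'year'])
    # Rows that did not exist in the input get number = 0.
    return (frame.set_index(['ID', 'year'])
                 .reindex(full_index, fill_value=0)
                 .reset_index())

result = fill_year_gaps(df)
print(result)
```

Whether this is actually faster than the loop depends on the number of IDs and the width of their ranges, so it is worth timing on the real data.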

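【Comments on the question】:

Possible duplicate: stackoverflow.com/a/19324591/4985099
@Sushanth I thought so at first, but not quite. There is a catch: ID A, 2016 should not be inserted; only the later years should be.
@Sushanth The problem is that I have multiple IDs with different date ranges (ID A is 2017-2019, ID B is 2017-2018).
@SebastienD I have edited the original post.

Tags: python pandas

【Solution 1】:

This works, but it creates a "2019" entry for "B":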

df.pivot(index='ID', columns='year', values='number').fillna(0).stack().to_frame('number')

Returns:

         number
ID year        
A  2016     0.0
   2017     1.0
   2018     0.0
   2019     1.0
B  2016     0.0
   2017     1.0
   2018     1.0
   2019     0.0
C  2016     1.0
   2017     0.0
   2018     0.0
   2019     1.0

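【Comments】:

Unfortunately, this is not the kind of solution I am looking for, because it assumes every ID covers the same range of years. It includes not only 2019 for B, but also 2016 for A.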
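
One way the pivot result might be adapted to address this objection is to trim the full grid back to each ID's own first and last year. The sketch below is not from the original answer; the names grid, bounds, year_min, year_max and trimmed are illustrative.

```python
import pandas as pd

df = pd.DataFrame([
    ['A', 2017, 1],
    ['A', 2019, 1],
    ['B', 2017, 1],
    ['B', 2018, 1],
    ['C', 2016, 1],
    ['C', 2019, 1],
], columns=['ID', 'year', 'number'])

# Full ID x year grid, built the same way as in the pivot answer.
grid = (df.pivot(index='ID', columns='year', values='number')
          .fillna(0)
          .stack()
          .rename('number')
          .reset_index())

# First and last observed year for each ID.
bounds = (df.groupby('ID')['year']
            .agg(year_min='min', year_max='max')
            .reset_index())

# Keep only the years that fall inside each ID's own range.
trimmed = grid.merge(bounds, on='ID')
trimmed = trimmed[trimmed['year'].between(trimmed['year_min'], trimmed['year_max'])]
trimmed = (trimmed[['ID', 'year', 'number']]
           .astype({'number': int})
           .sort_values(['ID', 'year'])
           .reset_index(drop=True))
print(trimmed)
```

This still materializes the full grid first, so it uses more memory than approaches that only generate the rows each ID needs.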

【Solution 2】:

Here is one way:

letter_keys = df.ID.unique()
data = df.values
missing_records = []
for letter in letter_keys:
    print(letter)
    years = [x[1] for x in data if x[0] == letter]
    min_year = min(years)
    max_year = max(years)
    current_year = min_year
    while current_year<max_year:
        if current_year not in years:
            missing_records.append([letter, current_year,0])
            print('missing', current_year)
        current_year +=1

# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
new_df = pd.concat([df, pd.DataFrame(missing_records, columns=df.columns)]).sort_values(['ID','year'])

Output:

| ID   |   year |   number |
|:-----|-------:|---------:|
| A    |   2017 |        1 |
| A    |   2018 |        0 |
| A    |   2019 |        1 |
| B    |   2017 |        1 |
| B    |   2018 |        1 |
| C    |   2016 |        1 |
| C    |   2017 |        0 |
| C    |   2018 |        0 |
| C    |   2019 |        1 |

【Comments】:

【Solution 3】:

t = df.groupby('ID')['year'].agg(['min','max']).reset_index()
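# Check against the (ID, year) pairs that really exist, not only each ID's min and max
present = set(zip(df['ID'], df['year']))
t['missing'] = t.apply(lambda x: [y for y in range(x['min'], x['max']+1) if (x['ID'], y) not in present], axis=1)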
t = t[['ID','missing']].explode('missing').dropna()
t['number'] = 0
t.columns = ['ID','year','number']
pd.concat([df,t]).sort_values(by=['ID','year'])

Output:

    ID  year    number
0   A   2017    1
0   A   2018    0
1   A   2019    1
2   B   2017    1
3   B   2018    1
4   C   2016    1
2   C   2017    0
2   C   2018    0
5   C   2019    1

【Comments】:

【Solution 4】:

Here is another approach using reindex:

u = df.groupby('ID')['year'].apply(lambda x: range(x.min(),x.max()+1)).explode()

out = (df.set_index(['ID','year']).reindex(u.reset_index().to_numpy(),fill_value=0)
         .reset_index())

  ID  year  number
0  A  2017       1
1  A  2018       0
2  A  2019       1
3  B  2017       1
4  B  2018       1
5  C  2016       1
6  C  2017       0
7  C  2018       0
8  C  2019       1

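【Comments】:

【Solution 5】:

A slightly faster approach than explode is to use the pd.Series constructor. If the years are already sorted from earliest to latest within each ID, you can use .iloc.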

idx = df.groupby('ID')['year'].apply(lambda x: pd.Series(np.arange(x.iloc[0], x.iloc[-1]+1))).reset_index()
df.set_index(['ID','year']).reindex(pd.MultiIndex.from_arrays([idx['ID'], idx['year']]), fill_value=0).reset_index()

Output:

  ID  year  number
0  A  2017       1
1  A  2018       0
2  A  2019       1
3  B  2017       1
4  B  2018       1
5  C  2016       1
6  C  2017       0
7  C  2018       0
8  C  2019       1
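
The .iloc[0] and .iloc[-1] lookups above only give the first and last year when the rows are sorted by year within each ID. As a hedged variant for unsorted input, the sketch below swaps them for x.min() and x.max(); the shuffled sample data and the variable names idx and out are illustrative, and (ID, year) pairs are assumed to be unique.

```python
import numpy as np
import pandas as pd

# Same rows as in the question, deliberately shuffled.
df = pd.DataFrame([
    ['C', 2019, 1],
    ['A', 2019, 1],
    ['B', 2018, 1],
    ['A', 2017, 1],
    ['C', 2016, 1],
    ['B', 2017, 1],
], columns=['ID', 'year', 'number'])

# min()/max() do not depend on row order, unlike iloc[0]/iloc[-1].
idx = (df.groupby('ID')['year']
         .apply(lambda x: pd.Series(np.arange(x.min(), x.max() + 1)))
         .rename('year')
         .reset_index(level=0))

out = (df.set_index(['ID', 'year'])
         .reindex(pd.MultiIndex.from_arrays([idx['ID'], idx['year']]), fill_value=0)
         .reset_index())
print(out)
```

Computing min and max costs a little more per group than positional indexing, so the .iloc version remains preferable when the data is known to be sorted.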

【Comments】:

@anky On this example, explode takes 17.2 ms ± 332 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) and the pd.Series constructor takes 13.1 ms ± 208 µs per loop (mean ± std. dev. of 7 runs, 100 loops each).

【Solution 6】:

You can try using date_range and pd.merge:

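# A list of names replaces the dict renamer, which pandas >= 1.0 rejects on a SeriesGroupBy
g = df.groupby("ID")["year"].agg(["min", "max"]).reset_index()
# [row["ID"]] keeps multi-character IDs intact (list("AB") would split into characters)
id_years = pd.DataFrame(list(g.apply(lambda row: [row["ID"]] + 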
                    list(pd.date_range(start=f"01/01/{row['min']}", \
                    end=f"01/01/{row['max']+1}",freq='12M').year), axis=1))).melt(0).dropna()[[0,"value"]]

id_years.loc[:,"value"] = id_years["value"].astype(int)
id_years = id_years.rename(columns = {0:"ID","value":'year'})
id_years = id_years.sort_values(["ID","year"]).reset_index(drop=True)

## Merge the two DataFrames
output_df = pd.merge(id_years, df, on=["ID","year"], how="left").fillna(0)
output_df.loc[:,"number"] = output_df["number"].astype(int)
output_df

Output:

    ID  year    number
0   A   2017    1
1   A   2018    0
2   A   2019    1
3   B   2017    1
4   B   2018    1
5   C   2016    1
6   C   2017    0
7   C   2018    0
8   C   2019    1

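【Comments】:

【Solution 7】:

Here is an approach that avoids any slow apply with a lambda. It is memory-inefficient in the sense that we create a base DataFrame that is the cross product of all IDs and the full range of years in the DataFrame. After the update, we can efficiently subset it to the periods you need using a Boolean mask. The mask is created with cummax checks in the forward and reverse directions, so a row is kept only if its ID already has an observed row at or before that year and another at or after it.

If most IDs span roughly the same range of years, not much is wasted by creating the base DataFrame from the product. If you want even more performance, there are many posts about more efficient ways to do a cross-product.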

def Alollz(df):
    idx = pd.MultiIndex.from_product([np.unique(df['ID']), 
                                      np.arange(df['year'].min(), df['year'].max()+1)],
                                     names=['ID', 'year'])
   
    df_b = pd.DataFrame({'number': 0}, index=idx)
    df_b.update(df.set_index(['ID', 'year']))
    
    m = (df_b.groupby(level=0)['number'].cummax().eq(1) 
         & df_b[::-1].groupby(level=0)['number'].cummax().eq(1))
    
    return df_b.loc[m].reset_index()

Alollz(df)

  ID  year  number
0  A  2017     1.0
1  A  2018     0.0
2  A  2019     1.0
3  B  2017     1.0
4  B  2018     1.0
5  C  2016     1.0
6  C  2017     0.0
7  C  2018     0.0
8  C  2019     1.0
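
The mask above tests cummax().eq(1), so it relies on every existing row having number equal to 1, and update turns the column into floats. As a sketch of how the same forward and reverse cummax idea might be applied to arbitrary values, the variant below tracks a separate presence flag instead. The function name fill_with_presence_mask, the flag series present and the sample values are illustrative, and (ID, year) pairs are assumed to be unique.

```python
import numpy as np
import pandas as pd

# Same shape as the question's data, but with values other than 1.
df = pd.DataFrame([
    ['A', 2017, 5],
    ['A', 2019, 7],
    ['B', 2017, 2],
    ['B', 2018, 3],
    ['C', 2016, 4],
    ['C', 2019, 9],
], columns=['ID', 'year', 'number'])

def fill_with_presence_mask(frame):
    # Hypothetical helper: cross product of all IDs and the global year range.
    idx = pd.MultiIndex.from_product(
        [np.unique(frame['ID']),
         np.arange(frame['year'].min(), frame['year'].max() + 1)],
        names=['ID', 'year'])
    existing = frame.set_index(['ID', 'year'])
    # Rows absent from the input get number = 0; dtype stays integer.
    base = existing.reindex(idx, fill_value=0)
    # 1 where the (ID, year) pair existed in the input, 0 otherwise.
    present = pd.Series(idx.isin(existing.index).astype(int), index=idx)
    # Forward pass: has this ID been seen at or before this year?
    seen_before = present.groupby(level='ID').cummax().eq(1)
    # Reverse pass, flipped back to the original order: seen at or after?
    seen_after = present[::-1].groupby(level='ID').cummax()[::-1].eq(1)
    return base.loc[seen_before & seen_after].reset_index()

filled = fill_with_presence_mask(df)
print(filled)
```

The memory trade-off is the same as in the answer: the full ID by year grid is built before it is trimmed.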

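This is certainly a lot more code than some of the other proposals. But to see where it really shines, let's create some dummy data with 50K IDs (here I give every ID the same date range, just to keep the test data simple to create).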

N = 50000
df = pd.DataFrame({'ID': np.repeat(range(N), 2),
                   'year': np.tile([2010,2018], N),
                   'number': 1})

#@Scott Boston's Answer
def SB(df):
    idx = df.groupby('ID')['year'].apply(lambda x: pd.Series(np.arange(x.iloc[0], x.iloc[-1]+1))).reset_index()
    df = df.set_index(['ID','year']).reindex(pd.MultiIndex.from_arrays([idx['ID'], idx['year']]), fill_value=0).reset_index()
    return df

# Make sure they give the same output:
(Alollz(df) == SB(df)).all().all()
#True

%timeit Alollz(df)
#1.9 s ± 73.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit SB(df)
#10.8 s ± 539 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

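So this is about 5 times faster, which is a fairly big deal when things take on the order of seconds.

【Comments】:

【Solution 8】:

We can use the complete function from pyjanitor, which provides a convenient abstraction for generating the missing rows, in this case per ID group: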

# pip install pyjanitor
import pandas as pd
import janitor as jn

# create mapping for range of years
years = dict(year = lambda year: range(year.min(), year.max() + 1))

# apply the complete function per group and fill the nulls with 0

df.complete(years, by = 'ID', sort = True).fillna(0, downcast = 'infer')
 
  ID  year  number
0  A  2017       1
1  A  2018       0
2  A  2019       1
3  B  2017       1
4  B  2018       1
5  C  2016       1
6  C  2017       0
7  C  2018       0
8  C  2019       1

However, by is primarily a convenience; in some cases it can be more efficient to do a bit more work by hand, similar to @Alollz's solution:


# get the mapping for the year for the entire dataframe
years = dict(year =  range(df.year.min(), df.year.max() + 1))

# create a groupby
group = df.groupby('ID').year

#  assign the max and min years to the dataframe
(df.assign(year_max = group.transform('max'), 
           year_min = group.transform('min'))
     # run complete on the entire dataframe, without `by`
    # note that ID, year_min, year_max are grouped together
    # think of it as a DataFrame of just these three columns
    # combined with years .. we are not modifying these three columns
    # only the years 
   .complete(years, ('ID', 'year_min', 'year_max'))
    # filter rows where year is between max and min
   .loc[lambda df: df.year.between(df.year_min, df.year_max), 
        df.columns]
    # sort the values and fillna
   .sort_values([*df], ignore_index = True)
   .fillna(0, downcast = 'infer')
)
 
  ID  year  number
0  A  2017       1
1  A  2018       0
2  A  2019       1
3  B  2017       1
4  B  2018       1
5  C  2016       1
6  C  2017       0
7  C  2018       0
8  C  2019       1
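
For readers without pyjanitor, the same "expand everything, then filter to each ID's range" idea can be approximated in plain pandas with a cross merge. This is only a sketch, not the pyjanitor implementation; the names bounds, all_years, grid and out are illustrative, and how='cross' needs pandas 1.2 or later.

```python
import pandas as pd

df = pd.DataFrame([
    ['A', 2017, 1],
    ['A', 2019, 1],
    ['B', 2017, 1],
    ['B', 2018, 1],
    ['C', 2016, 1],
    ['C', 2019, 1],
], columns=['ID', 'year', 'number'])

# First and last observed year for each ID.
bounds = (df.groupby('ID')['year']
            .agg(year_min='min', year_max='max')
            .reset_index())

# Every year in the global range, as a one-column frame.
all_years = pd.DataFrame({'year': range(df['year'].min(), df['year'].max() + 1)})

# Pair every ID with every year, then keep only years inside the ID's range.
grid = bounds.merge(all_years, how='cross')
grid = grid[grid['year'].between(grid['year_min'], grid['year_max'])]

# Attach the observed values; years with no observation get number = 0.
out = (grid[['ID', 'year']]
       .merge(df, on=['ID', 'year'], how='left')
       .fillna({'number': 0})
       .astype({'number': int})
       .sort_values(['ID', 'year'], ignore_index=True))
print(out)
```

As with the answer's second variant, the intermediate grid has one row per ID per year in the global range, so it pays off mainly when most IDs cover a similar span.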

Using @Alollz's sample data:

N = 50000
df = pd.DataFrame({'ID': np.repeat(range(N), 2),
                   'year': np.tile([2010,2018], N),
                   'number': 1})

def complete_sam(df):
    years = dict(year =  range(df.year.min(), df.year.max() + 1))
    group = df.groupby('ID').year
    outcome = (df.assign(year_max = group.transform('max'),
                         year_min = group.transform('min'))
                 .complete(years, ('ID', 'year_min', 'year_max'))
                 .loc[lambda df: df.year.between(df.year_min, 
                                                 df.year_max),
                     df.columns]
                 .sort_values([*df], ignore_index = True)
                 .fillna(0)
              )
    return outcome

#@Scott Boston's Answer
def SB(df):
    idx = df.groupby('ID')['year'].apply(lambda x: pd.Series(np.arange(x.iloc[0], x.iloc[-1]+1))).reset_index()
    df = df.set_index(['ID','year']).reindex(pd.MultiIndex.from_arrays([idx['ID'], idx['year']]), fill_value=0).reset_index()
    return df

#@Alollz's answer
def Alollz(df):
    idx = pd.MultiIndex.from_product([np.unique(df['ID']), 
                                      np.arange(df['year'].min(), df['year'].max()+1)],
                                     names=['ID', 'year'])
   
    df_b = pd.DataFrame({'number': 0}, index=idx)
    df_b.update(df.set_index(['ID', 'year']))
    
    m = (df_b.groupby(level=0)['number'].cummax().eq(1) 
         & df_b[::-1].groupby(level=0)['number'].cummax().eq(1))
    
    return df_b.loc[m].reset_index()

In [310]: Alollz(df).equals(complete_sam(df))
Out[310]: True

In [311]: %timeit complete_sam(df)
268 ms ± 24.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [312]: %timeit Alollz(df)
1.84 s ± 58.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [316]: SB(df).eq(complete_sam(df)).all().all()
Out[316]: True

In [317]: %timeit SB(df)
6.13 s ± 87.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

【Comments】: