【Question title】: backfill pandas dataframe column using a condition
【Posted】: 2019-06-09 01:26:38
【Question description】:

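I have a pandas dataframe with 50 million records, and I want to backfill a column based on a condition. As the data below shows, the timestamps for the names 800A and Barber line up, so I assume the rows belong to the same name and the difference is just a recording error. The same applies to the name Mia.

This is only sample data.

My dataframe looks like this.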

datetime        name     dischargeDate      HR   Sp  x_inc  vs_inc  rec_num
01-05 18:04:50  Zawisza  14-01-05 18:05:00  119  98  FALSE  TRUE    6458445
01-05 18:04:55  Zawisza  14-01-05 18:05:00  120  97  FALSE  TRUE    6458445
01-05 18:05:00  Zawisza  14-01-05 18:05:00           FALSE  FALSE
01-29 17:58:45  800A     14-01-29 17:59:10           FALSE  FALSE
01-29 17:58:50  800A     14-01-29 17:59:10  139      FALSE  TRUE
01-29 17:58:55  800A     14-01-29 17:59:10  138      FALSE  TRUE
01-29 17:59:00  800A     14-01-29 17:59:10  138  96  FALSE  TRUE
01-29 17:59:15  Barber   14-01-29 18:17:15  138  96  FALSE  TRUE    7192783
01-29 17:59:20  Barber   14-01-29 18:17:15  138  96  FALSE  TRUE    7192783
01-29 17:59:25  Barber   14-01-29 18:17:15  138  95  FALSE  TRUE    7192783
03-04 21:19:45  800A     15-03-05 01:00:15           FALSE  FALSE
03-05 00:53:10  800A     15-03-05 01:00:15           FALSE  FALSE
03-05 00:55:50  800A     15-03-05 01:00:15       94  FALSE  TRUE
03-05 00:55:55  800A     15-03-05 01:00:15  81   93  FALSE  TRUE
03-05 00:56:00  800A     15-03-05 01:00:15  89   93  FALSE  TRUE
03-05 01:00:20  Mia      15-03-05 04:13:15  70   93  FALSE  TRUE    6728923
03-05 01:00:25  Mia      15-03-05 04:13:15  70   93  FALSE  TRUE    6728923
03-05 01:00:30  Mia      15-03-05 04:13:15  70   94  FALSE  TRUE    6728923

Now I am trying to backfill the record number (rec_num) column, but only up to the rows where x_inc and vs_inc are both False.

Actual output:

datetime        name     dischargeDate      HR   Sp  x_inc  vs_inc  rec_num
01-05 18:04:50  Zawisza  14-01-05 18:05:00  119  98  FALSE  TRUE    6458445
01-05 18:04:55  Zawisza  14-01-05 18:05:00  120  97  FALSE  TRUE    6458445
01-05 18:05:00  Zawisza  14-01-05 18:05:00           FALSE  FALSE   7192783
01-29 17:58:45  800A     14-01-29 17:59:10           FALSE  FALSE   7192783
01-29 17:58:50  800A     14-01-29 17:59:10  139      FALSE  TRUE    7192783
01-29 17:58:55  800A     14-01-29 17:59:10  138      FALSE  TRUE    7192783
01-29 17:59:00  800A     14-01-29 17:59:10  138  96  FALSE  TRUE    7192783
01-29 17:59:15  Barber   14-01-29 18:17:15  138  96  FALSE  TRUE    7192783
01-29 17:59:20  Barber   14-01-29 18:17:15  138  96  FALSE  TRUE    7192783
01-29 17:59:25  Barber   14-01-29 18:17:15  138  95  FALSE  TRUE    7192783
03-04 21:19:45  800A     15-03-05 01:00:15           FALSE  FALSE   6728923
03-05 00:53:10  800A     15-03-05 01:00:15           FALSE  FALSE   6728923
03-05 00:55:50  800A     15-03-05 01:00:15       94  FALSE  TRUE    6728923
03-05 00:55:55  800A     15-03-05 01:00:15  81   93  FALSE  TRUE    6728923
03-05 00:56:00  800A     15-03-05 01:00:15  89   93  FALSE  TRUE    6728923
03-05 01:00:20  Mia      15-03-05 04:13:15  70   93  FALSE  TRUE    6728923
03-05 01:00:25  Mia      15-03-05 04:13:15  70   93  FALSE  TRUE    6728923
03-05 01:00:30  Mia      15-03-05 04:13:15  70   94  FALSE  TRUE    6728923

Expected output:

datetime        name     dischargeDate      HR   Sp  x_inc  vs_inc  rec_num
01-05 18:04:50  Zawisza  14-01-05 18:05:00  119  98  FALSE  TRUE    6458445
01-05 18:04:55  Zawisza  14-01-05 18:05:00  120  97  FALSE  TRUE    6458445
01-05 18:05:00  Zawisza  14-01-05 18:05:00           FALSE  FALSE
01-29 17:58:45  800A     14-01-29 17:59:10           FALSE  FALSE
01-29 17:58:50  800A     14-01-29 17:59:10  139      FALSE  TRUE    7192783
01-29 17:58:55  800A     14-01-29 17:59:10  138      FALSE  TRUE    7192783
01-29 17:59:00  800A     14-01-29 17:59:10  138  96  FALSE  TRUE    7192783
01-29 17:59:15  Barber   14-01-29 18:17:15  138  96  FALSE  TRUE    7192783
01-29 17:59:20  Barber   14-01-29 18:17:15  138  96  FALSE  TRUE    7192783
01-29 17:59:25  Barber   14-01-29 18:17:15  138  95  FALSE  TRUE    7192783
03-04 21:19:45  800A     15-03-05 01:00:15           FALSE  FALSE
03-05 00:53:10  800A     15-03-05 01:00:15           FALSE  FALSE
03-05 00:55:50  800A     15-03-05 01:00:15       94  FALSE  TRUE    6728923
03-05 00:55:55  800A     15-03-05 01:00:15  81   93  FALSE  TRUE    6728923
03-05 00:56:00  800A     15-03-05 01:00:15  89   93  FALSE  TRUE    6728923
03-05 01:00:20  Mia      15-03-05 04:13:15  70   93  FALSE  TRUE    6728923
03-05 01:00:25  Mia      15-03-05 04:13:15  70   93  FALSE  TRUE    6728923
03-05 01:00:30  Mia      15-03-05 04:13:15  70   94  FALSE  TRUE    6728923

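I am using df['rec_num'].fillna(method='bfill'), but it fills every row, which is not the result I want. I would appreciate any suggestions on this problem (or a better approach, if there is one). Thanks in advance.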
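To make the problem easy to reproduce, here is a minimal sketch on a hypothetical five-row miniature of the frame (only the three relevant columns, with values taken from the sample rows). It uses `Series.bfill()`, which does the same thing as `fillna(method='bfill')`; recent pandas releases deprecate the `method` argument in favor of it.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the question's frame (illustration only).
df = pd.DataFrame({
    "x_inc":  [False, False, False, False, False],
    "vs_inc": [True, False, False, True, True],
    "rec_num": [6458445, np.nan, np.nan, np.nan, 7192783],
})

# A plain backfill ignores the flag columns entirely,
# so the two FALSE/FALSE rows (index 1 and 2) are filled as well.
filled = df["rec_num"].bfill()
print(filled.tolist())
# [6458445.0, 7192783.0, 7192783.0, 7192783.0, 7192783.0]
```

The rows at index 1 and 2 are the ones that should have stayed empty.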

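【Question discussion】:

  • Could you upload a CSV (the dataset) so this can be reproduced? You can add a link in the question (a Gdrive share)...
  • Could you be clearer about what you expect? For example: fill rec_num (the column) only when the x_inc and vs_inc fields are not FALSE?
  • Hi Andre. Basically, I want to backfill the record number when HR = True and SP = True, or HR = False and SP = True, or HR = True and SP = False, but not when HR = False and SP = False. Let me know if that answers your question.
  • Perfect, I understand! Just one last doubt: is the rec_num field a timestamp of the datetime column?

Tags: python-3.x pandas dataframe data-manipulation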


【Solution 1】:

With a boolean mask and np.where(), you can do it like this:

import numpy as np
cond = (df.x_inc == False) & (df.vs_inc == False)  # boolean mask: True where both columns are False
df['new_rec'] = np.where(~cond, df.rec_num.bfill(), df.rec_num)  # backfill only where the mask is False
print(df)

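Note: You can assign the values back to the existing rec_num column instead of creating a new one; I added a new column so that you can compare. This should also be the fastest approach, since it is vectorized.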

    datetime            name    dischargeDate       HR      Sp      x_inc   vs_inc  rec_num     new_rec
0   2019-05-01 18:04:50 Zawisza 2005-01-14 18:05:00 119.0   98.0    False   True    6458445.0   6458445.0
1   2019-05-01 18:04:55 Zawisza 2005-01-14 18:05:00 120.0   97.0    False   True    6458445.0   6458445.0
2   2019-05-01 18:05:00 Zawisza 2005-01-14 18:05:00 NaN     NaN     False   False   NaN         NaN
3   2029-01-01 17:58:45 800A    2029-01-14 17:59:10 NaN     NaN     False   False   NaN         NaN
4   2029-01-01 17:58:50 800A    2029-01-14 17:59:10 139.0   NaN     False   True    NaN         7192783.0
5   2029-01-01 17:58:55 800A    2029-01-14 17:59:10 138.0   NaN     False   True    NaN         7192783.0
...
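For anyone who wants to run the approach above without the original data, here is a self-contained sketch on a hypothetical six-row frame. It also shows an equivalent spelling with `Series.where`, which stays in pandas and preserves the index; which of the two is faster on 50 million rows has not been measured here.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with the same column names as the question.
df = pd.DataFrame({
    "x_inc":  [False, False, False, False, False, False],
    "vs_inc": [True, True, False, False, True, True],
    "rec_num": [6458445, 6458445, np.nan, np.nan, np.nan, 7192783],
})

# True where both flags are False: these rows must not be filled.
cond = ~df["x_inc"] & ~df["vs_inc"]

# The answer's approach: take the backfilled value unless cond is True.
via_np_where = np.where(~cond, df["rec_num"].bfill(), df["rec_num"])

# Same result with Series.where: keep the backfilled value where ~cond holds,
# otherwise fall back to the original value.
via_where = df["rec_num"].bfill().where(~cond, df["rec_num"])

print(via_where.tolist())
# [6458445.0, 6458445.0, nan, nan, 7192783.0, 7192783.0]
```

Assigning `via_where` to `df["rec_num"]` overwrites the column in place, as the note above suggests.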

【Discussion】:

  • @AbalanMusk Don't forget to scroll and check the last column of the solution ;)

【Solution 2】:

You can use apply.

Create the function:

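import pandas as pd

def foo(x):
    # Both flags False: leave the record number empty
    if not bool(x['epic_include']) and not bool(x['vs_include']):
        return None
    else:
        # Keep a record number that is already present
        if not pd.isna(x['twist_mrn']):
            return x['twist_mrn']
        else:
            # Otherwise take the first valid value at or after this row
            # (relies on the default RangeIndex, where labels equal positions)
            return df['twist_mrn'].iloc[df.iloc[x.name:]['twist_mrn'].first_valid_index()]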

Then apply it:

df['twist_mrn'] = df.apply(foo, axis=1)

Output:

    datetime    patient_name    dischargeDate   HR  SpO2    epic_include    vs_include  twist_mrn
0   2014-01-05 18:04:50     Zawisza     2014-01-05 18:05:00     119.0   98.0    False   True    4654843.0
1   2014-01-05 18:04:55     Zawisza     2014-01-05 18:05:00     120.0   97.0    False   True    4654843.0
2   2014-01-05 18:05:00     Zawisza     2014-01-05 18:05:00     NaN     NaN     False   False   NaN
3   2014-01-29 17:58:45     800A    2014-01-29 17:59:10     NaN     NaN     False   False   NaN
4   2014-01-29 17:58:50     800A    2014-01-29 17:59:10     139.0   NaN     False   True    4719848.0
5   2014-01-29 17:58:55     800A    2014-01-29 17:59:10     138.0   NaN     False   True    4719848.0
6   2014-01-29 17:59:00     800A    2014-01-29 17:59:10     138.0   96.0    False   True    4719848.0
7   2014-01-29 17:59:05     800A    2014-01-29 17:59:10     138.0   96.0    False   True    4719848.0
8   2014-01-29 17:59:10     800A    2014-01-29 17:59:10     138.0   96.0    False   True    4719848.0
9   2014-01-29 17:59:15     Barber  2014-01-29 18:17:15     138.0   96.0    False   True    4719848.0
10  2014-01-29 17:59:20     Barber  2014-01-29 18:17:15     138.0   96.0    False   True    4719848.0
11  2014-01-29 17:59:25     Barber  2014-01-29 18:17:15     138.0   95.0    False   True    4719848.0
12  2015-03-04 21:19:45     800A    2015-03-05 01:00:15     NaN     NaN     False   False   NaN
13  2015-03-05 00:53:10     800A    2015-03-05 01:00:15     NaN     NaN     False   False   NaN
14  2015-03-05 00:55:40     800A    2015-03-05 01:00:15     NaN     95.0    False   True    4163407.0
15  2015-03-05 00:55:45     800A    2015-03-05 01:00:15     NaN     95.0    False   True    4163407.0
16  2015-03-05 00:55:50     800A    2015-03-05 01:00:15     NaN     94.0    False   True    4163407.0
17  2015-03-05 00:55:55     800A    2015-03-05 01:00:15     81.0    93.0    False   True    4163407.0
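Because `apply(..., axis=1)` calls the Python function once per row, it is likely to be slow on a frame with 50 million records. The sketch below, on a hypothetical five-row frame that uses this answer's column names, compares a compact copy of `foo` with a vectorized expression that appears to give the same result: backfill the whole column, then blank out the rows where both flags are False.

```python
import numpy as np
import pandas as pd

# Hypothetical sample using this answer's column names.
df = pd.DataFrame({
    "epic_include": [False, False, False, False, False],
    "vs_include":   [True, False, False, True, True],
    "twist_mrn":    [4654843, np.nan, np.nan, np.nan, 4719848],
})

def foo(x):
    # Compact copy of the answer's function, for comparison only.
    if not bool(x["epic_include"]) and not bool(x["vs_include"]):
        return None
    if not pd.isna(x["twist_mrn"]):
        return x["twist_mrn"]
    return df["twist_mrn"].iloc[df.iloc[x.name:]["twist_mrn"].first_valid_index()]

row_wise = df.apply(foo, axis=1)

# Vectorized sketch: backfill everything, then set FALSE/FALSE rows to NaN.
both_false = ~df["epic_include"] & ~df["vs_include"]
vectorized = df["twist_mrn"].bfill().mask(both_false)

print(vectorized.tolist())
# [4654843.0, nan, nan, 4719848.0, 4719848.0]
```

One difference from Solution 1 is worth noting: `foo` returns None for a FALSE/FALSE row even when that row already has a record number, whereas the `np.where` version keeps the existing value.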

【Discussion】:

  • This line, df['twist_mrn'].iloc[df.iloc[x.name:]['twist_mrn'].first_valid_index()], gets the first valid value starting from a specific row (x.name returns the row's index).
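Both answers look up the next valid record number anywhere further down the column, so a value can be carried across a FALSE/FALSE row to a flagged row above it. The sample data has no such case, but if the backfill should stop at FALSE/FALSE rows (one possible reading of the question), a grouped backfill is one option. This is a sketch under that assumption, on hypothetical rows that use the question's column names:

```python
import numpy as np
import pandas as pd

# Hypothetical rows: a flagged row (index 0) sits above a FALSE/FALSE row (index 1).
df = pd.DataFrame({
    "x_inc":  [False, False, False, False],
    "vs_inc": [True, False, True, True],
    "rec_num": [np.nan, np.nan, np.nan, 7192783],
})

blocked = ~df["x_inc"] & ~df["vs_inc"]

# Each FALSE/FALSE row starts a new segment, so a backfill done per segment
# cannot carry a value upward past that row.
segment = blocked.cumsum()
within = df.groupby(segment)["rec_num"].bfill()

# FALSE/FALSE rows keep whatever they had originally.
stopped = within.where(~blocked, df["rec_num"])

# For comparison: the mask-only approach skips over the FALSE/FALSE row.
skipping = df["rec_num"].bfill().where(~blocked, df["rec_num"])

print(pd.DataFrame({"skipping": skipping, "stopped": stopped}))
```

In this example `skipping` fills row 0 with 7192783, while `stopped` leaves it empty because row 1 interrupts the run.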