I want to create new features from a pandas dataset by an arbitrary process

【Question title】: I want to create new features from a pandas dataset by an arbitrary process
【Posted】: 2022-06-22 16:36:47
【Question description】:

I am currently working with the following dataset.

import pandas as pd
import io

csv_data = '''
ID,age,get_sick,year
4567,76,0,2014
4567,78,0,2016
4567,79,1,2017
12168,65,0,2014
12168,68,0,2017
12168,69,0,2018
12168,70,1,2019
20268,65,0,2014
20268,66,0,2015
20268,67,0,2016
20268,68,0,2017
20268,69,1,2018
22818,65,0,2008
22818,73,1,2016
'''
df = pd.read_csv(io.StringIO(csv_data), index_col=['ID', 'age'])

           get_sick  year
ID    age                
4567  76          0  2014
      78          0  2016
      79          1  2017
12168 65          0  2014
      68          0  2017
      69          0  2018
      70          1  2019
20268 65          0  2014
      66          1  2015
      67          1  2016
      68          1  2017
      69          1  2018
22818 65          0  2008
      73          1  2016

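For each person, the data records the age at the time of the medical examination, the year of the measurement, and get_sick, which is 1 if the person has had the disease.

We are now trying to build a model that predicts how likely a person with get_sick=0 is to develop the disease in the future.

For each row with get_sick=0, we want to check whether get_sick changes from 0 to 1 within 5 years. If it does, we want to store 1 in a new column 'history'; if it stays at 0, we want to store 0.

We only target rows with get_sick=0, because rows with get_sick=1 are not used for training.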
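The rule described above can be sketched in plain Python for a single person. The function name `history_labels` and the `window` parameter are hypothetical, and the sketch assumes that "within 5 years" means the sick year is 1 to 5 years after the row's year.

```python
def history_labels(records, window=5):
    """Label one person's records.

    records: list of (year, get_sick) tuples for a single ID.
    Returns 1/0 for get_sick == 0 rows and None for get_sick == 1 rows.
    Assumption: "within 5 years" means 0 < sick_year - year <= window.
    """
    sick_years = [year for year, sick in records if sick == 1]
    labels = []
    for year, sick in records:
        if sick == 1:
            # Rows that are already sick are not used for training.
            labels.append(None)
        else:
            becomes_sick = any(0 < sy - year <= window for sy in sick_years)
            labels.append(int(becomes_sick))
    return labels


# ID 4567 from the dataset: sick in 2017, so 2014 and 2016 are within 5 years.
print(history_labels([(2014, 0), (2016, 0), (2017, 1)]))  # [1, 1, None]
# ID 22818: sick in 2016, which is 8 years after 2008.
print(history_labels([(2008, 0), (2016, 1)]))  # [0, None]
```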

What I tried:

N = 3
idx = df.groupby('ID').apply(lambda x: x.query("(year - @x.year.min()) <= @N")['get_sick'].max())
df_1 = df.reset_index().assign(history=df.reset_index()['ID'].map(idx)).set_index(['ID', 'age'])
df_1

This does not give the desired result, because every row is only compared against the first year of each ID.
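To see why, it can help to look at what the grouped step actually computes. The sketch below re-expresses the same filter with a boolean mask instead of `query` (an equivalent rewrite, not the original code) and uses the CSV data from the question. It yields a single value per ID, measured from that ID's earliest year, so every row of an ID receives the same label.

```python
import io

import pandas as pd

csv_data = """ID,age,get_sick,year
4567,76,0,2014
4567,78,0,2016
4567,79,1,2017
12168,65,0,2014
12168,68,0,2017
12168,69,0,2018
12168,70,1,2019
20268,65,0,2014
20268,66,0,2015
20268,67,0,2016
20268,68,0,2017
20268,69,1,2018
22818,65,0,2008
22818,73,1,2016
"""
df = pd.read_csv(io.StringIO(csv_data))

N = 3
# One scalar per ID: the max of get_sick among rows at most N years
# after that ID's FIRST year. Later rows never get their own window.
per_id = {
    int(id_): int(g.loc[g["year"] - g["year"].min() <= N, "get_sick"].max())
    for id_, g in df.groupby("ID")
}
print(per_id)  # {4567: 1, 12168: 0, 20268: 0, 22818: 0}
```

For example, ID 12168 gets 0 on every row, even though the 2017 and 2018 rows are followed by get_sick=1 in 2019.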

The ideal output is as follows:

           get_sick  year  history
ID    age                
4567  76          0  2014       1
      78          0  2016       1
      79          1  2017     NaN
12168 65          0  2014       1
      68          0  2017       1
      69          0  2018       1
      70          1  2019     NaN
20268 65          0  2014       1
      66          1  2015     NaN
      67          1  2016     NaN
      68          1  2017     NaN
      69          1  2018     NaN
22818 65          0  2008       0
      73          1  2016     NaN

If anyone is familiar with pandas operations, I would appreciate it if you could tell me how to do this.

Thank you in advance.

【Question discussion】:

ID 20268 has only one record with get_sick = 1 in the CSV data, but your dataframe shows several records with get_sick = 1.

Tags: python pandas

【Solution 1】:

First, I created a column containing the year in which get_sick = 1.

df_mer = df[df.get_sick == 1].reset_index()[['ID', 'year']]

df = df.reset_index().merge(df_mer, on = 'ID', suffixes=('', '_max'))

Then you can use year_max to compute the difference in years and assign 1 or 0.

df.loc[(df.get_sick == 0) & (df.year_max - df.year <= 5), 'history'] = 1
df.loc[(df.get_sick == 0) & (df.year_max - df.year > 5), 'history'] = 0

df = df.set_index(['ID', 'age']).drop(columns='year_max')

Output:

           get_sick  year  history
ID    age                         
4567  76          0  2014      1.0
      78          0  2016      1.0
      79          1  2017      NaN
12168 65          0  2014      1.0
      68          0  2017      1.0
      69          0  2018      1.0
      70          1  2019      NaN
20268 65          0  2014      1.0
      66          0  2015      1.0
      67          0  2016      1.0
      68          0  2017      1.0
      69          1  2018      NaN
22818 65          0  2008      0.0
      73          1  2016      NaN
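Note that `merge` defaults to an inner join, so the approach above drops any ID that has no get_sick = 1 row, and it duplicates rows when an ID has several get_sick = 1 rows (as in the dataframe printed in the question). A possible variant that avoids the merge is sketched below. It is an assumption-laden sketch rather than the definitive method: it takes the first sick year per ID, treats "within 5 years" as a gap of 1 to 5 years, and labels IDs that are never observed sick as 0.

```python
import io

import numpy as np
import pandas as pd

csv_data = """ID,age,get_sick,year
4567,76,0,2014
4567,78,0,2016
4567,79,1,2017
12168,65,0,2014
12168,68,0,2017
12168,69,0,2018
12168,70,1,2019
20268,65,0,2014
20268,66,0,2015
20268,67,0,2016
20268,68,0,2017
20268,69,1,2018
22818,65,0,2008
22818,73,1,2016
"""
df = pd.read_csv(io.StringIO(csv_data))

# First year with get_sick == 1 per ID, broadcast to every row of that ID.
# IDs that never get sick end up with NaN here.
first_sick = (
    df["year"].where(df["get_sick"] == 1).groupby(df["ID"]).transform("min")
)

# Years from each row until the first sick year (NaN if never sick).
gap = first_sick - df["year"]

# 1.0 if the person becomes sick 1 to 5 years later, otherwise 0.0.
# Comparisons against NaN are False, so never-sick IDs get 0.0 (assumption).
df["history"] = gap.between(1, 5).astype(float)

# Rows that are already sick are not training targets.
df.loc[df["get_sick"] == 1, "history"] = np.nan

df = df.set_index(["ID", "age"])
print(df)
```

On the CSV data this produces the same history column as the output above.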

【Discussion】: