Transforming multiple observational features into a single observational feature in a Pandas DataFrame (Python): answers

【Question title】: Transforming multiple observational feature to single observational feature in Pandas Dataframe Python
【Posted】: 2020-07-21 00:02:54
【Question description】:

I have a DataFrame that contains multiple observations of one column (preDiabetes) for each MotherID:

    ChildID   MotherID   preDiabetes
0     20      455        No
1     20      455        Not documented
2     13      102        NaN
3     13      102        Yes
4     702     946        No
5     82      571        No
6     82      571        Yes
7     82      571        Not documented

I want to transform the multiple-observation feature (preDiabetes) into a feature that has a single observation per MotherID.

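To do this, I will create a new DataFrame with a feature newPreDiabetes, where:

If preDiabetes == "Yes" for any observation of a given MotherID, newPreDiabetes is assigned "Yes", regardless of the remaining observations.
Otherwise, if no observation for that MotherID has preDiabetes == "Yes", newPreDiabetes is assigned "No".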

So my new DataFrame will have a single observation of the preDiabetes feature for each unique MotherID:

    ChildID   MotherID   newPreDiabetes
0   20        455        No
1   13        102        Yes
2   702       946        No
3   82        571        Yes

I am new to Python and Pandas, so I'm not sure what the best way to achieve this is, but this is what I have tried so far:


    # get list of all unique mother ids
    uniqueMotherIds = pd.unique(df[['MotherID']].values.ravel())
    
    # create new dataframe that will contain unique MotherIDs and single observations for newPreDiabetes
    newDf = {'MotherID','newPreDiabetes' }
    
    # iterate through list of all mother ids and look for preDiabetes=="Yes"
    for id in uniqueMotherIds:
        filteredDf= df[df['MotherID'] == id].preDiabetes=="Yes"
        result = pd.concat([filteredDf, newDf])

The code is not finished, and I'm not sure whether I'm on the right track, so any help would be much appreciated!

Thanks a lot :)

【Question comments】:

Tags: python pandas

【Solution 1】:

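import numpy as np
import pandas as pd

df = pd.DataFrame({
        'MotherID': [455, 455, 102, 102, 946, 571, 571, 571],
        'preDiabetes': ['No', 'Not documented', np.nan,
                        'Yes', 'No', 'No', 'Yes', 'Not documented'],
        'ChildID': [20, 20, 13, 13, 702, 82, 82, 82]
                   })

# Collect all preDiabetes observations for each (MotherID, ChildID) pair into a list
result = df.groupby(['MotherID', 'ChildID'])['preDiabetes'].apply(list).reset_index()
# 'Yes' if any observation in the list is 'Yes', otherwise 'No'
result['newPreDiabetes'] = result['preDiabetes'].apply(
    lambda x: 'Yes' if 'Yes' in x else 'No')
# Drop the intermediate list column
result = result.drop(columns=['preDiabetes'])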

Output:


   MotherID ChildID newPreDiabetes
0   102     13      Yes
1   455     20      No
2   571     82      Yes
3   946     702     No
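
For comparison, the same rule ("Yes" if any observation for a mother is "Yes", otherwise "No") can probably be written without building intermediate lists. This is only a sketch of an alternative, not the answer's method: it assumes the column names from the question, rebuilds the sample data so it runs on its own, and the names is_yes and any_yes are made up for illustration.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "ChildID": [20, 20, 13, 13, 702, 82, 82, 82],
    "MotherID": [455, 455, 102, 102, 946, 571, 571, 571],
    "preDiabetes": ["No", "Not documented", np.nan, "Yes",
                    "No", "No", "Yes", "Not documented"],
})

# True where the observation is exactly "Yes"; NaN compares as False
is_yes = df["preDiabetes"].eq("Yes")

# A mother is flagged when any of her observations is "Yes"
any_yes = is_yes.groupby(df["MotherID"]).any()

# Map the boolean flag back to the "Yes"/"No" labels used in the question
result = (
    any_yes.map({True: "Yes", False: "No"})
    .rename("newPreDiabetes")
    .reset_index()
)
print(result)
```

Comparing with eq("Yes") first means "Not documented" and missing values both count as "not Yes", which matches the rule stated in the question.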

【Comments】:

This works, thank you! I have another column (ChildID) that I did not include in the MWE above. How can I include it in the output when doing this?
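
One possible way to carry ChildID along, sketched here under the assumption that each MotherID maps to a single ChildID (as in the sample data), is to aggregate both columns in one groupby. If a mother can have several children, "first" would keep only one of them, and grouping by both MotherID and ChildID as in the solution above may be the safer choice. The helper column is_yes and the variable out are hypothetical names used only for this example.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "ChildID": [20, 20, 13, 13, 702, 82, 82, 82],
    "MotherID": [455, 455, 102, 102, 946, 571, 571, 571],
    "preDiabetes": ["No", "Not documented", np.nan, "Yes",
                    "No", "No", "Yes", "Not documented"],
})

out = (
    df.assign(is_yes=df["preDiabetes"].eq("Yes"))   # boolean helper column
    .groupby("MotherID", sort=False)                # keep first-appearance order
    .agg(ChildID=("ChildID", "first"),              # assumes one child per mother
         is_yes=("is_yes", "any"))                  # any "Yes" among observations
    .reset_index()
)

# Turn the boolean flag into the "Yes"/"No" labels and order the columns
out["newPreDiabetes"] = np.where(out["is_yes"], "Yes", "No")
out = out[["ChildID", "MotherID", "newPreDiabetes"]]
print(out)
```

Passing sort=False keeps the mothers in the order they first appear, so the rows line up with the desired output shown in the question.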