【Question title】: Fill new column based on conditions defined in a string
【Posted】: 2021-12-13 11:09:03
【Question】:

I have the conditions for filling a new column defined in a string.

condition_string =  "colA='yes' & colB='yes' & (colC='yes' | colD='yes'): 'Yes', colA='no' & colB='no' & (colC='no' | colD='no'): 'No', ELSE : 'UNKNOWN'"

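The string can be rewritten or restructured in any other format (for example a dictionary) and then fed into the code to get the final result.

The data frame is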

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        'ID': ['AB01', 'AB02', 'AB03', 'AB03', 'AB04', 'AB05', 'AB06'],
        'colA': ['yes', 'yes', 'yes', 'no', 'no', 'yes', np.nan],
        'colB': [np.nan, 'yes', 'yes', 'no', 'no', np.nan, 'yes'],
        'colC': ['yes', 'yes', 'yes', 'no', 'no', np.nan, np.nan],
        'colD': ['yes', 'no', 'yes', 'no', np.nan, 'no', np.nan],
    }
)

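The final result should look like this

How can I get this done without hard-coding the contents of condition_string? Or do you have a way to restructure condition_string and then apply it to the data frame?

Update: what if the dictionary is like this?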

condition_string =  "colA='yes' & (colB='yes' | colB='no)' & (colC='yes' | colD='yes'): 'Yes', colA='no' & colB='no' & (colC='no' | colD='no'): 'No', ELSE : 'UNKNOWN'"

and the data frame is like

df = pd.DataFrame(
    {
        'ID': ['AB01', 'AB02', 'AB03', 'AB03', 'AB04', 'AB05', 'AB06'],
        'colA': ['yes', 'yes', 'yes', 'no', 'no', 'yes', np.nan],
        'colB': ['no', 'yes', 'yes', 'no', 'no', np.nan, 'yes'],
        'colC': ['yes', 'yes', 'yes', 'no', 'no', np.nan, np.nan],
        'colD': ['yes', 'no', 'yes', 'no', np.nan, 'no', np.nan],
    }
)

【Question comments】:

    Tags: python pandas dataframe numpy data-manipulation


    【Solution 1】:

    Here is a solution that converts your conditions into a Python function and then applies it to the rows of the DataFrame:

    import re
    
    condition_string =  "colA='yes' & colB='yes' & (colC='yes' | colD='yes'): 'Yes', colA='no' & colB='no' & (colC='no' | colD='no'): 'No', ELSE : 'UNKNOWN'"
    
    # formatting string as python function apply_cond
    for col in df.columns:
        condition_string = re.sub(rf"(\W|^){col}(\W|$)", rf"\1row['{col}']\2", condition_string)
        condition_string = re.sub(rf"row\['{col}'\]\s*=(?!=)", f"row['{col}']==", condition_string)
    
    cond_form = re.sub(r'(:[^[(]+), (?!ELSE)', r'\1\n\telif ', condition_string) \
                .replace(": ", ":\n\t\treturn ") \
                .replace("&", "and") \
                .replace('|', 'or')
    cond_form = re.sub(r", ELSE\s*:", "\n\telse:", cond_form)
    function_def = "def apply_cond(row):\n\tif " + cond_form
    #print(function_def) # uncomment to see how the function is defined
    
    # executing the function definition of apply_cond
    exec(function_def)
    
    # applying the function to each row
    df["result"] = df.apply(apply_cond, axis=1)
    
    print(df)
    

    Output:

         ID colA colB colC colD   result
    0  AB01  yes  NaN  yes  yes  UNKNOWN
    1  AB02  yes  yes  yes   no      Yes
    2  AB03  yes  yes  yes  yes      Yes
    3  AB03   no   no   no   no       No
    4  AB04   no   no   no  NaN       No
    5  AB05  yes  NaN  NaN   no  UNKNOWN
    6  AB06  NaN  yes  NaN  NaN  UNKNOWN
    

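    You may want to adjust the string formatting to match your condition_string (I wrote this quickly, so some combinations may not be supported), but if you receive these strings automatically, it saves you from redefining them by hand.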
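As an alternative sketch that avoids exec, the same condition_string can be split into (condition, label) pairs, each condition evaluated with DataFrame.eval, and the results combined with np.select. This is an illustration rather than the answer's method: the helper name apply_condition_string, its fallback parameter, and both regular expressions are assumptions, and it only covers the "cond: 'Label', ..., ELSE : 'Default'" layout from the question, with no ':' or '=' inside quoted values.

```python
import re

import numpy as np
import pandas as pd

# One rule looks like "<condition>: '<label>'", followed by a comma or the end.
RULE_RE = re.compile(r"(.+?)\s*:\s*'([^']*)'\s*(?:,\s*|$)")


def apply_condition_string(df, condition_string, fallback="UNKNOWN"):
    """Return an array of labels picked by the rules in condition_string."""
    conditions, labels, default = [], [], fallback
    for cond, label in RULE_RE.findall(condition_string):
        if cond.strip().upper() == "ELSE":
            default = label  # the ELSE branch supplies the default label
            continue
        # Turn a lone '=' into '==' (leaves '==', '!=', '<=', '>=' untouched).
        expr = re.sub(r"(?<![=!<>])=(?!=)", "==", cond)
        # engine="python" keeps the evaluation independent of numexpr.
        mask = df.eval(expr, engine="python")
        conditions.append(mask.to_numpy(dtype=bool))
        labels.append(label)
    # The first matching rule wins; rows matching no rule get the default.
    return np.select(conditions, labels, default=default)


df = pd.DataFrame(
    {
        "ID": ["AB01", "AB02", "AB03", "AB03", "AB04", "AB05", "AB06"],
        "colA": ["yes", "yes", "yes", "no", "no", "yes", np.nan],
        "colB": [np.nan, "yes", "yes", "no", "no", np.nan, "yes"],
        "colC": ["yes", "yes", "yes", "no", "no", np.nan, np.nan],
        "colD": ["yes", "no", "yes", "no", np.nan, "no", np.nan],
    }
)

condition_string = (
    "colA='yes' & colB='yes' & (colC='yes' | colD='yes'): 'Yes', "
    "colA='no' & colB='no' & (colC='no' | colD='no'): 'No', "
    "ELSE : 'UNKNOWN'"
)

df["result"] = apply_condition_string(df, condition_string)
print(df)
```

Compared with the exec approach, the conditions are evaluated column-wise instead of row by row, and no Python source is generated from the input string. The string still has to follow the assumed layout.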

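    【Comments】:

    • What if the dictionary is updated to the format above? Can this be made to work for the updated data frame scenario above?
    • If I change condition_string to condition_string = "colA='yes' & colB in ['yes','no'] & (colC='yes' | colD='yes'): 'Yes', colA='no' & colB='no' & (colC='no' | colD='no'): 'No', ELSE : 'UNKNOWN'", function_def does not attach row to colB.
    • In your question, the second condition_string has an error. The parenthesis closes before the closing quote. It should be: condition_string = "colA='yes' & (colB='yes' | colB='no') & (colC='yes' | colD='yes'): 'Yes', colA='no' & colB='no' & (colC='no' | colD='no'): 'No', ELSE : 'UNKNOWN'"
    • Regarding colB in, I have updated my code, take a look! As I said, depending on the syntax of condition_string, you may want to extend the substitutions.
    • What if condition_string can also look like condition_string = "col.A in ['Osel', 'Quine', 'Lovir (Kaletra)', 'Lan ate', 'Dar/cob']: 'Yes', ELSE: col.B"? How would I update the code? I have been trying to extend the substitutions, without success.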
    【Solution 2】:

    You can use np.where:

    df['decision'] = np.where(
        (df['colA'] == 'yes') & (df['colB'] == 'yes')
        & ((df['colC'] == 'yes') | (df['colD'] == 'yes')),
        'Yes',
        np.where(
            (df['colA'] == 'no') & (df['colB'] == 'no')
            & ((df['colC'] == 'no') | (df['colD'] == 'no')),
            'No',
            'UNKNOWN',
        ),
    )
    
    

    This gives:

         ID colA colB colC colD decision
    0  AB01  yes  NaN  yes  yes  UNKNOWN
    1  AB02  yes  yes  yes   no      Yes
    2  AB03  yes  yes  yes  yes      Yes
    3  AB03   no   no   no   no       No
    4  AB04   no   no   no  NaN       No
    5  AB05  yes  NaN  NaN   no  UNKNOWN
    6  AB06  NaN  yes  NaN  NaN  UNKNOWN
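For readers who find the nested np.where calls hard to follow, the same hard-coded logic can be sketched with named boolean masks and a single np.select call. The mask names (any_yes, is_yes and so on) are only illustrative. Note that a comparison such as df["colA"] == "yes" evaluates to False for NaN, which is why rows with missing values fall through to the default.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "ID": ["AB01", "AB02", "AB03", "AB03", "AB04", "AB05", "AB06"],
        "colA": ["yes", "yes", "yes", "no", "no", "yes", np.nan],
        "colB": [np.nan, "yes", "yes", "no", "no", np.nan, "yes"],
        "colC": ["yes", "yes", "yes", "no", "no", np.nan, np.nan],
        "colD": ["yes", "no", "yes", "no", np.nan, "no", np.nan],
    }
)

# Build each piece of the condition once and give it a readable name.
any_yes = (df["colC"] == "yes") | (df["colD"] == "yes")
any_no = (df["colC"] == "no") | (df["colD"] == "no")
is_yes = (df["colA"] == "yes") & (df["colB"] == "yes") & any_yes
is_no = (df["colA"] == "no") & (df["colB"] == "no") & any_no

# np.select picks the label of the first mask that is True for each row.
df["decision"] = np.select(
    [is_yes.to_numpy(), is_no.to_numpy()],
    ["Yes", "No"],
    default="UNKNOWN",
)
print(df)
```

Adding another branch then means adding one mask and one label instead of another level of nesting.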
    

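    【Comments】:

    • This is what I want to avoid. I don't want to hard-code what is in condition_string. I want to take it from condition_string itself, or restructure it into a dictionary that can be applied to the data frame.
    • Then you should mark @Henry Yik's answer as the accepted answer.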
    【Solution 3】:

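    IIUC you want to create arbitrary conditions for your df, which can be done using functools.reduce and operator.and_. You can then set up the conditions with two lists (instead of a dict), the first holding the columns and the second the strings to test, and finish with np.select: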

    from functools import reduce
    from operator import and_
    
    cols = ["colA", "colB", ["colC", "colD"]] # group the cols in a list if they belong to the same group
    answer = ["yes", "no"]
    
    # axis must be passed by keyword: positional .any(1) is rejected by pandas 2.x
    conds = [reduce(and_, [df[i].eq(ans) if isinstance(i, str) else df[i].eq(ans).any(axis=1)
                           for i in cols]) for ans in answer]
    
    df["result"] = np.select(conds, answer, "Unknown")
    
    print (df)
    
         ID colA colB colC colD   result
    0  AB01  yes  NaN  yes  yes  Unknown
    1  AB02  yes  yes  yes   no      yes
    2  AB03  yes  yes  yes  yes      yes
    3  AB03   no   no   no   no       no
    4  AB04   no   no   no  NaN       no
    5  AB05  yes  NaN  NaN   no  Unknown
    6  AB06  NaN  yes  NaN  NaN  Unknown
    

    If you need to adjust the conditions, you now only have to edit the two lists cols and answer.
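The two-list setup above tests every column group against the same answer. For the updated case in the question, where colB may be either 'yes' or 'no' in the 'Yes' rule, one possible extension is a dictionary that maps each label to a list of (columns, allowed values) groups. The sketch below is an assumption for illustration: the rules layout and the rule_mask helper are not from the answer, and the labels and default are chosen to match the question's condition string.

```python
from functools import reduce
from operator import and_

import numpy as np
import pandas as pd

# Updated data frame from the question.
df = pd.DataFrame(
    {
        "ID": ["AB01", "AB02", "AB03", "AB03", "AB04", "AB05", "AB06"],
        "colA": ["yes", "yes", "yes", "no", "no", "yes", np.nan],
        "colB": ["no", "yes", "yes", "no", "no", np.nan, "yes"],
        "colC": ["yes", "yes", "yes", "no", "no", np.nan, np.nan],
        "colD": ["yes", "no", "yes", "no", np.nan, "no", np.nan],
    }
)

# Hypothetical layout: label -> list of (columns, allowed values).
# A group passes if ANY of its columns holds an allowed value;
# a rule passes if ALL of its groups pass.
rules = {
    "Yes": [(["colA"], ["yes"]), (["colB"], ["yes", "no"]), (["colC", "colD"], ["yes"])],
    "No": [(["colA"], ["no"]), (["colB"], ["no"]), (["colC", "colD"], ["no"])],
}


def rule_mask(frame, groups):
    """Combine the per-group masks of one rule with a logical AND."""
    masks = [frame[cols].isin(values).any(axis=1) for cols, values in groups]
    return reduce(and_, masks)


conds = [rule_mask(df, groups).to_numpy() for groups in rules.values()]
df["result"] = np.select(conds, list(rules), default="UNKNOWN")
print(df)
```

Because dictionaries keep insertion order, the rule listed first wins when a row satisfies more than one rule, which mirrors the if/elif order of the original condition string.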

    【Comments】:
