Apply a Python function to one pandas column and apply the output to multiple columns: answers

【Question title】: Apply Python function to one pandas column and apply the output to multiple columns
【Posted】: 2020-12-09 04:34:45
【Question】:

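Hello everyone,

I have read a lot of answers and blog posts, but I can't figure out what simple thing I am missing. I am using a "conditions" function to define all the conditions and apply it to one dataframe column. When a condition is met, it should create/update 2 new dataframe columns, 'cat' and 'subcat'.

It would be a great help if you could help me out here!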

import pandas as pd

# Named `data` rather than `dict` to avoid shadowing the built-in.
data = {'remark': ['NA', 'NA', 'Category1', 'Category2', 'Category3'],
        'desc': ['Present', 'Present', 'NA', 'Present', 'NA']}

df = pd.DataFrame(data)

The dataframe looks like this:

          remark       desc
0         NA           Present      
1         NA           Present        
2         Category1    NA                   
3         Category2    Present                   
4         Category3    NA

I wrote a function to define the conditions as follows:

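def conditions(s):
    # Map a single remark value to its (cat, subcat) pair.
    if s == 'Category1':
        x = 'insufficient'
        y = 'resolution'
    elif s == 'Category2':
        x = 'insufficient'
        y = 'information'
    elif s == 'Category3':
        x = 'Duplicate'
        y = 'ID repeated'
    else:
        # Any other remark, including 'NA', gets no category.
        x = 'NA'
        y = 'NA'

    return (x, y)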

I have tried several ways of running the function above on the dataframe column, but with no luck. For example:

df[['cat','subcat']] = df['remark'].apply(lambda x: pd.Series([conditions(df)[0],conditions(df)[1]]))

The dataframe I expect should look like this:

          remark       desc        cat           subcat
0         NA           Present     NA            NA      
1         NA           Present     NA            NA
2         Category1    NA          insufficient  resolution         
3         Category2    Present     insufficient  information              
4         Category3    NA          Duplicate     ID repeated

Thanks a lot.

【Comments on the question】:

Tags: python pandas dataframe lambda apply

【Solution 1】:

One way to solve this is with a list comprehension:

df[['cat', 'subcat']] = [("insufficient", "resolution")  if word == "Category1" else 
                         ("insufficient", "information") if word == "Category2" else
                         ("Duplicate", "ID repeated")    if word == "Category3" else 
                         ("NA", "NA")
                         for word in df.remark]

  remark      desc               cat         subcat
0   NA        Present          NA              NA
1   NA        Present          NA              NA
2   Category1   NA          insufficient    resolution
3   Category2   Present     insufficient    information
4   Category3   NA          Duplicate       ID repeated

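@dm2's answer shows how to do it with your function. The first apply(conditions) creates a Series containing tuples, and the second apply splits them into separate columns, forming a dataframe that you can then assign to cat and subcat.

The reason I suggest a list comprehension is that you are working with strings, and in pandas, handling strings in vanilla Python tends to be faster. Also, with the list comprehension the processing is done once; you don't need to apply the conditions function and then call pd.Series. That should give you more speed. Testing will confirm or debunk this.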
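The speed claim above can be checked with a small timing sketch. The helper names `with_list_comprehension` and `with_double_apply` are made up for this illustration, the loop count is arbitrary, and the list comprehension here calls `conditions` instead of spelling out the chained conditional. Timings depend on the machine, the pandas version, and the size of the dataframe, so no figures are given.

```python
import timeit

import pandas as pd


def conditions(s):
    """Map a remark to a (cat, subcat) pair, as in the question."""
    if s == 'Category1':
        return ('insufficient', 'resolution')
    elif s == 'Category2':
        return ('insufficient', 'information')
    elif s == 'Category3':
        return ('Duplicate', 'ID repeated')
    return ('NA', 'NA')


df = pd.DataFrame({
    'remark': ['NA', 'NA', 'Category1', 'Category2', 'Category3'],
    'desc': ['Present', 'Present', 'NA', 'Present', 'NA'],
})


def with_list_comprehension(frame):
    # Hypothetical helper: build all (cat, subcat) pairs in plain Python.
    out = frame.copy()
    out[['cat', 'subcat']] = [conditions(word) for word in out['remark']]
    return out


def with_double_apply(frame):
    # Hypothetical helper: the apply(...).apply(pd.Series) chain from the thread.
    out = frame.copy()
    out[['cat', 'subcat']] = out['remark'].apply(conditions).apply(pd.Series)
    return out


# Compare the new columns first: timing only means something if both agree.
a = with_list_comprehension(df)[['cat', 'subcat']].values.tolist()
b = with_double_apply(df)[['cat', 'subcat']].values.tolist()
same = a == b

t_list = timeit.timeit(lambda: with_list_comprehension(df), number=200)
t_apply = timeit.timeit(lambda: with_double_apply(df), number=200)
print(f"identical results: {same}")
print(f"list comprehension: {t_list:.4f} s, double apply: {t_apply:.4f} s")
```

On a five-row frame the difference is dominated by fixed overhead; repeating the measurement on a frame closer to the real data size is more informative.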

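【Comments】:

I like your list comprehension idea. I tried the same thing earlier, but since the columns ('cat' & 'subcat') do not exist in df, it gave a 'KeyError'. Any ideas? KeyError: "None of [Index(['cat', 'subcat'], dtype='object')] are in the [index]"
Yes, with timeit it is way, way faster: my code averaged 0.3 seconds over 7 runs of 1000 loops, and yours did the same in 0.009 seconds.
I don't know why you are getting that error @nealkaps; the code above should create the two new columns. If you are using a notebook, you could restart the kernel and run the cell again.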
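The thread never establishes what caused the KeyError. If the multi-column assignment fails in a particular environment, one possible workaround is to assign the two columns one at a time. This is only a sketch that avoids the multi-column assignment; it is not a diagnosis of the error.

```python
import pandas as pd

df = pd.DataFrame({
    'remark': ['NA', 'NA', 'Category1', 'Category2', 'Category3'],
    'desc': ['Present', 'Present', 'NA', 'Present', 'NA'],
})

# Same chained conditional as in the answer, collected into a plain list.
pairs = [("insufficient", "resolution") if word == "Category1" else
         ("insufficient", "information") if word == "Category2" else
         ("Duplicate", "ID repeated") if word == "Category3" else
         ("NA", "NA")
         for word in df['remark']]

# zip(*pairs) transposes the list of (cat, subcat) pairs into two tuples,
# one per new column; each is then assigned to a single column.
cats, subcats = zip(*pairs)
df['cat'] = list(cats)
df['subcat'] = list(subcats)
print(df)
```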

【Solution 2】:

You can do it like this:

df[['cat','subcat']] = df['remark'].apply(conditions).apply(pd.Series)

Output:

  remark      desc               cat         subcat
0   NA        Present          NA              NA
1   NA        Present          NA              NA
2   Category1   NA          insufficient    resolution
3   Category2   Present     insufficient    information
4   Category3   NA          Duplicate       ID repeated

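Edit: this is probably the simpler way to apply the function you already have, but if you have a huge DataFrame, check @sammywemmy's answer using a list comprehension for faster code.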
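A middle ground, offered here as a sketch and not something from the thread, keeps the existing `conditions` function but replaces the second `apply(pd.Series)` with a single `pd.DataFrame` construction from the list of tuples. The variable name `expanded` is made up for this illustration.

```python
import pandas as pd


def conditions(s):
    """Map a remark to a (cat, subcat) pair, as in the question."""
    if s == 'Category1':
        return ('insufficient', 'resolution')
    elif s == 'Category2':
        return ('insufficient', 'information')
    elif s == 'Category3':
        return ('Duplicate', 'ID repeated')
    return ('NA', 'NA')


df = pd.DataFrame({
    'remark': ['NA', 'NA', 'Category1', 'Category2', 'Category3'],
    'desc': ['Present', 'Present', 'NA', 'Present', 'NA'],
})

# apply(conditions) yields a Series of tuples; tolist() turns it into a list
# of rows that pd.DataFrame can split into two named columns in one step.
expanded = pd.DataFrame(df['remark'].apply(conditions).tolist(),
                        index=df.index, columns=['cat', 'subcat'])

# Joining on the shared index keeps the rows aligned with the original frame.
df = df.join(expanded)
print(df)
```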

【Comments】:

【Solution 3】:

You are passing the entire dataframe; you only need to pass the lambda variable (x).

df[['cat','subcat']] = df['remark'].apply(lambda x: pd.Series([*conditions(x)]))

The * operator unpacks iterables, so you don't need to call the same function twice to extract the outputs. Maybe the compiler takes care of that, but I don't think so...
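To make the diagnosis concrete, the sketch below first reproduces the failure of calling `conditions` on the whole dataframe, then applies the corrected lambda. The variable names `error` and `unpacked` are introduced only for this illustration.

```python
import pandas as pd


def conditions(s):
    """Map a remark to a (cat, subcat) pair, as in the question."""
    if s == 'Category1':
        return ('insufficient', 'resolution')
    elif s == 'Category2':
        return ('insufficient', 'information')
    elif s == 'Category3':
        return ('Duplicate', 'ID repeated')
    return ('NA', 'NA')


df = pd.DataFrame({
    'remark': ['NA', 'NA', 'Category1', 'Category2', 'Category3'],
    'desc': ['Present', 'Present', 'NA', 'Present', 'NA'],
})

# conditions(df) compares the whole DataFrame to a string, which gives a
# DataFrame of booleans; `if` cannot reduce that to a single True/False.
try:
    conditions(df)
    error = None
except ValueError as exc:
    error = exc
print(type(error).__name__)

# Unpacking with * builds the same list as indexing the result twice,
# but calls conditions() only once.
unpacked = [*conditions('Category1')]
print(unpacked == [conditions('Category1')[0], conditions('Category1')[1]])

# Passing the lambda argument hands conditions() one string at a time.
df[['cat', 'subcat']] = df['remark'].apply(lambda x: pd.Series([*conditions(x)]))
print(df)
```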

【Comments】:

【Solution 4】:

You can use Series.replace with a mapping dictionary:

df['cat'] = df.remark.replace({'Category1': 'insufficient',
    'Category2': 'insufficient', 'Category3': 'Duplicate'})
df['subcat'] = df.remark.replace({'Category1': 'resolution',
    'Category2': 'information', 'Category3': 'ID repeated'})

print(df)
      remark     desc           cat       subcat
0         NA  Present            NA           NA
1         NA  Present            NA           NA
2  Category1       NA  insufficient   resolution
3  Category2  Present  insufficient  information
4  Category3       NA     Duplicate  ID repeated
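The two dictionaries above can also be merged into one lookup table keyed by remark. The sketch below is a variation and not part of the answer; `mapping`, `default`, and `pairs` are names made up for it, and the remark 'Other' is a hypothetical extra value added to show the difference in behavior.

```python
import pandas as pd

df = pd.DataFrame({
    'remark': ['NA', 'NA', 'Category1', 'Category2', 'Category3', 'Other'],
    'desc': ['Present', 'Present', 'NA', 'Present', 'NA', 'Present'],
})

# One table holding both outputs per remark (hypothetical name).
mapping = {
    'Category1': ('insufficient', 'resolution'),
    'Category2': ('insufficient', 'information'),
    'Category3': ('Duplicate', 'ID repeated'),
}
default = ('NA', 'NA')

# dict.get falls back to the default pair for any remark not in the table.
pairs = [mapping.get(remark, default) for remark in df['remark']]
df[['cat', 'subcat']] = pairs
print(df)
```

One difference worth noting: `replace` leaves unmatched values unchanged, so a remark such as 'Other' would be copied into 'cat' and 'subcat' as-is, whereas the lookup with a default yields 'NA' for it, matching the `else` branch of the original function.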

【Comments】: