【Question title】: Write a Python function that takes a list and one column of a DataFrame and adds a new column based on that list
【Posted】: 2021-03-19 00:30:14
【Question description】:

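I have simplified this a great deal: the actual column will hold up to 500 characters, and the list of substrings will have 60 values of between 10 and 80 characters each.

The DataFrame is more complex than this (the list will contain 60 values and the DataFrame will have 11,000 rows), but this is what I am trying to do.

I have a DataFrame and a list like this: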

my_list = ['alabama 500', 'beta 15', 'carthouse', 'd320 blend']

df = pd.DataFrame({'col1':['left side alabama 500 on the right side carthouse', '1st entry is at beta 15', 'this one takes a mix of d320 blend and beta 15']})


    col1
0   left side alabama 500 on the right side carthouse
1   1st entry is at beta 15
2   this one takes a mix of d320 blend and beta 15

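I am trying to write a function that returns the following: the first column kept intact, and each matched substring in a new column next to the full original text, with one row per match.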

df['col2'] 
    col1                                                  col2
0   left side alabama 500 on the right side carthouse     alabama 500
1   left side alabama 500 on the right side carthouse     carthouse
2   1st entry is at beta 15                               beta 15
3   this one takes a mix of d320 blend and beta 15        beta 15
4   this one takes a mix of d320 blend and beta 15        d320 blend

This is what I have tried:

def add_new_col(data, col_name, my_list):
    #function looks at the column col_name in a dataframe data, if the substring exists, it adds a new
    #column with only that substring, keeping multiples
    
    for i in my_list:
        if data[col_name].str.contains(i):
            data['col2'] = i
        else:
            continue
    return data

Running the function in a notebook:

my_list = ['a', 'b', 'c', 'd']
add_new_col(df, 'col1', my_list)

returns this error:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Based on some other answers, I also tried this:

def add_new_col(data, col_name, my_list):
    #function looks at the column col_name in a dataframe data, if the substring exists, it adds a new
    #column with only that substring, keeping multiples
    
    for i in my_list:
        if data[data[col_name].str.contains(i)]:
            data['col2'] = i
        else:
            continue
    return data

This gives the same error.
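For context, a minimal sketch of where the error comes from, assuming a two-row version of the example frame. `Series.str.contains` returns one boolean per row, so an `if` statement has no single truth value to test; the second attempt has the same problem because it passes a filtered DataFrame to `if`. The names `mask`, `raised` and `any_match` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'col1': ['left side alabama 500 on the right side carthouse',
                            '1st entry is at beta 15']})

# str.contains gives a boolean Series with one entry per row, not a single bool
mask = df['col1'].str.contains('beta 15', regex=False)
print(mask.tolist())

# Using the Series directly in `if` is equivalent to calling bool() on it,
# which pandas refuses to do for a multi-element Series
try:
    bool(mask)
    raised = False
except ValueError:
    raised = True

# Reducing with .any() (or .all()) yields the single bool that `if` needs
any_match = bool(mask.any())
print(raised, any_match)
```

Note that `.any()` only answers whether some row matches; producing one output row per match still needs a row-wise selection such as `df[mask]`.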

【Question comments】:

I don't understand the logic behind col2. For example, rows 0 and 1 of col1 are both "acd", yet in col2 we get "a" for row 0 and "c" for row 1.

Tags: python pandas string dataframe

【Solution 1】:

You can use the explode method:

df2 = df.assign(col2=lambda f: f.col1.apply(list)).explode("col2")
print(df2)

  col1 col2
0  acd    a
0  acd    c
0  acd    d
1    a    a
2   db    d
2   db    b

If you want to get rid of the repeated index, just add: df2 = df2.reset_index(drop=True)
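The answer above splits each string into single characters. A possible adaptation to the multi-word substrings from the question, offered as a sketch rather than as part of the original answer, is to collect the matches with `str.findall` and then explode them. The `pattern` variable and the longest-first ordering are assumptions made for this example.

```python
import re
import pandas as pd

my_list = ['alabama 500', 'beta 15', 'carthouse', 'd320 blend']
df = pd.DataFrame({'col1': [
    'left side alabama 500 on the right side carthouse',
    '1st entry is at beta 15',
    'this one takes a mix of d320 blend and beta 15',
]})

# One alternation pattern for all substrings. re.escape protects any regex
# metacharacters, and longer substrings come first so they win over shorter
# ones that overlap with them.
pattern = '|'.join(re.escape(s) for s in sorted(my_list, key=len, reverse=True))

df2 = (
    df.assign(col2=df['col1'].str.findall(pattern))  # list of matches per row
      .explode('col2')                               # one row per match
      .reset_index(drop=True)
)
print(df2)
```

Matches come out in the order they appear in each string, not in list order, and rows with no match keep a missing value in col2 unless they are dropped afterwards.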

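【Comments】:

The actual column 1 will contain up to 500 characters, so this is an oversimplification; the second input is a list of 60 items, all strings of up to 80 characters.

【Solution 2】:

You can use str.split and concat: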

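import numpy as np
import pandas as pd

# split each string into single characters, blank out the empty edge pieces, then stack to one row per character
s = df['col1'].str.split('',expand=True).replace('',np.nan).stack()\
              .reset_index(1,drop=True).to_frame('col2')

# axis must be passed by keyword in recent pandas versions
df1 = pd.concat([df,s],axis=1)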

  col1 col2
0  acd    a
0  acd    c
0  acd    d
1    a    a
2   db    d
2   db    b
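For the multi-word substrings in the question, a concat-based variant can also be written as a repaired version of the asker's `add_new_col`. This is a sketch under the assumption that plain (non-regex) substring matching is wanted; the `new_col` parameter and the `pieces` list are names introduced here for illustration.

```python
import pandas as pd

def add_new_col(data, col_name, substrings, new_col='col2'):
    """Return one row per (original row, matching substring) pair."""
    pieces = []
    for sub in substrings:
        # Boolean mask of the rows containing this substring (literal match)
        mask = data[col_name].str.contains(sub, regex=False, na=False)
        # Keep only the matching rows and tag them with the substring
        pieces.append(data.loc[mask].assign(**{new_col: sub}))
    out = pd.concat(pieces)
    # A stable sort on the original index groups the matches of each row
    # together while preserving list order within a row
    return out.sort_index(kind='mergesort').reset_index(drop=True)

my_list = ['alabama 500', 'beta 15', 'carthouse', 'd320 blend']
df = pd.DataFrame({'col1': [
    'left side alabama 500 on the right side carthouse',
    '1st entry is at beta 15',
    'this one takes a mix of d320 blend and beta 15',
]})

result = add_new_col(df, 'col1', my_list)
print(result)
```

With 60 substrings and 11,000 rows this runs 60 vectorised passes over the column, which should be manageable; the function does not modify the input frame, and rows without any match are left out of the result.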

【Comments】: