How to extract values from one column and create separate binary columns based on the targeted column in Python: answers

【Question title】: How to extract values from one column and create separate binary columns based on the targeted column in python
【Posted】: 2021-11-27 02:25:36
【Question description】:

I have a dataset like the one below. reason is the only column given; the other columns are my desired output.

reason           business_name  name  individual_name   DOB
business name       Yes          No       No             No
name                No           Yes      No             No
business name       Yes          No       No             No
individual_name     No           No       Yes            No
DOB                 No           No       No             Yes
Business name,name  Yes          Yes      No             No

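The reason field is the only column I have, and I want to create several separate columns that store the result in a binary format.

My current code looks silly. In the real data, the reason column has 10+ unique values. I created 10+ keyword lists to store the reason keywords, and 10+ empty lists to append 'Yes' or 'No' to. Sample logic: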

for comment in  df['reason'] :
    if any(x in comment for x in keywords1):
        lis1.append('Yes')
    else:
        lis1.append('No')
         .
         .

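However, when the value being scanned is name, both the business_name column and the name column end up as Yes. I think this is because the string name matches both keywords1 and keyword2:
keywords1=['business name'] keyword2 =['name']

This is not really what I want. I want both columns flagged only when the reason actually has both values, as in: Business name,name. I am not sure how to fix this, or how to avoid creating 10+ lists by hand.

Thanks in advance!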

【Question comments】:

Are you looking for df['reason'].str.get_dummies(',').replace({0: 'No', 1: 'Yes'})? See also: Quickest way to make a get_dummies type dataframe from a column with a multiple of strings
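The one-liner suggested in the comment above can be tried on the sample data. This is a minimal sketch, assuming the reasons are separated by commas with no surrounding spaces. Lowercasing first is an extra assumption (not part of the comment) so that 'Business name' and 'business name' land in the same column; as a side effect, DOB becomes dob.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'reason': ['business name', np.nan, 'name', 'business name',
                              'individual_name', 'DOB', 'Business name,name']})

# Assumption: lowercase so differently-cased spellings share one column.
# str.get_dummies splits on ',' and compares whole tokens, so the token
# 'name' does not switch on the 'business name' column.
dummies = df['reason'].str.lower().str.get_dummies(',')

# Turn the 0/1 indicators into 'No'/'Yes' as in the desired output.
yes_no = dummies.replace({0: 'No', 1: 'Yes'})

result = pd.concat([df, yes_no], axis=1)
print(result)
```

Rows with a missing reason come out as 'No' in every column, because get_dummies produces all zeros for them.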

Tags: python python-3.x pandas list dataframe

【Solution 1】:

Explanation first, followed by the code.

Get a list of the truly unique reasons. You can use dropna() here to remove any NA values.

import numpy as np
import pandas as pd
from itertools import chain

# You can probably skip this list if you already have the dataframe
reasons = [
    'business name',
    np.nan,
    'name',
    'business name',
    'individual_name',
    'DOB',
    'Business name,name',
]

df = pd.DataFrame(reasons)
df.columns = ['reason']

unique_reasons = pd.unique(df.reason.dropna()).tolist()

# Get any item that has a comma, and split it into separate pieces
splits = [x.split(',') for x in unique_reasons if ',' in x]
# Take out all the items you just split from the main list
unique_reasons = [y for y in unique_reasons if ',' not in y]
# Combine the two lists; chain flattens the list of lists that split produces
new_list = unique_reasons + list(chain.from_iterable(splits))

# A set makes sure each item appears only once
unique_reasons_set = set(new_list)
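The flattening step matters because split returns one list per reason, and a set cannot hold lists. The sketch below illustrates this with a small hypothetical splits value (the second inner list is made up for illustration and is not in the sample data).

```python
from itertools import chain

# Hypothetical result of splitting two comma-separated reasons.
splits = [['Business name', 'name'], ['DOB', 'name']]

# Calling set(splits) directly would raise
# TypeError: unhashable type: 'list', because the elements are lists.

# chain.from_iterable flattens the list of lists into one list of strings.
flat = list(chain.from_iterable(splits))

# Strings are hashable, so building the set now works and removes duplicates.
unique = set(flat)
print(flat)
print(unique)
```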

Create a boolean mask for each item in unique_reasons_set: write True if df['reason'] contains that item as a string, otherwise False.

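import numpy as np

new_cols = []
for item in unique_reasons_set:
    # str.contains does substring matching, so 'name' also matches 'business name'.
    # na=False treats a missing reason as no match.
    col = np.where(df['reason'].str.contains(item, na=False), True, False)
    new_cols.append(col)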
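Because str.contains matches substrings, the mask for 'name' is also True on rows whose reason is 'business name', which is the behaviour the question wanted to avoid. One possible alternative, sketched here under the assumption that multiple reasons are separated by commas, is to split each reason and test whole-token membership. The helper name has_token and the case-insensitive comparison are illustrative choices, not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'reason': ['business name', np.nan, 'name', 'business name',
                              'individual_name', 'DOB', 'Business name,name']})


def has_token(reason, token):
    """Return True only if token is one of the comma-separated parts of reason."""
    if pd.isna(reason):
        return False
    # Assumption: compare case-insensitively and ignore stray spaces.
    parts = [part.strip().lower() for part in reason.split(',')]
    return token.lower() in parts


# One boolean column per target reason, matched as a whole token.
for token in ['business name', 'name', 'individual_name', 'DOB']:
    df[token] = [has_token(reason, token) for reason in df['reason']]

print(df)
```

With whole-token matching, a row whose reason is 'business name' is flagged only in the business name column, while 'Business name,name' is flagged in both.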

Take all of these new columns and join them to the original dataframe.

df2 = pd.DataFrame.from_dict(dict(zip(unique_reasons_set, new_cols)))
df = pd.concat([df, df2], axis=1)

Complete code

import pandas as pd
import numpy as np
from itertools import chain

reasons = [
    'business name',
    np.nan,
    'name',
    'business name',
    'individual_name',
    'DOB',
    'Business name,name',
]

df = pd.DataFrame(reasons)
df.columns = ['reason']

unique_reasons = pd.unique(df.reason.dropna()).tolist()

# Get any item that has a comma, and split it into separate pieces
splits = [x.split(',') for x in unique_reasons if ',' in x]
# Take out all the items you just split from the main list
unique_reasons = [y for y in unique_reasons if ',' not in y]
# Combine the two lists, and make sure each item in the final combined list
# appears only once. chain is needed to flatten the 2D list that results from split.
new_list = unique_reasons + list(chain.from_iterable(splits))

unique_reasons_set = set(new_list)

new_cols = []
for item in unique_reasons_set:
    # str.contains does substring matching, so 'name' also matches 'business name'.
    # na=False treats a missing reason as no match.
    col = np.where(df['reason'].str.contains(item, na=False), True, False)
    new_cols.append(col)

df2 = pd.DataFrame.from_dict(dict(zip(unique_reasons_set, new_cols)))
df = pd.concat([df, df2], axis=1)
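The question asked for 'Yes'/'No' values rather than True/False. A small sketch of one way to convert boolean columns, using a toy frame named flags (a hypothetical stand-in for the boolean columns built above) so the block runs on its own:

```python
import numpy as np
import pandas as pd

# Hypothetical boolean indicator columns, standing in for the masks built earlier.
flags = pd.DataFrame({'business name': [True, False, True],
                      'name': [False, True, True]})

# np.where picks 'Yes' where the flag is True and 'No' otherwise;
# rebuilding the frame keeps the original index and column labels.
yes_no = pd.DataFrame(np.where(flags, 'Yes', 'No'),
                      index=flags.index, columns=flags.columns)
print(yes_no)
```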

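【Comments】:

Also, when building the set, TypeError: unhashable type: 'list' is returned.
There is another error as well: ValueError: Wrong number of items passed 30, placement implies 1
Sorry, I originally threw the answer out there without checking it. I went back, verified my work, and edited the original.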