【Question title】:How to extract values from one column and create separate binary columns based on the targeted column in Python
【Posted】:2021-11-27 02:25:36
【Question】:

I have a dataset like the one below. reason is the only column I am given; the other columns are the output I want.

reason           business_name  name  individual_name   DOB
business name       Yes          No       No             No
name                No           Yes      No             No
business name       Yes          No       No             No
individual_name     No           No       Yes            No
DOB                 No           No       No             Yes
Business name,name  Yes          Yes      No             No

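The reason field is my only column, and I want to create several separate columns that store the result in binary (Yes/No) form.

My current code looks clumsy. In the real data, the reason column has more than 10 unique values. I created 10+ keyword lists to hold the reason keywords, plus 10+ empty lists to append 'Yes' or 'No' to. Sample logic: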

for comment in  df['reason'] :
    if any(x in comment for x in keywords1):
        lis1.append('Yes')
    else:
        lis1.append('No')
         .
         .

However, when the value being scanned is name, both the business_name column and the name column come out as Yes. I think this is because name appears in both keywords1 and keyword2:
keywords1=['business name'] keyword2 =['name']

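This is not what I want. Both columns should be Yes only when the reason actually holds both values, as in: Business name,name. I don't know how to fix this, or how to avoid creating 10+ lists by hand.

Thanks in advance!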

【Question comments】:

Tags: python python-3.x pandas list dataframe


【Solution 1】:

Explanation first, followed by the code.

Get a list of the truly unique reasons. You can use dropna() to remove any NA values here.
import numpy as np
import pandas as pd
from itertools import chain

# you can probably skip this list if you already have the dataframe
reasons = [
    'business name',
    np.nan,
    'name',
    'business name',
    'individual_name',
    'DOB',
    'Business name,name']

df = pd.DataFrame(reasons)
df.columns = ['reason']

# unique, non-missing values of the reason column
unique_reasons = pd.unique(df.reason.dropna()).tolist()

# get any item that has a comma, and split it into separate pieces
splits = [x.split(',') for x in unique_reasons if ',' in x]
# take out all the items you just split from the main list
unique_reasons = [y for y in unique_reasons if ',' not in y]
# combine the two lists, and make sure that each item in final combined list is only in there one time
new_list = unique_reasons + list(chain.from_iterable(splits))

unique_reasons_set = set(new_list)
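
The split-and-flatten steps above can also be written as one pandas chain. This is only a sketch of an equivalent approach, assuming the reasons are separated by plain commas; the name unique_tokens is introduced here for illustration and is not part of the answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"reason": [
    "business name", np.nan, "name", "business name",
    "individual_name", "DOB", "Business name,name",
]})

# Drop missing reasons, split on commas, give every piece its own row,
# trim stray whitespace, then keep each distinct piece once.
unique_tokens = (
    df["reason"]
    .dropna()
    .str.split(",")
    .explode()
    .str.strip()
    .unique()
    .tolist()
)
print(unique_tokens)
```

Note that 'business name' and 'Business name' stay distinct here, as they do in the list-based version; adding .str.lower() to the chain would merge them.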

For each item in unique_reasons_set, create a boolean mask: True if df['reason'] contains the item as a substring, otherwise False.

import numpy as np

new_cols = []
for item in unique_reasons_set:
    # na=False keeps rows with a missing reason False instead of NaN
    col = np.where(df['reason'].str.contains(item, na=False), True, False)
    new_cols.append(col)
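
One caveat: substring matching still marks name as True for a row whose reason is only business name, which is the overlap described in the question. A minimal sketch of exact-token matching follows, assuming reasons are comma-separated and should compare case-insensitively; the helper tokens is hypothetical and not part of the answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"reason": ["business name", np.nan, "name", "Business name,name"]})

def tokens(value):
    """Split a comma-separated reason into a set of lowercase tokens (hypothetical helper)."""
    if pd.isna(value):
        return set()
    return {part.strip().lower() for part in value.split(",")}

token_sets = df["reason"].map(tokens)

# Substring test: 'name' is found inside 'business name' as well.
substring_name = df["reason"].str.contains("name", na=False)
# Exact-token test: 'name' must be one of the comma-separated pieces.
exact_name = token_sets.map(lambda s: "name" in s)

print(pd.DataFrame({"reason": df["reason"],
                    "substring": substring_name,
                    "exact": exact_name}))
```

Only the exact-token column leaves the first row False while still flagging the combined row.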
    

Take all of these new columns and concatenate them onto the original dataframe.

df2 = pd.DataFrame.from_dict(dict(zip(unique_reasons_set, new_cols)))
df = pd.concat([df, df2], axis=1)

Full code

import pandas as pd
import numpy as np
from itertools import chain
reasons = [       
'business name'  ,
np.nan,    
'name'   ,            
'business name'   ,    
'individual_name' ,    
'DOB'     ,            
'Business name,name']

df = pd.DataFrame(reasons)
df.columns=['reason']

unique_reasons = pd.unique(df.reason.dropna()).tolist()

# get any item that has a comma, and split it into separate pieces
splits = [x.split(',') for x in unique_reasons if ',' in x]
#take out all the items you just split from the main list
unique_reasons =[y for y in unique_reasons if ',' not in y]
# combine the two lists, and make sure that each item in final combined list is only in there one time. Need the chain to flatten a 2d list that results from split
new_list = unique_reasons + list(chain.from_iterable(splits))

unique_reasons_set = set(new_list)


new_cols = []
for item in unique_reasons_set:
    # na=False keeps rows with a missing reason False instead of NaN
    col = np.where(df['reason'].str.contains(item, na=False), True, False)
    new_cols.append(col)


df2 = pd.DataFrame.from_dict(dict(zip(unique_reasons_set, new_cols)))
df = pd.concat([df,df2], axis=1)
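
As an alternative that avoids the keyword lists altogether, pandas' Series.str.get_dummies can split on the separator and build one indicator column per exact token. The sketch below is not the answer's method. It assumes commas with no surrounding spaces as the separator, and it assumes lowercasing and replacing spaces with underscores is acceptable so that 'Business name' and 'business name' share a column (which also turns DOB into dob).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"reason": [
    "business name", np.nan, "name", "business name",
    "individual_name", "DOB", "Business name,name",
]})

# Normalise case and spacing so variants of the same reason collapse together.
cleaned = df["reason"].str.lower().str.replace(" ", "_", regex=False)

# One 0/1 column per exact comma-separated token; missing reasons become all zeros.
dummies = cleaned.str.get_dummies(sep=",")

# Convert the indicators to the Yes/No strings asked for in the question.
yes_no = pd.DataFrame(
    np.where(dummies.to_numpy().astype(bool), "Yes", "No"),
    index=dummies.index,
    columns=dummies.columns,
)

result = pd.concat([df, yes_no], axis=1)
print(result)
```

Because each token is matched whole, a row containing only business name leaves the name column as No, while the combined row sets both to Yes.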

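【Comments】:

  • Also, when building the set, TypeError: unhashable type: 'list' is raised.
  • There is another error as well: ValueError: Wrong number of items passed 30, placement implies 1
  • Sorry, I originally posted the answer without checking it. I went back, verified my work, and edited the original.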