Split string with missing data/levels over several columns: answers

【Question title】: Split string with missing data/levels over several columns
【Posted】: 2021-12-31 23:06:51
【Question】:

Suppose we have this df:

df = pd.DataFrame({
        'string': ['blue',
                   'blue,red',
                   'red',
                   'purple',
                   'blue,red,green,yellow,magenta,purple',
                   '',
                   'yellow,purple']})

Suppose we want to split these strings at every comma and put the pieces into new columns. As found here (How can I separate text into multiple values in a CSV file using Python?) and here (Split one column into multiple columns by multiple delimiters in Pandas), we can use str.split, so:

df[['blue', 'red', 'green', 'yellow', 'magenta', 'purple']] = df['string'].str.split(',',expand=True)
df

Result:

    string                                  blue    red     green   yellow  magenta purple

0   blue                                    blue    None    None    None    None    None
1   blue,red                                blue    red     None    None    None    None
2   red                                     red     None    None    None    None    None
3   purple                                  purple  None    None    None    None    None
4   blue,red,green,yellow,magenta,purple    blue    red     green   yellow  magenta purple
5                                                   None    None    None    None    None
6   yellow,purple                           yellow  purple  None    None    None    None

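Indices 0, 1 and 4 work as I want: the colors are sorted into the correct new columns.

For the other indices the colors end up in the wrong columns. Note that the examples linked above do not solve my problem, because they have no missing data points/levels. How can I fix this? (Also, how do I get None in the "blue" column for index 5?)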
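
To make the misalignment concrete, here is a minimal sketch (the three-row series `s` is an illustrative subset of the data above, not part of the original question). It shows that `str.split(..., expand=True)` fills the output columns by position, so the first token of each row always lands in the first column, whatever its value is.

```python
import pandas as pd

# Illustrative subset of the strings from the question.
s = pd.Series(['blue', 'red', 'yellow,purple'])

# expand=True returns one column per token position (0, 1, ...),
# not one column per distinct value.
parts = s.str.split(',', expand=True)

print(parts)
# Row 1 ('red') puts 'red' in column 0, the same column that holds
# 'blue' for row 0; rows with fewer tokens are padded with missing values.
```

Assigning these positional columns to the names 'blue', 'red', ... therefore only lines up for rows whose tokens happen to appear in exactly that order from the start.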

Many thanks

【Comments】:

Tags: python pandas string dataframe

【Solution 1】:

import pandas as pd
from typing import List
from pandas import DataFrame    


# your input
df = pd.DataFrame({
        'string': ['blue',
                   'blue,red',
                   'red',
                   'purple',
                   'blue,red,green,yellow,magenta,purple',
                   '',
                   'yellow,purple']})


# define columns
input_col = "string"    
encoded_col = "string_enriched"

# unique colors, sorted alphabetically (the empty string from row 5 is dropped)
unique_colors: List[str] = sorted({c for s in df[input_col] for c in s.split(",") if c})

# create column holding a list of 0/1 flags, one flag per unique color
df[encoded_col] = df[input_col].apply(lambda x: [1 if w in x.split(",") else 0 for w in unique_colors])

# enrich encoded column: swap each 1 for its color name and each 0 for ""
df[encoded_col] = df[encoded_col].apply(lambda flags: [color if flag else "" for flag, color in zip(flags, unique_colors)])

# expand the lists into one column per color
dx = pd.DataFrame(df[encoded_col].to_list(), columns=unique_colors, index=df.index)

# concat the two dataframes horizontally
df_out: DataFrame = pd.concat([df[input_col], dx], axis=1)
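
The answer above does not show its output, and it fills absent colors with "" rather than None. As a hedged alternative, the following sketch builds the same kind of table in one pass and uses None for absent colors. It is not the answerer's code; the names `tokens` and `encoded` are introduced here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({
    'string': ['blue',
               'blue,red',
               'red',
               'purple',
               'blue,red,green,yellow,magenta,purple',
               '',
               'yellow,purple']})

# Split each row into its tokens, dropping the empty token that '' produces.
tokens = df['string'].apply(lambda s: [t for t in s.split(',') if t])

# All distinct colors, sorted alphabetically as in the answer above.
unique_colors = sorted(set().union(*tokens))

# One column per color: the color name where the row contains it, else None.
encoded = pd.DataFrame(
    {color: [color if color in row else None for row in tokens]
     for color in unique_colors},
    index=df.index,
)

df_out = pd.concat([df['string'], encoded], axis=1)
print(df_out)
```

Because each column is keyed by the color itself, a value can only ever appear under its own name, which is what the positional split in the question could not guarantee.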

【Comments】:

Thank you very much for providing an alternative solution! +1

【Solution 2】:

I would do it like this:

import pandas as pd
df = pd.DataFrame({
        'string': ['blue',
                   'blue,red',
                   'red',
                   'purple',
                   'blue,red,green,yellow,magenta,purple',
                   '',
                   'yellow,purple']})
def convert_to_dict(string):
    return dict((i,i) for i in string.split(",") if i)
df2 = df['string'].apply(convert_to_dict).apply(pd.Series)
finaldf = pd.concat([df,df2],axis=1)
print(finaldf)

Output:

                                 string  blue  red  purple  green  yellow  magenta
0                                  blue  blue  NaN     NaN    NaN     NaN      NaN
1                              blue,red  blue  red     NaN    NaN     NaN      NaN
2                                   red   NaN  red     NaN    NaN     NaN      NaN
3                                purple   NaN  NaN  purple    NaN     NaN      NaN
4  blue,red,green,yellow,magenta,purple  blue  red  purple  green  yellow  magenta
5                                         NaN  NaN     NaN    NaN     NaN      NaN
6                         yellow,purple   NaN  NaN  purple    NaN  yellow      NaN

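Explanation: the key step is converting each string into a dict in which every key equals its value. I then convert the resulting single pandas.Series of dicts into a pandas.DataFrame and pandas.concat it with the original pandas.DataFrame.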
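
The dict-based alignment described above can be sketched on its own as follows. This variant passes the list of dicts straight to the `pd.DataFrame` constructor instead of calling `.apply(pd.Series)`; that swap is an assumption made for brevity, not the answerer's code, and the names `dicts` and `aligned` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'string': ['blue',
               'blue,red',
               'red',
               'purple',
               'blue,red,green,yellow,magenta,purple',
               '',
               'yellow,purple']})

def convert_to_dict(string):
    # Each non-empty token becomes both key and value, e.g.
    # 'blue,red' -> {'blue': 'blue', 'red': 'red'} and '' -> {}.
    return {token: token for token in string.split(',') if token}

dicts = df['string'].apply(convert_to_dict).tolist()

# The DataFrame constructor aligns dict keys to columns, so every color
# ends up under its own name and missing keys become NaN.
aligned = pd.DataFrame(dicts, index=df.index)

finaldf = pd.concat([df, aligned], axis=1)
print(finaldf)
```

The empty string yields an empty dict, so row 5 contains only missing values in the color columns.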

【Comments】:

Thank you very much @Daweo. +1 for this nice alternative solution using a dictionary.

【Solution 3】:

Use Series.str.get_dummies:

df1 = df.join(df['string'].str.get_dummies(','))
print (df1)
                                 string  blue  green  magenta  purple  red  \
0                                  blue     1      0        0       0    0   
1                              blue,red     1      0        0       0    1   
2                                   red     0      0        0       0    1   
3                                purple     0      0        0       1    0   
4  blue,red,green,yellow,magenta,purple     1      1        1       1    1   
5                                           0      0        0       0    0   
6                         yellow,purple     0      0        0       1    0   

   yellow  
0       0  
1       0  
2       0  
3       0  
4       1  
5       0  
6       1

If you also need the color names instead of 1 and None instead of 0, add:

import numpy as np

df1 = df['string'].str.get_dummies(',')
df = df.join(pd.DataFrame(np.where(df1, df1.columns.to_series(),None), 
                          index=df1.index,
                          columns=df1.columns))

print (df)
                                 string  blue  green  magenta  purple   red  \
0                                  blue  blue   None     None    None  None   
1                              blue,red  blue   None     None    None   red   
2                                   red  None   None     None    None   red   
3                                purple  None   None     None  purple  None   
4  blue,red,green,yellow,magenta,purple  blue  green  magenta  purple   red   
5                                        None   None     None    None  None   
6                         yellow,purple  None   None     None  purple  None   

   yellow  
0    None  
1    None  
2    None  
3    None  
4  yellow  
5    None  
6  yellow
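
get_dummies returns its columns in alphabetical order, whereas the question lists them as blue, red, green, yellow, magenta, purple. A minimal sketch of restoring that order with `reindex` follows; the list `wanted_order` is an assumption taken from the column names in the question, and `labels` and `result` are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'string': ['blue',
               'blue,red',
               'red',
               'purple',
               'blue,red,green,yellow,magenta,purple',
               '',
               'yellow,purple']})

# Column order used in the question (assumed to be the desired layout).
wanted_order = ['blue', 'red', 'green', 'yellow', 'magenta', 'purple']

# 0/1 indicator columns, reordered; a color that never occurs would get all 0s.
dummies = df['string'].str.get_dummies(',')
dummies = dummies.reindex(columns=wanted_order, fill_value=0)

# Replace each 1 with its column name and each 0 with None.
labels = pd.DataFrame(
    np.where(dummies.to_numpy(dtype=bool), dummies.columns.to_numpy(), None),
    index=dummies.index,
    columns=dummies.columns,
)

result = df.join(labels)
print(result)
```

Passing `fill_value=0` to `reindex` also lets `wanted_order` include expected levels that are absent from the data, which then show up as all-None columns.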

【Comments】:

This is a very elegant solution! +1
Works like a charm. Thanks a lot @jezrael