【Question title】: Split string with missing data/levels over several columns
【Posted】: 2021-12-31 23:06:51
【Question description】:

Suppose we have this df:

df = pd.DataFrame({
        'string': ['blue',
                   'blue,red',
                   'red',
                   'purple',
                   'blue,red,green,yellow,magenta,purple',
                   '',
                   'yellow,purple']})

Suppose we want to split these strings at each comma and put the pieces into new columns. As found here (How can I separate text into multiple values in a CSV file using Python?) and here (Split one column into multiple columns by multiple delimiters in Pandas), we can use str.split, so:

df[['blue', 'red', 'green', 'yellow', 'magenta', 'purple']] = df['string'].str.split(',',expand=True)
df

Result:

    string                                  blue    red     green   yellow  magenta purple

0   blue                                    blue    None    None    None    None    None
1   blue,red                                blue    red     None    None    None    None
2   red                                     red     None    None    None    None    None
3   purple                                  purple  None    None    None    None    None
4   blue,red,green,yellow,magenta,purple    blue    red     green   yellow  magenta purple
5                                                   None    None    None    None    None
6   yellow,purple                           yellow  purple  None    None    None    None

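Indices 0, 1 and 4 work as I want: the colors are sorted into the correct new columns.

For the other indices, the colors end up in the wrong columns. Note that the examples linked above do not solve my problem, because they have no missing data points/levels. How can I fix this? (Also, how can I get None in the 'blue' column for index 5?)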

Many thanks
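
The misplacement described in the question comes from str.split(expand=True) filling the new columns by token position, not by token value. A minimal sketch on a shortened sample (the names `s`, `parts` and `first_tokens` are only for illustration):

```python
import pandas as pd

s = pd.Series(['blue', 'red', 'yellow,purple'])

# expand=True numbers the result columns 0, 1, ... by token position;
# it has no notion of which color a token represents.
parts = s.str.split(',', expand=True)
print(parts)

# 'red' and 'yellow' are the first token of their rows, so they share
# column 0 with 'blue'.
first_tokens = parts[0].tolist()
```

Assigning color names to these positional columns therefore only lines up for rows whose tokens happen to appear in the same order, without gaps, as the column names.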

【Question discussion】:

    Tags: python pandas string dataframe


    【Solution 1】:
    import pandas as pd
    from typing import List
    from pandas import DataFrame    
    
    
    # your input
    df = pd.DataFrame({
            'string': ['blue',
                       'blue,red',
                       'red',
                       'purple',
                       'blue,red,green,yellow,magenta,purple',
                       '',
                       'yellow,purple']})
    
    
    # define columns
    input_col = "string"    
    encoded_col = "string_enriched"
    
    # unique colors
    unique_colors: List[str] = sorted([x for x in set(sum([x.split(",") for x in df[input_col].tolist()], [])) if x])
    
    # create column with unique colors encoded
    df[encoded_col] = df[input_col].apply(lambda x: (1 if w in x.split(",") else 0 for w in unique_colors))
    
    # enrich encoded column swapping 0,1 to colors
    df[encoded_col] = df.apply(lambda row: (w[1] if w[0] else "" for w in zip(row[encoded_col], unique_colors)), axis=1)
    
    # explode column into multiple columns
    dx = pd.DataFrame(df[encoded_col].to_list(), columns = unique_colors)
    
    # concat the two dataframes horizontally
    df_out: DataFrame = pd.concat([df[input_col], dx], axis=1)
    

    【Discussion】:

    • Thank you very much for providing an alternative solution! +1
    【Solution 2】:

    I would do it like this:

    import pandas as pd
    df = pd.DataFrame({
            'string': ['blue',
                       'blue,red',
                       'red',
                       'purple',
                       'blue,red,green,yellow,magenta,purple',
                       '',
                       'yellow,purple']})
    def convert_to_dict(string):
        return dict((i,i) for i in string.split(",") if i)
    df2 = df['string'].apply(convert_to_dict).apply(pd.Series)
    finaldf = pd.concat([df,df2],axis=1)
    print(finaldf)
    

    Output

                                     string  blue  red  purple  green  yellow  magenta
    0                                  blue  blue  NaN     NaN    NaN     NaN      NaN
    1                              blue,red  blue  red     NaN    NaN     NaN      NaN
    2                                   red   NaN  red     NaN    NaN     NaN      NaN
    3                                purple   NaN  NaN  purple    NaN     NaN      NaN
    4  blue,red,green,yellow,magenta,purple  blue  red  purple  green  yellow  magenta
    5                                         NaN  NaN     NaN    NaN     NaN      NaN
    6                         yellow,purple   NaN  NaN  purple    NaN  yellow      NaN
    

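    Explanation: the important part is converting each string into a dict in which every key equals its value. I then convert the result (a single pandas.Series holding dicts) into a pandas.DataFrame with .apply(pd.Series) and pandas.concat it with the original pandas.DataFrame.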
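
The same dict idea can also be sketched without .apply(pd.Series), by building the frame directly from a list of dicts. This is a variant for illustration, not the answerer's code; `colors_to_dict`, `wide` and `result` are hypothetical names, and the sample is shortened.

```python
import pandas as pd

df = pd.DataFrame({'string': ['blue', 'blue,red', 'red', '', 'yellow,purple']})


def colors_to_dict(text):
    # Map each non-empty token to itself,
    # e.g. 'blue,red' -> {'blue': 'blue', 'red': 'red'}.
    return {token: token for token in text.split(',') if token}


# A list of dicts becomes one row per dict; keys absent from a row are
# filled with missing values.
wide = pd.DataFrame([colors_to_dict(s) for s in df['string']], index=df.index)
result = pd.concat([df, wide], axis=1)
print(result)
```

The empty string yields an empty dict, so that row is missing in every color column.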

    【Discussion】:

    • Thank you very much @Daweo. +1 for this nice alternative solution using a dictionary.
    【Solution 3】:

    Use Series.str.get_dummies:

    df1 = df.join(df['string'].str.get_dummies(','))
    print (df1)
                                     string  blue  green  magenta  purple  red  \
    0                                  blue     1      0        0       0    0   
    1                              blue,red     1      0        0       0    1   
    2                                   red     0      0        0       0    1   
    3                                purple     0      0        0       1    0   
    4  blue,red,green,yellow,magenta,purple     1      1        1       1    1   
    5                                           0      0        0       0    0   
    6                         yellow,purple     0      0        0       1    0   
    
       yellow  
    0       0  
    1       0  
    2       0  
    3       0  
    4       1  
    5       0  
    6       1  
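
In the output above the indicator columns come out in alphabetical order, and only for colors that occur in the data. If a fixed column order is wanted, or levels that never occur should still get a column, one possible sketch is to reindex the columns. The `levels` list is an assumption about the full set of colors, and the sample is shortened.

```python
import pandas as pd

df = pd.DataFrame({'string': ['blue', 'blue,red', '', 'yellow,purple']})

# Hypothetical full list of levels, in the order wanted for the output.
levels = ['blue', 'red', 'green', 'yellow', 'magenta', 'purple']

dummies = df['string'].str.get_dummies(',')

# reindex reorders the columns and adds all-zero columns for levels
# ('green', 'magenta') that never occur in this sample.
dummies = dummies.reindex(columns=levels, fill_value=0)
result = df.join(dummies)
print(result)
```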
    

    If you also need the values instead of 1 and None instead of 0, add:

    import numpy as np

    df1 = df['string'].str.get_dummies(',')
    df = df.join(pd.DataFrame(np.where(df1, df1.columns.to_series(),None), 
                              index=df1.index,
                              columns=df1.columns))
    
    print (df)
                                     string  blue  green  magenta  purple   red  \
    0                                  blue  blue   None     None    None  None   
    1                              blue,red  blue   None     None    None   red   
    2                                   red  None   None     None    None   red   
    3                                purple  None   None     None  purple  None   
    4  blue,red,green,yellow,magenta,purple  blue  green  magenta  purple   red   
    5                                        None   None     None    None  None   
    6                         yellow,purple  None   None     None  purple  None   
    
       yellow  
    0    None  
    1    None  
    2    None  
    3    None  
    4  yellow  
    5    None  
    6  yellow  
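
For comparison, the same relabeling can be sketched in plain pandas without np.where. This is an illustrative alternative on a shortened sample, not the answerer's code; `flags`, `labeled` and `result` are hypothetical names.

```python
import pandas as pd

df = pd.DataFrame({'string': ['blue', 'blue,red', '', 'yellow,purple']})

# Boolean indicator per color, one column per color found in the data.
flags = df['string'].str.get_dummies(',').astype(bool)

# For each indicator column, keep the column name where the flag is set
# and None elsewhere.
labeled = pd.DataFrame(
    {name: [name if flag else None for flag in flags[name]]
     for name in flags.columns},
    index=flags.index,
)
result = df.join(labeled)
print(result)
```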
    

    【Discussion】:

    • This is a very elegant solution! +1
    • Works like a charm. Thank you very much @jezrael