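One Hot Encoding Multiple Categorical Data in a Column: Answers

【Question title】: One Hot Encoding Multiple Categorical Data in a Column
【Posted】: 2020-10-06 00:54:42
【Question description】:

Beginner here. I want to use one-hot encoding on my dataframe, which has multiple categorical values in a single column. My dataframe looks like this, although the column contains many more values, so I can't do it by hand: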

Title       column
Movie 1   Action, Fantasy
Movie 2   Fantasy, Drama
Movie 3   Action
Movie 4   Sci-Fi, Romance, Comedy
Movie 5   NA
etc.

My desired output:

 Title     Action  Fantasy  Drama  Sci-Fi  Romance  Comedy
Movie 1     1       1        0      0        0       0
Movie 2     0       1        1      0        0       0
Movie 3     1       0        0      0        0       0
Movie 4     0       0        0      1        1       1
Movie 5     0       0        0      0        0       0  
etc.

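Thanks!

【Question comments】:

Do you know all the possible values in the column ahead of time (i.e. A, B, ... F)?
Welcome to Stack Overflow! Please update your question with the desired behavior, a specific problem, and code that reproduces the problem. See: How to create a Minimal, Complete, and Verifiable example.
@AlenaVolkova Yes, I know the possible values in the column.
@HeisAif I'm not sure what to include, because so far the only thing I have is the dataframe loaded from a .csv file. My question right now is how to use one-hot encoding on my dataframe.

Tags: python one-hot-encoding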

【Solution 1】:

Consider the input data to be:

import numpy as np
import pandas as pd
data = {'Title': ['Movie 1', 'Movie 2', 'Movie 3', 'Movie 4', 'Movie 5'], 
        'column': ['Action, Fantasy', 'Fantasy, Drama', 'Action', 'Sci-Fi, Romance, Comedy', np.nan]}
df = pd.DataFrame(data)
df
    Title   column
0   Movie 1 Action, Fantasy
1   Movie 2 Fantasy, Drama
2   Movie 3 Action
3   Movie 4 Sci-Fi, Romance, Comedy
4   Movie 5 NaN

This code produces the desired output:

# treat null values (assign the result back instead of filling a column selection in place)
df['column'] = df['column'].fillna('NA')

# separate all genres into one list, considering comma + space as separators
genre = df['column'].str.split(', ').tolist()

# flatten the list
flat_genre = [item for sublist in genre for item in sublist]

# convert to a set to make unique
set_genre = set(flat_genre)

# back to list
unique_genre = list(set_genre)

# remove the NA placeholder, if it is present
if 'NA' in unique_genre:
    unique_genre.remove('NA')

# create columns by each unique genre
df = df.reindex(df.columns.tolist() + unique_genre, axis=1, fill_value=0)

# for each value inside column, update the dummy
for index, row in df.iterrows():
    for val in row.column.split(', '):
        if val != 'NA':
            df.loc[index, val] = 1

df.drop('column', axis = 1, inplace = True)    
df
    Title   Action  Fantasy Comedy  Sci-Fi  Drama   Romance
0   Movie 1 1       1       0       0       0       0
1   Movie 2 0       1       0       0       1       0
2   Movie 3 1       0       0       0       0       0
3   Movie 4 0       0       1       1       0       1
4   Movie 5 0       0       0       0       0       0

Update: I added a null value to the test data and handle it in the first line of the solution.
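For comparison, here is a shorter sketch that reaches the same kind of result with pandas' built-in `Series.str.get_dummies`. This is a different technique from the loop above, not the answerer's method, and it assumes the separator is always exactly a comma followed by one space.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Title': ['Movie 1', 'Movie 2', 'Movie 3', 'Movie 4', 'Movie 5'],
    'column': ['Action, Fantasy', 'Fantasy, Drama', 'Action',
               'Sci-Fi, Romance, Comedy', np.nan],
})

# Split each cell on ', ' and build one 0/1 indicator column per distinct token.
# A missing cell (NaN) becomes a row of zeros.
dummies = df['column'].str.get_dummies(sep=', ')

# Attach the indicator columns to the titles, leaving the raw text column out.
encoded = pd.concat([df[['Title']], dummies], axis=1)
print(encoded)
```

The indicator columns come out in alphabetical order, so the column order can differ from the output shown above.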

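【Discussion】:

This code works when I declare the data in the code, but not when I load the data from a csv file with: df = pd.DataFrame(pd.read_csv('Sample.csv', names = ['Title', 'column'])). The flat_genre line raises an error saying 'float' object is not iterable.
I see: your data contains a null value rather than the string NA. numpy represents a null value as a numeric constant of type float. I have changed the test data to cover this case and handle it in the solution. Please check whether it works now.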
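A minimal sketch of why the error appears, using a two-row Series made up for illustration (it is not the asker's data): the missing entry survives `.str.split` as a float NaN, and the flattening comprehension then tries to iterate over it.

```python
import numpy as np
import pandas as pd

# Made-up two-row example with one missing entry.
s = pd.Series(['Action, Fantasy', np.nan])

# The missing-value marker is an ordinary Python float.
print(type(np.nan))  # <class 'float'>

# .str.split leaves the missing entry as NaN instead of returning a list,
# so the flattening step fails when it reaches that entry.
parts = s.str.split(', ').tolist()
error_message = None
try:
    flat = [item for sublist in parts for item in sublist]
except TypeError as exc:
    error_message = str(exc)
    print(error_message)

# Filling the gap with a placeholder string first makes every entry splittable.
filled_parts = s.fillna('NA').str.split(', ').tolist()
flat = [item for sublist in filled_parts for item in sublist]
print(flat)
```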

【Solution 2】:

### Import libraries and load sample data

import numpy as np
import pandas as pd

data = {
    'Movie 1': ['Action, Fantasy'],
    'Movie 2': ['Fantasy, Drama'],
    'Movie 3': ['Action'],
    'Movie 4': ['Sci-Fi, Romance, Comedy'],
    'Movie 5': ['NA'],
}

df = pd.DataFrame.from_dict(data, orient='index')
df.rename(columns={0:'column'}, inplace=True)

At this stage, our DataFrame looks like this:

           column
Movie 1    Action, Fantasy
Movie 2    Fantasy, Drama
Movie 3    Action
Movie 4    Sci-Fi, Romance, Comedy
Movie 5    NA

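Now, the question we want to ask is: does a given genre word (a "substring") appear in the 'column' of a given movie?

To do that, we first need a list of genre words: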

### Join every string in every row, split the result, pull out the unique values.
genres = np.unique(', '.join(df['column']).split(', '))
### Drop 'NA'
genres = np.delete(genres, np.where(genres == 'NA'))

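Depending on how large your dataset is, this can be computationally expensive. You mentioned that you already know the unique values, so you could define the `genres` iterable by hand instead.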
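Both options can be sketched side by side. The derived version below uses `split` plus `explode` instead of the join-and-split approach above, purely as an illustration; `known_genres` is a hand-written list that assumes the sample data is the full vocabulary.

```python
import pandas as pd

df = pd.DataFrame(
    {'column': ['Action, Fantasy', 'Fantasy, Drama', 'Action',
                'Sci-Fi, Romance, Comedy', 'NA']},
    index=['Movie 1', 'Movie 2', 'Movie 3', 'Movie 4', 'Movie 5'],
)

# Option 1: derive the vocabulary from the data.
# split -> one list per row; explode -> one token per row;
# the set difference drops duplicates and the 'NA' placeholder.
tokens = df['column'].str.split(', ').explode()
derived_genres = sorted(set(tokens) - {'NA'})

# Option 2: if the vocabulary is known up front, write it down by hand.
known_genres = ['Action', 'Comedy', 'Drama', 'Fantasy', 'Romance', 'Sci-Fi']

print(derived_genres)
```

Writing the list by hand skips the pass over the data, but it silently ignores any genre that is in the file and missing from the list.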

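Get the one-hot vectors:

for genre in genres:
    # regex=False: match the genre as literal text rather than as a regular expression
    df[genre] = df['column'].str.contains(genre, regex=False).astype('int')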

df.drop('column', axis=1, inplace=True)

We iterate over every genre and ask whether it occurs in 'column'. This returns True or False, which become 1 or 0 respectively when we cast with astype('int').

We end up with:
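One caveat with a substring test is that a genre name contained inside a longer genre name also matches. The sketch below uses made-up data ('Live Action' is a hypothetical genre, not one from the question) to contrast the substring test with an exact-token test.

```python
import pandas as pd

# Hypothetical data: 'Live Action' is a genre distinct from 'Action'.
s = pd.Series(['Live Action, Drama', 'Action'], index=['Movie A', 'Movie B'])

# Substring test: 'Action' is found inside 'Live Action' as well.
by_substring = s.str.contains('Action', regex=False).astype(int)

# Exact-token test: split into a list first, then check list membership.
by_token = s.str.split(', ').apply(lambda parts: 'Action' in parts).astype(int)

print(by_substring.tolist())  # [1, 1]
print(by_token.tolist())      # [0, 1]
```

With the six genres in the question no name is contained in another, so both tests would give the same answer there.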

          Action    Comedy  Drama   Fantasy Romance Sci-Fi
Movie 1        1         0      0         1       0      0
Movie 2        0         0      1         1       0      0
Movie 3        1         0      0         0       0      0
Movie 4        0         1      0         0       1      1
Movie 5        0         0      0         0       0      0

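【Discussion】:

This also works, but when I load the data from a csv file with df = pd.DataFrame(pd.read_csv('Sample.csv', names = ['Title', 'column'])) I get an error too. The error occurs on the line genres = np.unique(', '.join(df['column']).split(', ')) and says "sequence item 4: expected str instance, float found".
@razortight That's probably because your column contains NaN. NaN is of type float. You can do df.fillna('NA', inplace=True). Also, you don't need to wrap pd.read_csv() in pd.DataFrame(). Technically there's nothing "wrong" with it; it's just redundant and makes no difference.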
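A rough sketch of the loading step, with an in-memory string standing in for Sample.csv (its contents here are invented; whether the real file held the literal text NA or an empty cell is not known). By default `read_csv` parses the text NA as a missing value, which is one way a NaN can end up in the column.

```python
import io
import pandas as pd

# Stand-in for Sample.csv (hypothetical contents, no header row).
csv_text = 'Movie 4,"Sci-Fi, Romance, Comedy"\nMovie 5,NA\n'

# Default parsing: the literal text NA is read as a missing value (NaN).
df_default = pd.read_csv(io.StringIO(csv_text), names=['Title', 'column'])

# read_csv already returns a DataFrame, so no pd.DataFrame(...) wrapper is needed.
print(type(df_default).__name__)             # DataFrame
print(df_default['column'].isna().tolist())  # [False, True]

# Option A: replace the missing values after loading.
df_filled = df_default.fillna({'column': 'NA'})

# Option B: tell read_csv not to treat NA-like strings as missing.
df_raw = pd.read_csv(io.StringIO(csv_text), names=['Title', 'column'],
                     keep_default_na=False)

print(df_filled['column'].tolist())
print(df_raw['column'].tolist())
```

Option B only helps when the file contains a placeholder string; a truly empty cell then comes back as an empty string rather than 'NA'.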