Creating one-hot encodings from a column of dictionaries with pandas

【Question title】: Creating one hot encodings from a column of dictionaries with pandas
【Posted】: 2018-06-21 03:24:28
【Question】:

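I am working on a project that uses the public IMDB dataset, and I want to extract the genre data from each string and store it in separate columns. This is what I currently have.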

Current:
ID    Genres
1995  [{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]

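What I am trying to achieve is to split the data into the individual genres that belong to each movie ID, e.g. movie ID 1995: Action, Adventure, Fantasy, Science Fiction

In short, I have many rows whose strings contain the data I want, and I would like to extract the relevant data (the genres) for each ID.

How can I do this in Python? I have been experimenting with pandas, but so far I can only get True/False for a single genre.

CSV file here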

import pandas as pd
import numpy as np
import os
import re
import matplotlib.pyplot as plt
# Order of the Column headers for the re-arranged data

Genres = ['Action','Adventure','Biography','Comedy','Crime','Documentary','Drama','Family','Fantasy',
          'Film-Noir','History','Horror','Musical','Mystery','News','Romance','Sci-Fi','Short','Sport',
          'Thriller','War','Western']

# Raw string so the backslashes in the Windows path are not treated as escapes
os.chdir(r'C:\Users\parmi\Documents\Python Scripts')
org_data = pd.read_csv('tmdb_5000_movies.csv')


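# org_data is already a DataFrame, so the columns can be selected directly
film_id = org_data['id']
genre_data = org_data['genres']

# Attempted extraction. Note: `Genre` is not defined (the list above is named
# `Genres`), and str.extract expects a regex pattern rather than a list.
genre_data = genre_data.str.extract(Genre)
genre_combined = pd.concat([film_id, genre_data], axis=1)
genre_combined.to_csv('genre_data2.csv')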

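【Comments】:

Please first post your code together with sample data as a pandas DataFrame.
Do you want columns of 0s and 1s, depending on whether the movie has that genre?
Yes, that would work very well for what I am trying to achieve.
Please see the answer below, and mark it as accepted if it helps. Thanks.

Tags: python pandas dataframe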

【Solution 1】:

First, load your data:

df = pd.read_csv('tmdb_5000_movies.csv')

Next, genres contains JSON data, so parse each cell into a list of dicts:

import json
v = df.genres.apply(json.loads)
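
To make the parsing step concrete, here is a minimal sketch on two made-up cells. The sample strings are assumptions modeled on the row shown in the question and are not taken from the CSV.

```python
import json
import pandas as pd

# Hypothetical sample mimicking the `genres` column: each cell is a JSON string.
sample = pd.Series([
    '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]',
    '[{"id": 18, "name": "Drama"}]',
])

# json.loads turns each string into a Python list of dicts.
parsed_sample = sample.apply(json.loads)

# Genre names of the first row, and the number of genres in every row.
first_names = [d["name"] for d in parsed_sample.iloc[0]]
lengths = parsed_sample.str.len().tolist()
print(first_names, lengths)
```

The per-row lengths are what the flattening step relies on to know how many times each id has to be repeated.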

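Next, flatten your data by repeating each id once per genre (the array-method form of np.repeat) and concatenating the per-movie lists: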

df = pd.DataFrame(
{
    'id' : df['id'].values.repeat(v.str.len(), axis=0),
    'genre' : np.concatenate(v.tolist())
})
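
The flattening can be easier to follow on a tiny input. The sketch below uses hypothetical ids and genre lists (`ids` and `v_demo` are illustrative names, not part of the answer) and applies the same repeat and concatenate calls.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two movie ids, each with a list of genre dicts.
ids = pd.Series([1, 2], name="id")
v_demo = pd.Series([
    [{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}],
    [{"id": 18, "name": "Drama"}],
])

flat_demo = pd.DataFrame({
    # Repeat each id once per genre in its list, giving [1, 1, 2].
    "id": ids.to_numpy().repeat(v_demo.str.len().to_numpy()),
    # Join the per-movie lists into one flat array of dicts.
    "genre": np.concatenate(v_demo.tolist()),
})
print(flat_demo)
```

Each row of the result pairs one id with one genre dict, so a movie with two genres occupies two rows.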

Convert genre from a column of dicts into a column of strings by retrieving the name attribute from each dict.

df['genre'] = df['genre'].map(lambda x: x.get('name'))

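Finally, compute the one-hot encoding with str.get_dummies and sum per id so that each movie ends up on a single row:

ohe = (
    df.set_index('id')
      .genre.str.get_dummies()
      # sum(level=0) was removed in pandas 2.0; sort=False keeps the original row order
      .groupby(level=0, sort=False).sum()
)

ohe.head(10)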

        Action  Adventure  Animation  Comedy  Crime  Documentary  Drama  \
id                                                                        
19995        1          1          0       0      0            0      0   
285          1          1          0       0      0            0      0   
206647       1          1          0       0      1            0      0   
49026        1          0          0       0      1            0      1   
49529        1          1          0       0      0            0      0   
559          1          1          0       0      0            0      0   
38757        0          0          1       0      0            0      0   
99861        1          1          0       0      0            0      0   
767          0          1          0       0      0            0      0   
209112       1          1          0       0      0            0      0   

        Family  Fantasy  Foreign  History  Horror  Music  Mystery  Romance  \
id                                                                           
19995        0        1        0        0       0      0        0        0   
285          0        1        0        0       0      0        0        0   
206647       0        0        0        0       0      0        0        0   
49026        0        0        0        0       0      0        0        0   
49529        0        0        0        0       0      0        0        0   
559          0        1        0        0       0      0        0        0   
38757        1        0        0        0       0      0        0        0   
99861        0        0        0        0       0      0        0        0   
767          1        1        0        0       0      0        0        0   
209112       0        1        0        0       0      0        0        0   

        Science Fiction  TV Movie  Thriller  War  Western  
id                                                         
19995                 1         0         0    0        0  
285                   0         0         0    0        0  
206647                0         0         0    0        0  
49026                 0         0         1    0        0  
49529                 1         0         0    0        0  
559                   0         0         0    0        0  
38757                 0         0         0    0        0  
99861                 1         0         0    0        0  
767                   0         0         0    0        0  
209112                0         0         0    0        0
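
For readers who want to try the approach without the CSV, the following is a minimal end-to-end sketch on an inline sample. The `movies` frame and its ids are made up, a list comprehension stands in for the concatenate-and-map steps, and `groupby(level=0).sum()` is used for the per-id sum. The last line also builds the comma-separated genre list per id that the question describes.

```python
import json
import pandas as pd

# Hypothetical stand-in for tmdb_5000_movies.csv (ids and genres are made up).
movies = pd.DataFrame({
    "id": [10, 20, 30],
    "genres": [
        '[{"id": 28, "name": "Action"}, {"id": 14, "name": "Fantasy"}]',
        '[{"id": 18, "name": "Drama"}]',
        '[{"id": 28, "name": "Action"}, {"id": 18, "name": "Drama"}]',
    ],
})

# Parse the JSON strings, then build one (id, genre name) row per genre.
parsed = movies["genres"].apply(json.loads)
flat = pd.DataFrame({
    "id": movies["id"].to_numpy().repeat(parsed.str.len().to_numpy()),
    "genre": [d["name"] for genres in parsed for d in genres],
})

# One-hot encode the names and collapse back to one row per id.
ohe_demo = (
    flat.set_index("id")["genre"]
        .str.get_dummies()
        .groupby(level=0, sort=False).sum()
)

# Comma-separated genre names per id, in the form the question asks for.
names_per_id = flat.groupby("id", sort=False)["genre"].agg(", ".join)
print(ohe_demo)
print(names_per_id)
```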

【Comments】: