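【Question title】: Pandas groupby dictionary
【Posted】: 2017-12-22 09:21:02
【Question】:

I'm new to pandas, so apologies if the solution is obvious.

I have a data frame containing scenes from different movies and the environment of each scene (see below):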

import pandas as pd
data = [{'movie' : 'movie_X', 'scene' : '1', 'environment' : 'home'}, 
        {'movie' : 'movie_X', 'scene' : '2', 'environment' : 'car'}, 
        {'movie' : 'movie_X', 'scene' : '3', 'environment' : 'home'}, 
        {'movie' : 'movie_Y', 'scene' : '1', 'environment' : 'home'}, 
        {'movie' : 'movie_Y', 'scene' : '2', 'environment' : 'office'}, 
        {'movie' : 'movie_Z', 'scene' : '1', 'environment' : 'boat'}, 
        {'movie' : 'movie_Z', 'scene' : '2', 'environment' : 'beach'}, 
        {'movie' : 'movie_Z', 'scene' : '3', 'environment' : 'home' }]
myDF = pd.DataFrame(data)

In this scenario, each movie belongs to several genres. I have a dictionary (below) that lists the genres each movie belongs to:

genreDict = {'movie_X' : ['romance', 'action'],
           'movie_Y' : ['comedy', 'romance', 'action'],
           'movie_Z' : ['horror', 'thriller', 'romance']}

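I would like to group myDF by this dictionary. Specifically, I want to be able to tell how many times a given environment occurs in a given genre (for example, in the genre horror, 'boat' is counted once, 'beach' is counted once, and 'home' is counted once). What is the best and most efficient way to do this? I tried mapping the dictionary onto the data frame and then grouping by the resulting list column: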

myDF['genres'] = myDF['movie'].map(genreDict)

which returns:

   movie    scene    environment               genres
0  movie_X     1        home            [romance, action]
1  movie_X     2         car            [romance, action]
2  movie_X     3        home            [romance, action]
3  movie_Y     1        home    [comedy, romance, action]
4  movie_Y     2      office    [comedy, romance, action]
5  movie_Z     1        boat  [horror, thriller, romance]
6  movie_Z     2       beach  [horror, thriller, romance]
7  movie_Z     3        home  [horror, thriller, romance]

However, I get an error saying that the list is unhashable. I hope you can help :)

【Question comments】:

  • Can you post the output dataset you want?

Tags: list pandas dictionary dataframe pandas-groupby


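【Solution 1】:

Non-scalar objects generally cause problems in pandas. Beyond that, you need to tidy your data so that the next steps are easier (the main operations on tabular structures are usually defined on tidy datasets). You need a dataset in which the genres are not all listed in a single row; instead, each genre gets its own row.

Here is one possible way to achieve this: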

genre_df = pd.DataFrame(myDF['movie'].map(genreDict).tolist())

df = myDF.join(genre_df.stack().rename('genre').reset_index(level=1, drop=True))
df
Out: 
  environment    movie scene     genre
0        home  movie_X     1   romance
0        home  movie_X     1    action
1         car  movie_X     2   romance
1         car  movie_X     2    action
2        home  movie_X     3   romance
2        home  movie_X     3    action
3        home  movie_Y     1    comedy
3        home  movie_Y     1   romance
3        home  movie_Y     1    action
4      office  movie_Y     2    comedy
4      office  movie_Y     2   romance
4      office  movie_Y     2    action
5        boat  movie_Z     1    horror
5        boat  movie_Z     1  thriller
5        boat  movie_Z     1   romance
6       beach  movie_Z     2    horror
6       beach  movie_Z     2  thriller
6       beach  movie_Z     2   romance
7        home  movie_Z     3    horror
7        home  movie_Z     3  thriller
7        home  movie_Z     3   romance
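
As a hedged alternative, assuming pandas 0.25 or later (newer than this answer), the same one-row-per-genre layout can be sketched with `DataFrame.explode`. The name `long_df` is only illustrative, and the column order may differ from the output above depending on the pandas version.

```python
import pandas as pd

data = [
    {'movie': 'movie_X', 'scene': '1', 'environment': 'home'},
    {'movie': 'movie_X', 'scene': '2', 'environment': 'car'},
    {'movie': 'movie_X', 'scene': '3', 'environment': 'home'},
    {'movie': 'movie_Y', 'scene': '1', 'environment': 'home'},
    {'movie': 'movie_Y', 'scene': '2', 'environment': 'office'},
    {'movie': 'movie_Z', 'scene': '1', 'environment': 'boat'},
    {'movie': 'movie_Z', 'scene': '2', 'environment': 'beach'},
    {'movie': 'movie_Z', 'scene': '3', 'environment': 'home'},
]
genreDict = {
    'movie_X': ['romance', 'action'],
    'movie_Y': ['comedy', 'romance', 'action'],
    'movie_Z': ['horror', 'thriller', 'romance'],
}

myDF = pd.DataFrame(data)

# Map each movie to its list of genres, then give every list element
# its own row. The original index labels are repeated, as in the join above.
long_df = (
    myDF.assign(genre=myDF['movie'].map(genreDict))
        .explode('genre')
)
print(long_df)
```

The result has one row per (scene, genre) pair, so it can be passed to `groupby` or `pd.crosstab` in the same way as `df`.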

Once you have a structure like this, it is much easier to group or cross-tabulate the data:

df.groupby('genre').size()
Out: 
genre
action      5
comedy      2
horror      3
romance     8
thriller    3
dtype: int64

pd.crosstab(df['genre'], df['environment'])
Out: 
environment  beach  boat  car  home  office
genre                                      
action           0     0    1     3       1
comedy           0     0    0     1       1
horror           1     1    0     1       0
romance          1     1    1     4       1
thriller         1     1    0     1       0
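
To connect the cross-tabulation back to the example in the question (environments counted within the genre horror), a single row or cell can be read from the table with `.loc`. The sketch below is self-contained and rebuilds the long-format data with a plain list comprehension; the names `scenes`, `rows`, `table` and `horror` are illustrative, not part of the original answer.

```python
import pandas as pd

# (movie, environment) for each scene, taken from the question's data
scenes = [
    ('movie_X', 'home'), ('movie_X', 'car'), ('movie_X', 'home'),
    ('movie_Y', 'home'), ('movie_Y', 'office'),
    ('movie_Z', 'boat'), ('movie_Z', 'beach'), ('movie_Z', 'home'),
]
genreDict = {
    'movie_X': ['romance', 'action'],
    'movie_Y': ['comedy', 'romance', 'action'],
    'movie_Z': ['horror', 'thriller', 'romance'],
}

# One row per (scene, genre) pair
rows = [
    {'movie': movie, 'environment': env, 'genre': genre}
    for movie, env in scenes
    for genre in genreDict[movie]
]
df = pd.DataFrame(rows)

table = pd.crosstab(df['genre'], df['environment'])

# All environments that occur at least once in the genre horror
horror = table.loc['horror']
print(horror[horror > 0])

# A single genre/environment count
print(table.loc['horror', 'boat'])
```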

Here is an excellent read by Hadley Wickham: Tidy Data

【Comments】:

    【Solution 2】:

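    For larger DataFrames, a faster option is to use numpy: repeat each row by the length of its list with numpy.repeat, flatten the lists with numpy.concatenate, and select the repeated rows by their index values. This starts from the genres column created in the question (myDF['genres'] = myDF['movie'].map(genreDict)):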

    import numpy as np

    # get the length of each list in column genres
    l = myDF['genres'].str.len()
    # convert the column to a numpy array
    vals = myDF['genres'].values
    # repeat each index label by the length of its list
    idx = np.repeat(myDF.index, l)
    #expand rows by duplicated index values 
    myDF = myDF.loc[idx]
    #flattening lists column
    myDF['genres'] = np.concatenate(vals)
    #default monotonic index (0,1,2...)
    myDF = myDF.reset_index(drop=True)
    print (myDF)
       environment    movie scene    genres
    0         home  movie_X     1   romance
    1         home  movie_X     1    action
    2          car  movie_X     2   romance
    3          car  movie_X     2    action
    4         home  movie_X     3   romance
    5         home  movie_X     3    action
    6         home  movie_Y     1    comedy
    7         home  movie_Y     1   romance
    8         home  movie_Y     1    action
    9       office  movie_Y     2    comedy
    10      office  movie_Y     2   romance
    11      office  movie_Y     2    action
    12        boat  movie_Z     1    horror
    13        boat  movie_Z     1  thriller
    14        boat  movie_Z     1   romance
    15       beach  movie_Z     2    horror
    16       beach  movie_Z     2  thriller
    17       beach  movie_Z     2   romance
    18        home  movie_Z     3    horror
    19        home  movie_Z     3  thriller
    20        home  movie_Z     3   romance
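
The mechanics of the two numpy calls used above can be seen on a small toy input. This is only an illustrative sketch; the names `index`, `lists`, `repeated` and `flat` are hypothetical and do not appear in the answer.

```python
import numpy as np

index = np.array([0, 1, 2])
lists = [['romance', 'action'], ['comedy'], ['horror', 'thriller', 'romance']]

# Length of each list: [2, 1, 3]
lengths = [len(x) for x in lists]

# Each index label is repeated once per element of its list
repeated = np.repeat(index, lengths)

# The lists are flattened into one array, in the same order
flat = np.concatenate(lists)

# The two arrays line up element by element
for label, genre in zip(repeated, flat):
    print(label, genre)
```

Because `repeated` and `flat` have the same length and order, selecting rows by the repeated labels and then assigning the flattened values keeps every genre next to the row it came from.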
    

    Then use groupby and aggregate with size:

    df1 = myDF.groupby(['genres','environment']).size().reset_index(name='count')
    print (df1)
          genres environment  count
    0     action         car      1
    1     action        home      3
    2     action      office      1
    3     comedy        home      1
    4     comedy      office      1
    5     horror       beach      1
    6     horror        boat      1
    7     horror        home      1
    8    romance       beach      1
    9    romance        boat      1
    10   romance         car      1
    11   romance        home      4
    12   romance      office      1
    13  thriller       beach      1
    14  thriller        boat      1
    15  thriller        home      1
    

    【Comments】:
