【Question title】: Pythonic way to rank and then merge duplicated rows in a DataFrame
【Posted】: 2020-12-08 08:10:50
【Question】:

I have a large DataFrame in the following format:

name       ingredient       colour      similarity      ids      city      country     proba
pesto      ba               g           0.93            4        ve        it          0.85
pesto      sa               p           0.93            3        to        ca          0.92
pesto      li               y           0.99            6        lo        en          0.81
pasta      fl               w           0.88            2        de        in          0.8
pasta      wa               b           0.93            1        da        te          0.84
egg        eg               w           1               5        ro        ja          0.99

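Within each name, I want to rank the rows by similarity (a higher similarity gets a higher rank; if two rows have the same similarity, the order in which they are appended does not matter), and then merge all rows that share the same name into a single row.

The output should look like this: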

name   ingredient          colour           similarity         ids        city               country            proba
pesto  ['li', 'ba', 'sa']  ['y', 'g', 'p']  [0.99, 0.93, 0.93] [6, 4, 3]  ['lo', 've', 'to'] ['en', 'it', 'ca'] [0.81, 0.85, 0.92]
pasta  ['wa', 'fl']        ['b', 'w']       [0.93, 0.88]       [1, 2]     ['da', 'de']       ['te', 'in']       [0.84, 0.8]
egg    ['eg']              ['w']            [1]                [5]        ['ro']             ['ja']             [0.99]

【Comments】:

  • Oops, I missed ao at the end, sorry

Tags: python pandas numpy dataframe group-by


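【Solution 1】:

If the order of name matters, first convert name to an ordered categorical so that sorting follows the original order of appearance, then sort by both columns with DataFrame.sort_values, and finally aggregate each group into lists: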

df['name'] = pd.Categorical(df['name'], ordered=True, categories=df['name'].unique())

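# observed=True keeps only categories that actually occur and avoids the
# FutureWarning newer pandas versions raise for categorical group keys
df1 = (df.sort_values(['name', 'similarity'], ascending=[True, False])
         .groupby('name', observed=True)
         .agg(list))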

print (df1)
         ingredient     colour          similarity        ids          city  \
name                                                                          
pesto  [li, ba, sa]  [y, g, p]  [0.99, 0.93, 0.93]  [6, 4, 3]  [lo, ve, to]   
pasta      [wa, fl]     [b, w]        [0.93, 0.88]     [1, 2]      [da, de]   
egg            [eg]        [w]               [1.0]        [5]          [ro]   

            country               proba  
name                                     
pesto  [en, it, ca]  [0.81, 0.85, 0.92]  
pasta      [te, in]         [0.84, 0.8]  
egg            [ja]              [0.99]  
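
To try this approach end to end, here is a minimal self-contained sketch. The sample frame is retyped from the question, and `observed=True` is an addition of mine rather than part of the original answer; treat it as an illustration, not a definitive implementation.

```python
import pandas as pd

# Sample data retyped from the question.
df = pd.DataFrame({
    "name": ["pesto", "pesto", "pesto", "pasta", "pasta", "egg"],
    "ingredient": ["ba", "sa", "li", "fl", "wa", "eg"],
    "colour": ["g", "p", "y", "w", "b", "w"],
    "similarity": [0.93, 0.93, 0.99, 0.88, 0.93, 1.0],
    "ids": [4, 3, 6, 2, 1, 5],
    "city": ["ve", "to", "lo", "de", "da", "ro"],
    "country": ["it", "ca", "en", "in", "te", "ja"],
    "proba": [0.85, 0.92, 0.81, 0.8, 0.84, 0.99],
})

# Ordered categorical whose categories follow first appearance
# (pesto, pasta, egg), so sorting by name keeps the original order.
df["name"] = pd.Categorical(
    df["name"], categories=df["name"].unique(), ordered=True
)

df1 = (
    df.sort_values(["name", "similarity"], ascending=[True, False])
      .groupby("name", observed=True)  # assumption: skip unused categories
      .agg(list)
)

print(df1[["ingredient", "similarity"]])
```

The group index comes out in category order, and within each group the lists are ordered by descending similarity. Rows with tied similarity may appear in either order, which the question allows.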

Another idea is to sort within each group:

df1 = (df.groupby('name', group_keys=False, sort=False)
         .apply(lambda x: x.sort_values('similarity', ascending=False))
         .groupby('name', sort=False).agg(list))
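
A variant of the same idea that does not need `apply` may be worth considering, since recent pandas releases have deprecated operating on the grouping columns inside `groupby.apply`. The sketch below is my own and not part of the answer: it sorts the whole frame by similarity with a stable sort, groups without sorting, and then restores the original name order with `reindex`. The sample frame is trimmed to three columns, and `original_order` is a name introduced here for illustration.

```python
import pandas as pd

# Trimmed sample data retyped from the question.
df = pd.DataFrame({
    "name": ["pesto", "pesto", "pesto", "pasta", "pasta", "egg"],
    "ingredient": ["ba", "sa", "li", "fl", "wa", "eg"],
    "similarity": [0.93, 0.93, 0.99, 0.88, 0.93, 1.0],
})

# Remember the order in which names first appear.
original_order = df["name"].unique()

df1 = (
    # Sort all rows by similarity; groupby keeps this row order inside groups.
    df.sort_values("similarity", ascending=False, kind="mergesort")
      # sort=False: groups appear in order of first occurrence after sorting.
      .groupby("name", sort=False)
      .agg(list)
      # Put the groups back into the original order of appearance.
      .reindex(original_order)
)

print(df1)
```

This relies on `groupby` preserving the row order within each group, so a single global sort is enough to rank the rows of every name.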

If the names can simply be sorted, for example in descending order:

df2 = (df.sort_values(['name','similarity'], ascending=False)
         .groupby('name', sort=False)
         .agg(list))

print (df2)
         ingredient     colour          similarity        ids          city  \
name                                                                          
pesto  [li, ba, sa]  [y, g, p]  [0.99, 0.93, 0.93]  [6, 4, 3]  [lo, ve, to]   
pasta      [wa, fl]     [b, w]        [0.93, 0.88]     [1, 2]      [da, de]   
egg            [eg]        [w]               [1.0]        [5]          [ro]   

            country               proba  
name                                     
pesto  [en, it, ca]  [0.81, 0.85, 0.92]  
pasta      [te, in]         [0.84, 0.8]  
egg            [ja]              [0.99]  
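
The desired output in the question shows name as an ordinary column rather than as the index. A minimal sketch of one way to get that layout is to append `reset_index()`. It assumes a fresh frame in which name is still a plain string column; if name had already been converted to the ordered categorical from the first approach, a descending sort would follow the category order instead of alphabetical order. The sample frame is trimmed to four columns.

```python
import pandas as pd

# Trimmed sample data retyped from the question; name is a plain string column.
df = pd.DataFrame({
    "name": ["pesto", "pesto", "pesto", "pasta", "pasta", "egg"],
    "ingredient": ["ba", "sa", "li", "fl", "wa", "eg"],
    "similarity": [0.93, 0.93, 0.99, 0.88, 0.93, 1.0],
    "ids": [4, 3, 6, 2, 1, 5],
})

df2 = (
    # Descending on both keys: names in reverse alphabetical order,
    # highest similarity first within each name.
    df.sort_values(["name", "similarity"], ascending=False)
      .groupby("name", sort=False)
      .agg(list)
      # Move name from the index back into a regular column.
      .reset_index()
)

print(df2)
```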

【Comments】:

  • I was also thinking along the same lines of sorting by name and similarity; pd.Categorical is a nice approach!