How to calculate averages in a faster way

【Question title】: How to calculate averages in a faster way
【Posted】: 2020-10-20 08:14:12
【Question】:

Here is a snippet of my dataset:

 userId  movieId  rating            timestamp
  97809        1     3.0  2008-06-11 04:47:11
 106140        1     5.0  2013-01-29 03:33:49
 106138        1     3.0  2002-07-31 15:48:53
  70354        1     4.5  2011-02-13 18:55:40
  70355        1     3.5  2008-01-26 16:56:54
  70356        1     3.0  2012-11-01 16:34:45
  31554        1     4.0  1999-08-24 17:23:39
 117716        1     4.0  2001-03-28 07:20:04
  70358        1     3.0  2007-01-27 16:17:11
  70360        1     5.0  1997-03-16 20:52:42
  98815        1     5.0  2009-10-02 05:01:51
 106137        1     3.5  2006-06-03 11:32:48
  98816        1     4.0  1998-07-29 17:31:21
  18998        1     3.5  2010-07-10 23:28:11
  85495        1     4.0  2014-11-11 00:51:07
  40850        1     1.5  2003-10-05 02:11:50
  85494        1     5.0  2011-02-09 22:59:27
  31556        1     4.5  2011-12-18 05:51:59
  70366        1     3.0  1996-12-26 06:00:06
  12176        1     4.0  1997-07-13 20:12:56

Each movieId has several rows, with different ratings given by different userIds. I want to get the average rating for each movieId.

Here is what I tried:

rat_1 = pd.DataFrame()

for i in range(0,len(k)): # k is a list containing all the unique movieIds

    # Taking a subset of the original dataframe containing rows only of the specified movieId
    rat_2 = rating[rating['movieId']==k[i]]

    rat_2['rating']=sum(rat_2['rating'])/len(rat_2) # Calculating average rating

    rat_1 = pd.concat([rat_1,rat_2]) # Appending the subset dataframe to a new dataframe

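However, the file is quite large (around 660 MB), so the code takes too long to run. Is there a faster way to do this?
Thanks in advance!
P.S. This is my first time posting a question here, so I apologize if my question is not clear enough.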

【Question comments】:

Does this answer your question? pandas get column average/mean with round value

Tags: python pandas

【Solution 1】:

You should use groupby and mean.

df.groupby("movieId")['rating'].mean()
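
As a minimal sketch of what this returns, assuming a DataFrame with the question's column names (the five rows and the avg_rating column name below are made up for illustration):

```python
import pandas as pd

# Illustrative rows only; the column names follow the question's dataset.
df = pd.DataFrame({
    "userId": [97809, 106140, 70354, 31554, 40850],
    "movieId": [1, 1, 2, 2, 2],
    "rating": [3.0, 5.0, 4.5, 4.0, 2.0],
})

# One mean per movieId; the result is a Series indexed by movieId.
avg = df.groupby("movieId")["rating"].mean()

# reset_index() turns the Series into a two-column DataFrame,
# which can be easier to merge or export.
avg_df = avg.reset_index(name="avg_rating")
print(avg_df)
```

The result has one row per movie rather than one row per rating, so it is much smaller than the original data.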

【Comments】:

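【Solution 2】:

If you only want the average ratings, @taha's answer is all you need. If you want the average attached to every record, I think the following will do it.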

import io

import pandas as pd

data = '''
id userid movieid rating timestamp
1 123 1 3.0 "2020-01-01 00:00:00"
2 121 1 4.0 "2020-01-01 00:00:00"
3 133 1 2.0 "2020-01-01 00:00:00"
4 144 2 1.0 "2020-01-01 00:00:00"
5 145 3 5.0 "2020-01-01 00:00:00"
6 167 3 3.5 "2020-01-01 00:00:00"
7 169 2 2.5 "2020-01-01 00:00:00"
8 254 1 4.5 "2020-01-01 00:00:00"
9 434 2 4.0 "2020-01-01 00:00:00"
10 534 3 3.5 "2020-01-01 00:00:00"
'''

df = pd.read_csv(io.StringIO(data), sep=r'\s+', index_col=0)

# transform('mean') returns one value per original row, so it can be stored as a new column
df['rating_mean'] = df.groupby(['movieid'])['rating'].transform('mean')

df
    userid  movieid  rating            timestamp  rating_mean
id
1      123        1     3.0  2020-01-01 00:00:00        3.375
2      121        1     4.0  2020-01-01 00:00:00        3.375
3      133        1     2.0  2020-01-01 00:00:00        3.375
4      144        2     1.0  2020-01-01 00:00:00        2.500
5      145        3     5.0  2020-01-01 00:00:00        4.000
6      167        3     3.5  2020-01-01 00:00:00        4.000
7      169        2     2.5  2020-01-01 00:00:00        2.500
8      254        1     4.5  2020-01-01 00:00:00        3.375
9      434        2     4.0  2020-01-01 00:00:00        2.500
10     534        3     3.5  2020-01-01 00:00:00        4.000
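
The loop in the question overwrote the rating column itself with the per-movie mean. Here is a sketch of getting that result without a loop, assuming the question's variable and column names (the rows are made up for illustration):

```python
import pandas as pd

# Illustrative rows only; names follow the question's code.
rating = pd.DataFrame({
    "userId": [1, 2, 3, 4, 5],
    "movieId": [10, 20, 10, 20, 10],
    "rating": [3.0, 4.0, 5.0, 2.0, 4.0],
})

# transform("mean") broadcasts each movie's mean back to its rows.
# assign() returns a new DataFrame, so the original ratings stay untouched.
rat_1 = rating.assign(
    rating=rating.groupby("movieId")["rating"].transform("mean")
)
print(rat_1)
```

Rows keep their original order here, whereas the loop's result is ordered by the movieIds in k.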

【Comments】:

【Solution 3】:

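Just to clarify why iterating over all the movies is slow: for loops in Python run in the interpreter, so every iteration carries overhead, and in your code each iteration also filters the whole DataFrame while pd.concat copies everything accumulated so far. So you should use groupby and mean, as @taha replied, because these operations are already optimized.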
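
A rough way to see the difference yourself is to time both approaches. This is a sketch on synthetic data made up for the purpose (200,000 ratings over 500 movies), not on the question's file, and the timings will vary by machine:

```python
import time

import numpy as np
import pandas as pd

# Synthetic data, made up for this sketch.
rng = np.random.default_rng(0)
rating = pd.DataFrame({
    "movieId": rng.integers(1, 501, size=200_000),
    "rating": rng.integers(1, 11, size=200_000) / 2,  # 0.5 to 5.0 in half steps
})

# Loop version: every iteration builds a boolean mask over the whole DataFrame.
start = time.perf_counter()
loop_means = {}
for movie_id in rating["movieId"].unique():
    subset = rating[rating["movieId"] == movie_id]
    loop_means[movie_id] = subset["rating"].mean()
loop_seconds = time.perf_counter() - start

# groupby version: the grouping and averaging run in pandas' compiled code.
start = time.perf_counter()
grouped_means = rating.groupby("movieId")["rating"].mean()
groupby_seconds = time.perf_counter() - start

print(f"loop: {loop_seconds:.3f}s, groupby: {groupby_seconds:.3f}s")
```

Both versions compute the same means; only the amount of work done per movie differs.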

【Comments】: