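Study notes index
Pandas study (1): Importing data
Pandas study (2): Double Color Ball lottery data analysis
Pandas study (3): NBA player salary analysis
Pandas study (4): Data normalization
Pandas study (5): Pandas learning videos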
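This article uses NBA player salary data to get further practice with pandas as a data-processing tool.
1. Fetching and saving the data
This article fetches the 2017-2018 salary of every NBA player from the ESPN salary pages (the URLs appear in the code). The code is as follows: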
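import pandas as pd

# The first page has no page number in its URL; pages 2 to 12 do.
url_list = ['http://www.espn.com/nba/salaries/_/seasontype/4']
for i in range(2, 13):
    url = 'http://www.espn.com/nba/salaries/_/page/%s/seasontype/4' % i
    url_list.append(url)

# read_html returns a list of DataFrames for each page; collect them all
# and concatenate once (DataFrame.append was removed in pandas 2.0).
tables = []
for url in url_list:
    tables.extend(pd.read_html(url))
data = pd.concat(tables, ignore_index=True)

# Keep only the player rows: their salary cell (column 3) starts with '$'.
data = data[data[3].astype(str).str.startswith('$')]
data.to_csv('NAB_salaries.csv', header=['RK', 'NAME', 'TEAM', 'SALARY'], index=False)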
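To make the row filter in the code above easier to follow, here is a minimal offline sketch. It assumes the scraped table repeats its header row inside the body (that is an assumption about the page layout, not something verified here); the `scraped` frame is hand-built for illustration, and its three player rows are taken from the top-ten listing in this article.

```python
import pandas as pd

# Hypothetical scraped rows: the header row reappears inside the table body.
scraped = pd.DataFrame([
    ['RK', 'NAME', 'TEAM', 'SALARY'],
    ['1', 'Stephen Curry, PG', 'Golden State Warriors', '$34,382,550'],
    ['2', 'LeBron James, SF', 'Cleveland Cavaliers', '$33,285,709'],
    ['RK', 'NAME', 'TEAM', 'SALARY'],
    ['3', 'Paul Millsap, PF', 'Denver Nuggets', '$31,269,231'],
])

# Player rows are the ones whose salary cell (column 3) starts with '$'.
is_player = scraped[3].astype(str).str.startswith('$')
players = scraped[is_player].reset_index(drop=True)
players.columns = ['RK', 'NAME', 'TEAM', 'SALARY']
print(players)
```

Filtering on the salary cell avoids having to know where the repeated header rows fall on each page.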
The ten highest-paid players in the fetched data are:
   RK  NAME                   TEAM                   SALARY
0  1   Stephen Curry, PG      Golden State Warriors  $34,382,550
1  2   LeBron James, SF       Cleveland Cavaliers    $33,285,709
2  3   Paul Millsap, PF       Denver Nuggets         $31,269,231
3  4   Gordon Hayward, SF     Boston Celtics         $29,727,900
4  5   Blake Griffin, PF      LA Clippers            $29,512,900
5  6   Kyle Lowry, PG         Toronto Raptors        $28,703,704
6  7   Mike Conley, PG        Memphis Grizzlies      $28,530,608
7  8   Russell Westbrook, PG  Oklahoma City Thunder  $28,530,608
8  9   James Harden, SG       Houston Rockets        $28,299,399
9  10  DeMar DeRozan, SG      Toronto Raptors        $27,739,975
2. Analyzing the data
2.1 Total player salary per team
# -*- coding: utf-8 -*-
import pandas as pd
import matplotlib.pyplot as plt

# '$34,382,550' -> '34382550' (keep the digits only)
money2int = lambda x: "".join(filter(str.isdigit, x))
# 'Golden State Warriors' -> 'Warriors' (last word of the team name)
team_name = lambda x: x.split()[-1]

salary = pd.read_csv('./NAB_salaries.csv', usecols=['NAME', 'TEAM', 'SALARY'],
                     converters={'SALARY': money2int, 'TEAM': team_name})
# np.int was removed in NumPy 1.24; the built-in int does the same job here.
salary['SALARY'] = salary['SALARY'].astype(int)

# Sum only the SALARY column so the NAME strings are not aggregated.
salary = salary.groupby('TEAM', as_index=False)['SALARY'].sum()
salary_sorted = salary.sort_values('SALARY', ascending=False)
salary_sorted.index = salary_sorted['TEAM']
salary_sorted.plot(kind='bar', align='center', title='Total player salary per team ($)')
plt.xlabel('Team')
plt.ylabel('Total player salary')
plt.show()
    TEAM       SALARY
1   Blazers    134302107
4   Cavaliers  132016201
28  Warriors   128211882
11  Jazz       122981295
10  Hornets    121972410
The totals show that the Blazers (Portland Trail Blazers) spend the most on player salaries.
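The converter-plus-groupby pipeline can be tried without the saved file. The sketch below is only an illustration: it reads three rows (copied from the top-ten listing earlier in this article) from an in-memory string, and the helper names `money_to_int` and `short_team` are made up here. Unlike the lambdas in the article, `money_to_int` returns an `int` directly, so no later `astype` call is needed.

```python
import io
import pandas as pd

# Three rows from the top-ten listing, written in the saved CSV layout.
csv_text = '''RK,NAME,TEAM,SALARY
1,"Stephen Curry, PG",Golden State Warriors,"$34,382,550"
6,"Kyle Lowry, PG",Toronto Raptors,"$28,703,704"
10,"DeMar DeRozan, SG",Toronto Raptors,"$27,739,975"
'''

def money_to_int(text):
    # '$34,382,550' -> 34382550
    return int(''.join(filter(str.isdigit, text)))

def short_team(text):
    # 'Golden State Warriors' -> 'Warriors'
    return text.split()[-1]

sample = pd.read_csv(io.StringIO(csv_text), usecols=['TEAM', 'SALARY'],
                     converters={'SALARY': money_to_int, 'TEAM': short_team})

# Sum salaries per team, largest total first.
totals = sample.groupby('TEAM')['SALARY'].sum().sort_values(ascending=False)
print(totals)
```

With only these three rows the Raptors come out on top, which is an effect of the tiny sample and says nothing about the full ranking.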
2.2 Salary distribution across several teams
# -*- coding: utf-8 -*-
import pandas as pd
import matplotlib.pyplot as plt

money2int = lambda x: "".join(filter(str.isdigit, x))  # keep the digits only
team_name = lambda x: x.split()[-1]                    # last word of the team name
get_name = lambda x: x.split(',')[0]                   # drop the ', PG' position suffix

salary = pd.read_csv('./NAB_salaries.csv', usecols=['NAME', 'TEAM', 'SALARY'],
                     converters={'SALARY': money2int, 'NAME': get_name, 'TEAM': team_name})
salary['SALARY'] = salary['SALARY'].astype(int)  # np.int was removed in NumPy 1.24

data = pd.DataFrame({"Cavaliers": salary[salary['TEAM'] == 'Cavaliers']['SALARY'],
                     "Warriors": salary[salary['TEAM'] == 'Warriors']['SALARY'],
                     "Rockets": salary[salary['TEAM'] == 'Rockets']['SALARY'],
                     "Lakers": salary[salary['TEAM'] == 'Lakers']['SALARY']})
# Each Series keeps its original row index, so pandas aligns them and the combined
# frame contains many NaN values; they are ignored automatically when plotting.
# I did not find a better way to combine them, because the teams have different numbers of players.
# If the sizes were equal, each column could be converted to a list first and the
# DataFrame would then contain no NaN.
data.boxplot()
plt.ylabel("Player salary ($)")
plt.xlabel("Team")
plt.show()
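This example picks the Cavaliers, Lakers, Rockets, and Warriors and draws a box plot of each team's player salaries. The plot shows that Lakers salaries are relatively even, while Warriors salaries span a much wider range.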
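One possible way around the NaN padding mentioned in the code comments is to keep the data in long format (one row per player) and group by team instead of building one column per team. The sketch below uses made-up salaries purely for illustration; they are not the ESPN figures.

```python
import pandas as pd

# Made-up salaries for illustration only; not the real ESPN figures.
salary = pd.DataFrame({
    'TEAM': ['Lakers', 'Lakers', 'Lakers', 'Warriors', 'Warriors', 'Rockets'],
    'SALARY': [5_000_000, 6_000_000, 7_000_000, 1_000_000, 30_000_000, 9_000_000],
})

teams = ['Lakers', 'Warriors']
subset = salary[salary['TEAM'].isin(teams)]

# One row per player, so teams of different sizes need no NaN padding.
spread = subset.groupby('TEAM')['SALARY'].agg(['count', 'min', 'median', 'max'])
print(spread)

# With matplotlib installed, subset.boxplot(column='SALARY', by='TEAM')
# draws one box per team from this same long-format frame.
```

The `by` argument of `DataFrame.boxplot` does the grouping itself, so the wide frame with one column per team is not required for the plot.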
2.3 Salary by player position
# -*- coding: utf-8 -*-
import pandas as pd

money2int = lambda x: "".join(filter(str.isdigit, x))  # keep the digits only
team_name = lambda x: x.split()[-1]                    # last word of the team name

salary = pd.read_csv('./NAB_salaries.csv', usecols=['NAME', 'TEAM', 'SALARY'],
                     converters={'SALARY': money2int, 'TEAM': team_name})
salary['SALARY'] = salary['SALARY'].astype(int)  # np.int was removed in NumPy 1.24

# The raw NAME column looks like 'Stephen Curry, PG'.
# The next three lines split it into a NAME column ('Stephen Curry') and a POSITION column ('PG').
salary.insert(1, 'POSITION', salary['NAME'])
salary['NAME'] = salary['NAME'].map(lambda x: x.split(',')[0])
salary['POSITION'] = salary['POSITION'].map(lambda x: x.split(',')[1].strip())

# C:  Center
# PF: Power Forward
# SF: Small Forward
# SG: Shooting Guard
# PG: Point Guard
# print(salary.groupby('POSITION')['SALARY'].sum())       # total salary per position
# print(salary.groupby('POSITION')['SALARY'].describe())  # summary statistics per position
print(salary.groupby('POSITION')['SALARY'].mean())
Below is the average salary for each position; SF (small forward) has the highest average.
C   7808847
F   2770083
G   1685802
PF  6278746
PG  7112007
SF  7886812
SG  6589922
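An average is easier to interpret next to the number of players behind it, since a label with only a few players can pull the picture in either direction. The sketch below shows one way to report both; the rows are invented for illustration and do not come from the ESPN data.

```python
import pandas as pd

# Invented rows for illustration only; not taken from the ESPN data.
players = pd.DataFrame({
    'POSITION': ['PG', 'PG', 'SF', 'SF', 'G'],
    'SALARY': [10_000_000, 4_000_000, 12_000_000, 6_000_000, 1_500_000],
})

# Mean salary and head count per position, highest mean first.
summary = (players.groupby('POSITION')['SALARY']
           .agg(['mean', 'count'])
           .sort_values('mean', ascending=False))
print(summary)
```

Running the same aggregation on the real `salary` frame would show how many players carry each label, including the generic F and G ones.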
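Much more can be mined from this dataset. Implementing each of these features made me more familiar with pandas and is good preparation for deeper data-analysis study.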