Introduction
Pandas is a data-processing tool built on top of NumPy, created to support data-analysis tasks. It incorporates a large number of libraries and some standard data models, and provides the functions and methods needed to operate efficiently on large datasets.
Pandas data structures: Pandas mainly provides Series (1-D arrays), DataFrame (2-D arrays), Panel (3-D arrays), Panel4D (4-D arrays), PanelND (higher-dimensional arrays), and other structures. Of these, Series and DataFrame are by far the most widely used.
- A Series is a one-dimensional labeled array that can hold any data type: integers, strings, floating-point numbers, Python objects, and so on. Elements of a Series can be located by label.
- A DataFrame is a two-dimensional labeled data structure. Data can be located by label, which plain NumPy does not support.
Key Points
- Creating a Series
- Basic Series operations
- Creating a DataFrame
- Basic DataFrame operations
- DataFrame file operations
- Series, DataFrame, and multi-level indexes
- Pivot tables
- Data cleaning
- Data preprocessing
- Visualization
Environment
Basics
Importing the Pandas module
- Import Pandas
Before practicing with Pandas, first import the Pandas module, conventionally abbreviated as pd.
import pandas as pd
- Check the Pandas version
pd.__version__
'0.23.4'
Creating the Series data type
In Pandas, a Series can be viewed as a dataset consisting of a single column of data.
- Create a Series from a list
arr = [0,1,2,3,4]
s1 = pd.Series(arr)
s1
print("Note: the first column (0-4) is the index of this Series; the second column (0-4) holds its values.")
Note: the first column (0-4) is the index of this Series; the second column (0-4) holds its values.
- Create a Series from an ndarray
import numpy as np
n = np.random.randn(5)  # create a random ndarray
index = ["a","b","c","d","e"]
s2 = pd.Series(n,index)
s2
a -0.592792
b 2.177081
c 2.292780
d -0.571098
e -1.295833
dtype: float64
- Create a Series from a dict
d= {"a":1, "b":2, "c":3,"d":4,"e":5 }
s3 = pd.Series(d)
s3
a 1
b 2
c 3
d 4
e 5
dtype: int64
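The three creation paths above can be combined into one minimal, self-contained sketch (variable names here are illustrative, not from the original):

```python
import numpy as np
import pandas as pd

# From a list: a default integer index 0..n-1 is generated
s_list = pd.Series([0, 1, 2, 3, 4])

# From an ndarray, with an explicit label index
s_arr = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])

# From a dict: the keys become the index
s_dict = pd.Series({"a": 1, "b": 2, "c": 3})
```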
Basic Series operations
- Modify the Series index
s1
0 0
1 1
2 2
3 3
4 4
dtype: int64
s1.index = ["a","b","c","d","e"]
s1
a 0
b 1
c 2
d 3
e 4
dtype: int64
- Concatenate Series vertically
s2
a -0.592792
b 2.177081
c 2.292780
d -0.571098
e -1.295833
dtype: float64
s4 = s3.append(s2)  # append s2 to s3 (in newer pandas, prefer pd.concat([s3, s2]))
s4
a 1.000000
b 2.000000
c 3.000000
d 4.000000
e 5.000000
a -0.592792
b 2.177081
c 2.292780
d -0.571098
e -1.295833
dtype: float64
- Drop Series elements by index
s4
a 1.000000
b 2.000000
c 3.000000
d 4.000000
e 5.000000
a -0.592792
b 2.177081
c 2.292780
d -0.571098
e -1.295833
dtype: float64
s4.drop("e")  # drop the values at index e (returns a new Series; s4 is unchanged)
a 1.000000
b 2.000000
c 3.000000
d 4.000000
a -0.592792
b 2.177081
c 2.292780
d -0.571098
dtype: float64
- Modify Series elements by index
s4["a"] = 6
s4
a 6.000000
b 2.000000
c 3.000000
d 4.000000
e 5.000000
a 6.000000
b 2.177081
c 2.292780
d -0.571098
e -1.295833
dtype: float64
- Look up Series elements by index
s4["a"]
a 6.0
a 6.0
dtype: float64
- Series slicing
s4[:3]  # access the first 3 values of s4
a 6.0
b 2.0
c 3.0
dtype: float64
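One subtlety worth noting with slices like `s4[:3]`: plain integer slices are position-based and exclude the endpoint, while label slices via `.loc` include it. A small sketch (names here are illustrative):

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

head = s[:2]          # positions 0-1: values 10, 20
mid = s.loc["b":"d"]  # labels b through d, endpoint included: 20, 30, 40
```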
Series arithmetic
- Series addition
# Series addition aligns on the index; labels present on only one side yield NaN.
s4,s3
(a 6.000000
b 2.000000
c 3.000000
d 4.000000
e 5.000000
a 6.000000
b 2.177081
c 2.292780
d -0.571098
e -1.295833
dtype: float64, a 1
b 2
c 3
d 4
e 5
dtype: int64)
s4.add(s3)
a 7.000000
a 7.000000
b 4.000000
b 4.177081
c 6.000000
c 5.292780
d 8.000000
d 3.428902
e 10.000000
e 3.704167
dtype: float64
- Series subtraction
# Series subtraction aligns on the index; non-matching labels yield NaN.
s4.sub(s3)
a 5.000000
a 5.000000
b 0.000000
b 0.177081
c 0.000000
c -0.707220
d 0.000000
d -4.571098
e 0.000000
e -6.295833
dtype: float64
- Series multiplication
# Series multiplication aligns on the index; non-matching labels yield NaN.
s4.mul(s3)
a 6.000000
a 6.000000
b 4.000000
b 4.354163
c 9.000000
c 6.878340
d 16.000000
d -2.284393
e 25.000000
e -6.479165
dtype: float64
- Series division
# Series division aligns on the index; non-matching labels yield NaN.
s4.div(s3)
a 6.000000
a 6.000000
b 1.000000
b 1.088541
c 1.000000
c 0.764260
d 1.000000
d -0.142775
e 1.000000
e -0.259167
dtype: float64
- Series median
s4.median()
2.646389957854695
- Series sum
s4.sum()
28.60293011299616
- Series maximum
s4.max()
6.0
- Series minimum
s4.min()
-1.2958329899706196
Creating the DataFrame data type
Unlike a Series, a DataFrame can hold multiple columns. In practice, DataFrame is also the more commonly used of the two.
- Create a DataFrame from a NumPy array
dates = pd.date_range("today", periods=6, freq="D")  # a time series to use as the index
num_arr = np.random.randn(6, 4)  # a random NumPy array of values
columns = ["A", "B", "C", "D"]  # a list to use as column names
df1 = pd.DataFrame(num_arr,index = dates,columns=columns)
df1
| | A | B | C | D |
| --- | --- | --- | --- | --- |
| 2019-03-26 15:44:37.283176 | -0.379061 | -0.905665 | -0.241411 | -0.058220 |
| 2019-03-27 15:44:37.283176 | 0.399333 | -1.135517 | -1.946507 | -1.126837 |
| 2019-03-28 15:44:37.283176 | -1.127666 | 0.268617 | -0.974614 | 0.151761 |
| 2019-03-29 15:44:37.283176 | -0.521538 | -0.644654 | 0.045048 | -1.206884 |
| 2019-03-30 15:44:37.283176 | -0.419905 | -0.145953 | 1.989872 | -0.794930 |
| 2019-03-31 15:44:37.283176 | 1.703039 | 1.587834 | 0.867288 | -1.096028 |
- Create a DataFrame from a dict of arrays
data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df2 = pd.DataFrame(data, index=labels)
df2
| | animal | age | visits | priority |
| --- | --- | --- | --- | --- |
| a | cat | 2.5 | 1 | yes |
| b | cat | 3.0 | 3 | yes |
| c | snake | 0.5 | 2 | no |
| d | dog | NaN | 3 | yes |
| e | dog | 5.0 | 2 | no |
| f | cat | 2.0 | 3 | no |
| g | snake | 4.5 | 1 | no |
| h | cat | NaN | 1 | yes |
| i | dog | 7.0 | 2 | no |
| j | dog | 3.0 | 1 | no |
- View the DataFrame's data types
df2.dtypes
animal object
age float64
visits int64
priority object
dtype: object
Basic DataFrame operations
- Preview the first 5 rows of a DataFrame
df2.head()  # shows 5 rows by default; pass a number to preview a different count
| | animal | age | visits | priority |
| --- | --- | --- | --- | --- |
| a | cat | 2.5 | 1 | yes |
| b | cat | 3.0 | 3 | yes |
| c | snake | 0.5 | 2 | no |
| d | dog | NaN | 3 | yes |
| e | dog | 5.0 | 2 | no |
- View the last 3 rows of a DataFrame
df2.tail(3)
| | animal | age | visits | priority |
| --- | --- | --- | --- | --- |
| h | cat | NaN | 1 | yes |
| i | dog | 7.0 | 2 | no |
| j | dog | 3.0 | 1 | no |
- View the DataFrame index
df2.index
Index(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'], dtype='object')
- View the DataFrame column names
df2.columns
Index(['animal', 'age', 'visits', 'priority'], dtype='object')
- View the DataFrame values
df2.values
array([['cat', 2.5, 1, 'yes'],
['cat', 3.0, 3, 'yes'],
['snake', 0.5, 2, 'no'],
['dog', nan, 3, 'yes'],
['dog', 5.0, 2, 'no'],
['cat', 2.0, 3, 'no'],
['snake', 4.5, 1, 'no'],
['cat', nan, 1, 'yes'],
['dog', 7.0, 2, 'no'],
['dog', 3.0, 1, 'no']], dtype=object)
- View DataFrame summary statistics
df2.describe()
| | age | visits |
| --- | --- | --- |
| count | 8.000000 | 10.000000 |
| mean | 3.437500 | 1.900000 |
| std | 2.007797 | 0.875595 |
| min | 0.500000 | 1.000000 |
| 25% | 2.375000 | 1.000000 |
| 50% | 3.000000 | 2.000000 |
| 75% | 4.625000 | 2.750000 |
| max | 7.000000 | 3.000000 |
- Sort a DataFrame by column
df2.sort_values(by="age")  # sort by age, ascending by default
| | animal | age | visits | priority |
| --- | --- | --- | --- | --- |
| c | snake | 0.5 | 2 | no |
| f | cat | 2.0 | 3 | no |
| a | cat | 2.5 | 1 | yes |
| b | cat | 3.0 | 3 | yes |
| j | dog | 3.0 | 1 | no |
| g | snake | 4.5 | 1 | no |
| e | dog | 5.0 | 2 | no |
| i | dog | 7.0 | 2 | no |
| d | dog | NaN | 3 | yes |
| h | cat | NaN | 1 | yes |
- Slice DataFrame rows
df2[1:3]
| | animal | age | visits | priority |
| --- | --- | --- | --- | --- |
| b | cat | 3.0 | 3 | yes |
| c | snake | 0.5 | 2 | no |
- Query a DataFrame by label (single column)
df2.age
a 2.5
b 3.0
c 0.5
d NaN
e 5.0
f 2.0
g 4.5
h NaN
i 7.0
j 3.0
Name: age, dtype: float64
df2["age"]
a 2.5
b 3.0
c 0.5
d NaN
e 5.0
f 2.0
g 4.5
h NaN
i 7.0
j 3.0
Name: age, dtype: float64
- Query a DataFrame by label (multiple columns)
df2[["age","animal"]]  # pass a list of column names
| | age | animal |
| --- | --- | --- |
| a | 2.5 | cat |
| b | 3.0 | cat |
| c | 0.5 | snake |
| d | NaN | dog |
| e | 5.0 | dog |
| f | 2.0 | cat |
| g | 4.5 | snake |
| h | NaN | cat |
| i | 7.0 | dog |
| j | 3.0 | dog |
- Query a DataFrame by position
df2.iloc[1:3]  # query rows 2-3
| | animal | age | visits | priority |
| --- | --- | --- | --- | --- |
| b | cat | 3.0 | 3 | yes |
| c | snake | 0.5 | 2 | no |
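The difference between label-based and position-based row selection is easy to trip over: `.loc` slices by label and includes the endpoint, while `.iloc` slices by position and excludes it. A small sketch (data is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]}, index=["a", "b", "c"])

by_label = df.loc["a":"b"]  # label slice, endpoint included: rows a and b
by_pos = df.iloc[0:1]       # position slice, endpoint excluded: row a only
```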
- Copy a DataFrame
# make a copy of the DataFrame so the dataset can be used by several independent workflows
df5 = df2.copy()
df5
| | animal | age | visits | priority |
| --- | --- | --- | --- | --- |
| a | cat | 2.5 | 1 | yes |
| b | cat | 3.0 | 3 | yes |
| c | snake | 0.5 | 2 | no |
| d | dog | NaN | 3 | yes |
| e | dog | 5.0 | 2 | no |
| f | cat | 2.0 | 3 | no |
| g | snake | 4.5 | 1 | no |
| h | cat | NaN | 1 | yes |
| i | dog | 7.0 | 2 | no |
| j | dog | 3.0 | 1 | no |
- Test DataFrame elements for missing values
df5.isnull()  # returns True where a value is missing
| | animal | age | visits | priority |
| --- | --- | --- | --- | --- |
| a | False | False | False | False |
| b | False | False | False | False |
| c | False | False | False | False |
| d | False | True | False | False |
| e | False | False | False | False |
| f | False | False | False | False |
| g | False | False | False | False |
| h | False | True | False | False |
| i | False | False | False | False |
| j | False | False | False | False |
- Add a column
num = pd.Series([0,1,2,3,4,5,6,7,8,9],index = df5.index)
df5["No."] = num
df5
| | animal | age | visits | priority | No. |
| --- | --- | --- | --- | --- | --- |
| a | cat | 2.5 | 1 | yes | 0 |
| b | cat | 3.0 | 3 | yes | 1 |
| c | snake | 0.5 | 2 | no | 2 |
| d | dog | NaN | 3 | yes | 3 |
| e | dog | 5.0 | 2 | no | 4 |
| f | cat | 2.0 | 3 | no | 5 |
| g | snake | 4.5 | 1 | no | 6 |
| h | cat | NaN | 1 | yes | 7 |
| i | dog | 7.0 | 2 | no | 8 |
| j | dog | 3.0 | 1 | no | 9 |
- Modify a value by DataFrame position
# change the value at row 2, column 1 from 'cat' to 2
df5.iat[1,0] =2
df5
| | animal | age | visits | priority | No. |
| --- | --- | --- | --- | --- | --- |
| a | cat | 2.5 | 1 | yes | 0 |
| b | 2 | 3.0 | 3 | yes | 1 |
| c | snake | 0.5 | 2 | no | 2 |
| d | dog | NaN | 3 | yes | 3 |
| e | dog | 5.0 | 2 | no | 4 |
| f | cat | 2.0 | 3 | no | 5 |
| g | snake | 4.5 | 1 | no | 6 |
| h | cat | NaN | 1 | yes | 7 |
| i | dog | 7.0 | 2 | no | 8 |
| j | dog | 3.0 | 1 | no | 9 |
- Modify data by DataFrame label
df5.loc["f","age"] = 1.5
df5
| | animal | age | visits | priority | No. |
| --- | --- | --- | --- | --- | --- |
| a | cat | 2.5 | 1 | yes | 0 |
| b | 2 | 3.0 | 3 | yes | 1 |
| c | snake | 0.5 | 2 | no | 2 |
| d | dog | NaN | 3 | yes | 3 |
| e | dog | 5.0 | 2 | no | 4 |
| f | cat | 1.5 | 3 | no | 5 |
| g | snake | 4.5 | 1 | no | 6 |
| h | cat | NaN | 1 | yes | 7 |
| i | dog | 7.0 | 2 | no | 8 |
| j | dog | 3.0 | 1 | no | 9 |
- DataFrame mean
df5.mean()
age 3.375
visits 1.900
No. 4.500
dtype: float64
- Sum any given DataFrame column
df5["visits"].sum()
19
String operations
- Convert strings to lowercase
string = pd.Series(["A",'B','C','Aaba', 'Baca',np.nan, 'CABA', 'dog', 'cat'])
string
0 A
1 B
2 C
3 Aaba
4 Baca
5 NaN
6 CABA
7 dog
8 cat
dtype: object
string.str.lower()
0 a
1 b
2 c
3 aaba
4 baca
5 NaN
6 caba
7 dog
8 cat
dtype: object
- Convert strings to uppercase
string
0 A
1 B
2 C
3 Aaba
4 Baca
5 NaN
6 CABA
7 dog
8 cat
dtype: object
string.str.upper()
0 A
1 B
2 C
3 AABA
4 BACA
5 NaN
6 CABA
7 DOG
8 CAT
dtype: object
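As the outputs above show, the vectorized `.str` methods propagate NaN instead of raising on missing entries. A minimal sketch of this behavior (data is illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series(["Hello", "WORLD", np.nan])

lower = s.str.lower()  # NaN entries pass through unchanged
lens = s.str.len()     # element-wise string length; NaN stays NaN
```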
DataFrame missing-value operations
- Fill missing values
df4 = df5.copy()
df4
| | animal | age | visits | priority | No. |
| --- | --- | --- | --- | --- | --- |
| a | cat | 2.5 | 1 | yes | 0 |
| b | 2 | 3.0 | 3 | yes | 1 |
| c | snake | 0.5 | 2 | no | 2 |
| d | dog | NaN | 3 | yes | 3 |
| e | dog | 5.0 | 2 | no | 4 |
| f | cat | 1.5 | 3 | no | 5 |
| g | snake | 4.5 | 1 | no | 6 |
| h | cat | NaN | 1 | yes | 7 |
| i | dog | 7.0 | 2 | no | 8 |
| j | dog | 3.0 | 1 | no | 9 |
df4.fillna(value = 3)
| | animal | age | visits | priority | No. |
| --- | --- | --- | --- | --- | --- |
| a | cat | 2.5 | 1 | yes | 0 |
| b | 2 | 3.0 | 3 | yes | 1 |
| c | snake | 0.5 | 2 | no | 2 |
| d | dog | 3.0 | 3 | yes | 3 |
| e | dog | 5.0 | 2 | no | 4 |
| f | cat | 1.5 | 3 | no | 5 |
| g | snake | 4.5 | 1 | no | 6 |
| h | cat | 3.0 | 1 | yes | 7 |
| i | dog | 7.0 | 2 | no | 8 |
| j | dog | 3.0 | 1 | no | 9 |
- Drop rows containing missing values
df6 = df4.copy()
print(df5)
df5.dropna(how="any")  # drop every row that contains a NaN
animal age visits priority No.
a cat 2.5 1 yes 0
b 2 3.0 3 yes 1
c snake 0.5 2 no 2
d dog NaN 3 yes 3
e dog 5.0 2 no 4
f cat 1.5 3 no 5
g snake 4.5 1 no 6
h cat NaN 1 yes 7
i dog 7.0 2 no 8
j dog 3.0 1 no 9
| | animal | age | visits | priority | No. |
| --- | --- | --- | --- | --- | --- |
| a | cat | 2.5 | 1 | yes | 0 |
| b | 2 | 3.0 | 3 | yes | 1 |
| c | snake | 0.5 | 2 | no | 2 |
| e | dog | 5.0 | 2 | no | 4 |
| f | cat | 1.5 | 3 | no | 5 |
| g | snake | 4.5 | 1 | no | 6 |
| i | dog | 7.0 | 2 | no | 8 |
| j | dog | 3.0 | 1 | no | 9 |
- Align DataFrames on a specified column
left = pd.DataFrame({"key":["foo1","foo2"],"one":[1,2]})
right = pd.DataFrame({"key":["foo2","foo3"],"two":[4,5]})
print(right)
print(left)
key two
0 foo2 4
1 foo3 5
key one
0 foo1 1
1 foo2 2
pd.merge(left,right,on = "key")
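`pd.merge` defaults to an inner join, keeping only keys present in both frames; the `how` parameter selects other join types. A minimal sketch using the same data as above:

```python
import pandas as pd

left = pd.DataFrame({"key": ["foo1", "foo2"], "one": [1, 2]})
right = pd.DataFrame({"key": ["foo2", "foo3"], "two": [4, 5]})

inner = pd.merge(left, right, on="key")               # default how="inner": common keys only
outer = pd.merge(left, right, on="key", how="outer")  # union of keys; gaps become NaN
```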
DataFrame file operations
- Write to a CSV file
df5.to_csv('animal.csv')
print("Write succeeded.")
Write succeeded.
- Read from a CSV file
df_animal = pd.read_csv("animal.csv")
df_animal
| | Unnamed: 0 | animal | age | visits | priority | No. |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | a | cat | 2.5 | 1 | yes | 0 |
| 1 | b | 2 | 3.0 | 3 | yes | 1 |
| 2 | c | snake | 0.5 | 2 | no | 2 |
| 3 | d | dog | NaN | 3 | yes | 3 |
| 4 | e | dog | 5.0 | 2 | no | 4 |
| 5 | f | cat | 1.5 | 3 | no | 5 |
| 6 | g | snake | 4.5 | 1 | no | 6 |
| 7 | h | cat | NaN | 1 | yes | 7 |
| 8 | i | dog | 7.0 | 2 | no | 8 |
| 9 | j | dog | 3.0 | 1 | no | 9 |
- Write to an Excel file
df5.to_excel("animal.xlsx",sheet_name="Sheet1")
print("Write succeeded.")
Write succeeded.
- Read from an Excel file
pd.read_excel("animal.xlsx","Sheet1",index_col = None,na_values=["NA"])
| | animal | age | visits | priority | No. |
| --- | --- | --- | --- | --- | --- |
| a | cat | 2.5 | 1 | yes | 0 |
| b | 2 | 3.0 | 3 | yes | 1 |
| c | snake | 0.5 | 2 | no | 2 |
| d | dog | NaN | 3 | yes | 3 |
| e | dog | 5.0 | 2 | no | 4 |
| f | cat | 1.5 | 3 | no | 5 |
| g | snake | 4.5 | 1 | no | 6 |
| h | cat | NaN | 1 | yes | 7 |
| i | dog | 7.0 | 2 | no | 8 |
| j | dog | 3.0 | 1 | no | 9 |
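Note that the CSV round-trip above produced an extra "Unnamed: 0" column: `to_csv` writes the index as an unnamed first column, and `read_csv` does not restore it unless told to. A small sketch using an in-memory buffer (the buffer is illustrative; a filename works the same way):

```python
import io

import pandas as pd

df = pd.DataFrame({"a": [1, 2]}, index=["x", "y"])

buf = io.StringIO()
df.to_csv(buf)  # the index is written as the first, unnamed column
buf.seek(0)

# Without index_col, the saved index reappears as a column named "Unnamed: 0";
# index_col=0 restores it as the index instead.
restored = pd.read_csv(buf, index_col=0)
```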
Advanced
Time-series indexing
- Build a Series indexed by every day of January 2019, with random values
dti = pd.date_range(start = "2019-1-1",end = "2019-1-31",freq = "D")
s = pd.Series(np.random.rand(len(dti)),index =dti)
s
2019-01-01 0.867844
2019-01-02 0.255206
2019-01-03 0.520293
2019-01-04 0.256538
2019-01-05 0.071561
2019-01-06 0.037585
2019-01-07 0.111249
2019-01-08 0.656317
2019-01-09 0.094572
2019-01-10 0.807796
2019-01-11 0.546557
2019-01-12 0.229993
2019-01-13 0.872322
2019-01-14 0.205696
2019-01-15 0.602794
2019-01-16 0.043299
2019-01-17 0.589059
2019-01-18 0.616598
2019-01-19 0.661684
2019-01-20 0.899759
2019-01-21 0.358859
2019-01-22 0.573127
2019-01-23 0.336021
2019-01-24 0.061472
2019-01-25 0.667695
2019-01-26 0.874072
2019-01-27 0.402528
2019-01-28 0.868478
2019-01-29 0.903817
2019-01-30 0.020582
2019-01-31 0.999591
Freq: D, dtype: float64
- Sum the values of s that fall on Wednesdays
# Monday is weekday 0
s[s.index.weekday == 2].sum()
0.7496805102635341
- Compute the monthly mean of the values in s
s.resample("M").mean()
2019-01-31 0.484289
Freq: M, dtype: float64
- Resample times in a Series (seconds to minutes)
s = pd.date_range("today", periods=10, freq="S")  # 10 timestamps starting now, at 1-second frequency
ts = pd.Series(np.random.randint(0,500,len(s)),index =s)
ts.resample("Min").sum()
2019-03-27 09:18:00 3192
Freq: T, dtype: int32
- UTC, the world time standard
s = pd.date_range("today", periods=1, freq="D")  # get the current time
ts = pd.Series(np.random.randn(len(s)), s)  # random values
ts_utc = ts.tz_localize("UTC")  # localize to UTC
ts_utc
2019-03-27 09:23:38.287795+00:00 1.236968
Freq: D, dtype: float64
- Convert to the Shanghai time zone
ts_utc.tz_convert("Asia/Shanghai")
2019-03-27 17:23:38.287795+08:00 1.236968
Freq: D, dtype: float64
- Convert between time representations
rng = pd.date_range("2018-1-1",periods=5,freq="M")
ts = pd.Series(np.random.randn(len(rng)),index = rng)
print(ts)
ps = ts.to_period()
print(ps)
ps.to_timestamp()
2018-01-31 -0.616241
2018-02-28 2.232648
2018-03-31 0.712430
2018-04-30 -0.249444
2018-05-31 1.265058
Freq: M, dtype: float64
2018-01 -0.616241
2018-02 2.232648
2018-03 0.712430
2018-04 -0.249444
2018-05 1.265058
Freq: M, dtype: float64
2018-01-01 -0.616241
2018-02-01 2.232648
2018-03-01 0.712430
2018-04-01 -0.249444
2018-05-01 1.265058
Freq: MS, dtype: float64
Series with a MultiIndex
- Create a multi-index Series
Build a multi-index Series with letters = ['A', 'B', 'C'] and numbers = list(range(10)) as the index and random numbers as the values.
letter = ["A","B","C"]
numbers = list(range(10))
mi = pd.MultiIndex.from_product([letter,numbers])  # build the MultiIndex
s = pd.Series(np.random.rand(30),index=mi)  # random values
print(mi)
print(s)
MultiIndex(levels=[['A', 'B', 'C'], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]],
labels=[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
A 0 0.891213
1 0.879980
2 0.685745
3 0.546840
4 0.251824
5 0.932508
6 0.575709
7 0.422894
8 0.942871
9 0.850407
B 0 0.023095
1 0.853654
2 0.835034
3 0.191593
4 0.486485
5 0.695431
6 0.782958
7 0.938840
8 0.108407
9 0.686810
C 0 0.686884
1 0.512558
2 0.087879
3 0.303796
4 0.272153
5 0.433170
6 0.476725
7 0.602703
8 0.166886
9 0.929835
dtype: float64
- Query a multi-index Series
s.loc[:,[1,3,6]]  # select the values whose inner index is 1, 3, or 6
A 1 0.879980
3 0.546840
6 0.575709
B 1 0.853654
3 0.191593
6 0.782958
C 1 0.512558
3 0.303796
6 0.476725
dtype: float64
- Slice a multi-index Series
s.loc[pd.IndexSlice[:"B",5:]]
A 5 0.932508
6 0.575709
7 0.422894
8 0.942871
9 0.850407
B 5 0.695431
6 0.782958
7 0.938840
8 0.108407
9 0.686810
dtype: float64
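The slicing above combines both index levels; `pd.IndexSlice` can also target a single level. A minimal, self-contained sketch (data is illustrative):

```python
import numpy as np
import pandas as pd

mi = pd.MultiIndex.from_product([["A", "B"], [0, 1, 2]])
s = pd.Series(np.arange(6), index=mi)

outer = s.loc["A"]                  # everything under outer label A
inner = s.loc[pd.IndexSlice[:, 1]]  # inner label 1 across all outer labels
```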
DataFrame with a MultiIndex
- Create a DataFrame with a MultiIndex
Create a DataFrame whose index is built from the letters ['A', 'B'] and the numbers 1-3.
frame = pd.DataFrame(np.arange(12).reshape(6,2),
index = [list("AAABBB"),list("123123")],
columns = ["hello","heitao"])
print(frame)
hello heitao
A 1 0 1
2 2 3
3 4 5
B 1 6 7
2 8 9
3 10 11
- Name the levels of a MultiIndex
frame.index.names = ["first","second"]
frame
| first | second | hello | heitao |
| --- | --- | --- | --- |
| A | 1 | 0 | 1 |
| | 2 | 2 | 3 |
| | 3 | 4 | 5 |
| B | 1 | 6 | 7 |
| | 2 | 8 | 9 |
| | 3 | 10 | 11 |
- Group a multi-index DataFrame by level and sum
a = frame.groupby("second").sum()
b = frame.groupby('first').sum()
print(a)
print(b)
hello heitao
second
1 6 8
2 10 12
3 14 16
hello heitao
first
A 6 9
B 24 27
- Stack DataFrame columns into row labels
print(frame)
frame.stack()
hello heitao
first second
A 1 0 1
2 2 3
3 4 5
B 1 6 7
2 8 9
3 10 11
first second
A 1 hello 0
heitao 1
2 hello 2
heitao 3
3 hello 4
heitao 5
B 1 hello 6
heitao 7
2 hello 8
heitao 9
3 hello 10
heitao 11
dtype: int32
- Unstack row labels into columns
print(frame)
frame.unstack()
hello heitao
first second
A 1 0 1
2 2 3
3 4 5
B 1 6 7
2 8 9
3 10 11
| | hello | hello | hello | heitao | heitao | heitao |
| --- | --- | --- | --- | --- | --- | --- |
| second | 1 | 2 | 3 | 1 | 2 | 3 |
| A | 0 | 2 | 4 | 1 | 3 | 5 |
| B | 6 | 8 | 10 | 7 | 9 | 11 |
- Conditional DataFrame lookup
# sample data
data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(data, index=labels)
df
| | animal | age | visits | priority |
| --- | --- | --- | --- | --- |
| a | cat | 2.5 | 1 | yes |
| b | cat | 3.0 | 3 | yes |
| c | snake | 0.5 | 2 | no |
| d | dog | NaN | 3 | yes |
| e | dog | 5.0 | 2 | no |
| f | cat | 2.0 | 3 | no |
| g | snake | 4.5 | 1 | no |
| h | cat | NaN | 1 | yes |
| i | dog | 7.0 | 2 | no |
| j | dog | 3.0 | 1 | no |
# find all rows where age is greater than 3
df[df["age"]>3]
| | animal | age | visits | priority |
| --- | --- | --- | --- | --- |
| e | dog | 5.0 | 2 | no |
| g | snake | 4.5 | 1 | no |
| i | dog | 7.0 | 2 | no |
- Slice by row and column position
# extract the rows at positions 2-3 and the columns at positions 1-2
df.iloc[2:4,1:3]
| | age | visits |
| --- | --- | --- |
| c | 0.5 | 2 |
| d | NaN | 3 |
- Query a DataFrame with multiple conditions
# find all rows where age < 3 and the animal is a cat
df = pd.DataFrame(data, index=labels)
df[(df['animal'] == 'cat') & (df['age'] < 3)]
| | animal | age | visits | priority |
| --- | --- | --- | --- | --- |
| a | cat | 2.5 | 1 | yes |
| f | cat | 2.0 | 3 | no |
- Query a DataFrame by keyword
df5[df5["animal"].isin(["cat","dog"])]
| | animal | age | visits | priority | No. |
| --- | --- | --- | --- | --- | --- |
| a | cat | 2.5 | 1 | yes | 0 |
| d | dog | NaN | 3 | yes | 3 |
| e | dog | 5.0 | 2 | no | 4 |
| f | cat | 1.5 | 3 | no | 5 |
| h | cat | NaN | 1 | yes | 7 |
| i | dog | 7.0 | 2 | no | 8 |
| j | dog | 3.0 | 1 | no | 9 |
- Query a DataFrame by row label and column name
df.loc[df2.index[[1,4,8]],["animal","age"]]
| | animal | age |
| --- | --- | --- |
| b | cat | 3.0 |
| e | dog | 5.0 |
| i | dog | 7.0 |
- Sort a DataFrame on multiple keys
# sort by age descending, then visits ascending
df.sort_values(by = ["age","visits"],ascending=[False,True])
| | animal | age | visits | priority |
| --- | --- | --- | --- | --- |
| i | dog | 7.0 | 2 | no |
| e | dog | 5.0 | 2 | no |
| g | snake | 4.5 | 1 | no |
| j | dog | 3.0 | 1 | no |
| b | cat | 3.0 | 3 | yes |
| a | cat | 2.5 | 1 | yes |
| f | cat | 2.0 | 3 | no |
| c | snake | 0.5 | 2 | no |
| h | cat | NaN | 1 | yes |
| d | dog | NaN | 3 | yes |
- Replace multiple DataFrame values
# replace yes with True and no with False in the priority column
df["priority"].map({"yes":True,"no":False})
a True
b True
c False
d True
e False
f False
g False
h True
i False
j False
Name: priority, dtype: bool
- Group a DataFrame and sum
df4.groupby("animal").sum()
| animal | age | visits | No. |
| --- | --- | --- | --- |
| 2 | 3.0 | 3 | 1 |
| cat | 4.0 | 5 | 12 |
| dog | 15.0 | 8 | 24 |
| snake | 5.0 | 3 | 8 |
- Concatenate several DataFrames from a list
temp_df1 = pd.DataFrame(np.random.randn(5, 4))  # DataFrame 1 of random values
temp_df2 = pd.DataFrame(np.random.randn(5, 4))  # DataFrame 2 of random values
temp_df3 = pd.DataFrame(np.random.randn(5, 4))  # DataFrame 3 of random values
print(temp_df1)
print(temp_df2)
print(temp_df3)
pieces = [temp_df1, temp_df2, temp_df3]
pd.concat(pieces)  # stack the three DataFrames vertically
0 1 2 3
0 -0.253063 0.954525 -1.172300 0.606517
1 0.463186 -0.906526 0.825088 0.962917
2 -0.713786 0.882766 -0.420237 -1.459033
3 0.403616 1.151884 -0.531666 -0.520436
4 0.470516 2.051418 -0.016450 -1.705573
0 1 2 3
0 0.603258 2.760330 0.057452 0.485384
1 0.105471 -1.443187 -1.700213 -0.575068
2 -0.105902 -1.095674 1.195388 0.628840
3 -0.924175 0.222047 0.784529 -2.560038
4 0.195358 0.912300 0.449167 0.401706
0 1 2 3
0 1.033782 -2.558680 0.560865 0.544706
1 0.191179 -1.389802 0.764125 -0.389027
2 -0.478880 0.187512 2.066756 -0.299723
3 1.023034 -0.049412 0.512549 0.310045
4 0.673487 0.911935 1.784224 -0.303885
- Find the column with the smallest sum
df = pd.DataFrame(np.random.random(size= (5,10)),columns = list("abcdefghij"))
print(df)
df.sum().idxmin()  # idxmax() / idxmin() are Series methods returning the index of the maximum / minimum
a b c d e f g \
0 0.278099 0.959842 0.706874 0.961625 0.212731 0.708068 0.593584
1 0.318883 0.605056 0.866065 0.358940 0.456914 0.435066 0.085913
2 0.336288 0.999118 0.859492 0.066227 0.459371 0.357916 0.791723
3 0.054229 0.213986 0.320105 0.019798 0.940241 0.485156 0.908761
4 0.402781 0.169654 0.526144 0.003792 0.043333 0.143780 0.150949
h i j
0 0.648064 0.383351 0.038961
1 0.111805 0.653432 0.870372
2 0.097711 0.946468 0.545376
3 0.900600 0.130202 0.347659
4 0.468224 0.122481 0.914841
'a'
- Subtract the row mean from every element of a DataFrame
df = pd.DataFrame(np.random.random(size = (5,3)))
print(df)
df.sub(df.mean(axis =1),axis =0)
0 1 2
0 0.271421 0.734225 0.643852
1 0.947422 0.413795 0.300381
2 0.564317 0.080039 0.202109
3 0.265286 0.594217 0.935334
4 0.483712 0.419904 0.448885
| | 0 | 1 | 2 |
| --- | --- | --- | --- |
| 0 | -0.278412 | 0.184392 | 0.094019 |
| 1 | 0.393556 | -0.140071 | -0.253485 |
| 2 | 0.282162 | -0.202116 | -0.080046 |
| 3 | -0.332993 | -0.004062 | 0.337055 |
| 4 | 0.032878 | -0.030930 | -0.001948 |
- Group a DataFrame and sum the three largest values in each group
df = pd.DataFrame({'A': list('aaabbcaabcccbbc'),
'B': [12, 345, 3, 1, 45, 14, 4, 52, 54, 23, 235, 21, 57, 3, 87]})
print(df)
A B
0 a 12
1 a 345
2 a 3
3 b 1
4 b 45
5 c 14
6 a 4
7 a 52
8 b 54
9 c 23
10 c 235
11 c 21
12 b 57
13 b 3
14 c 87
df.groupby("A")["B"].nlargest(3).sum(level =0)
A
a 409
b 156
c 345
Name: B, dtype: int64
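Note that `.sum(level=0)` is an older API (removed in pandas 2.x); the modern equivalent is `groupby(level=0).sum()`. A minimal sketch of the same pattern on smaller, illustrative data:

```python
import pandas as pd

df = pd.DataFrame({"A": list("aabb"), "B": [1, 5, 2, 9]})

# nlargest within each group yields a MultiIndex of (group, original row);
# grouping on level 0 then totals the kept values per group.
top = df.groupby("A")["B"].nlargest(1)
per_group = top.groupby(level=0).sum()
```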
Pivot tables
When analyzing a large dataset, a pivot table (pivot_table) helps reveal relationships between data features without altering the original data.
- Create a pivot table
df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 3,
'B': ['A', 'B', 'C'] * 4,
'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
'D': np.random.randn(12),
'E': np.random.randn(12)})
df
| | A | B | C | D | E |
| --- | --- | --- | --- | --- | --- |
| 0 | one | A | foo | -1.249303 | -1.321149 |
| 1 | one | B | foo | 0.970207 | -0.194715 |
| 2 | two | C | foo | 0.906078 | 0.210847 |
| 3 | three | A | bar | -0.671747 | 0.680809 |
| 4 | one | B | bar | -0.551172 | 0.121524 |
| 5 | one | C | bar | -1.764874 | -0.047319 |
| 6 | two | A | foo | -1.205649 | 0.253456 |
| 7 | three | B | foo | -1.348864 | -0.478615 |
| 8 | one | C | foo | 0.346110 | 1.625945 |
| 9 | one | A | bar | 1.684301 | 0.328802 |
| 10 | two | B | bar | -0.174703 | -0.938865 |
| 11 | three | C | bar | 0.584560 | 0.304935 |
pd.pivot_table(df,index = ["A","B"])
| A | B | D | E |
| --- | --- | --- | --- |
| one | A | 0.217499 | -0.496173 |
| | B | 0.209517 | -0.036595 |
| | C | -0.709382 | 0.789313 |
| three | A | -0.671747 | 0.680809 |
| | B | -1.348864 | -0.478615 |
| | C | 0.584560 | 0.304935 |
| two | A | -1.205649 | 0.253456 |
| | B | -0.174703 | -0.938865 |
| | C | 0.906078 | 0.210847 |
- Aggregate specified columns in a pivot table
Aggregate column D of this DataFrame, indexed by columns A and B; the default aggregation is the mean.
pd.pivot_table(df, values=['D'], index=['A', 'B'])
| A | B | D |
| --- | --- | --- |
| one | A | 0.217499 |
| | B | 0.209517 |
| | C | -0.709382 |
| three | A | -0.671747 |
| | B | -1.348864 |
| | C | 0.584560 |
| two | A | -1.205649 |
| | B | -0.174703 |
| | C | 0.906078 |
- Choose the pivot-table aggregation function
Above, column D was aggregated with the default mean; other aggregations can be specified via aggfunc.
pd.pivot_table(df,values = ["D"],index = ["A","B"],aggfunc = [np.sum,len])
| A | B | sum (D) | len (D) |
| --- | --- | --- | --- |
| one | A | 0.434998 | 2.0 |
| | B | 0.419035 | 2.0 |
| | C | -1.418764 | 2.0 |
| three | A | -0.671747 | 1.0 |
| | B | -1.348864 | 1.0 |
| | C | 0.584560 | 1.0 |
| two | A | -1.205649 | 1.0 |
| | B | -0.174703 | 1.0 |
| | C | 0.906078 | 1.0 |
- Split a pivot table on an extra column
When aggregating D by A and B, the influence of column C on D can be examined by passing C as columns.
pd.pivot_table(df,values = ["D"],index = ["A","B"],
columns = ["C"],aggfunc = np.sum)
| A | B | D (C=bar) | D (C=foo) |
| --- | --- | --- | --- |
| one | A | 1.684301 | -1.249303 |
| | B | -0.551172 | 0.970207 |
| | C | -1.764874 | 0.346110 |
| three | A | -0.671747 | NaN |
| | B | NaN | -1.348864 |
| | C | 0.584560 | NaN |
| two | A | NaN | -1.205649 |
| | B | -0.174703 | NaN |
| | C | NaN | 0.906078 |
- Handle missing values in a pivot table
Depending on the aggregation, some combinations may be missing from the pivot table; fill_value replaces those missing entries.
pd.pivot_table(df, values=['D'], index=['A', 'B'],
columns=['C'], aggfunc=np.sum, fill_value=0)
| A | B | D (C=bar) | D (C=foo) |
| --- | --- | --- | --- |
| one | A | 1.684301 | -1.249303 |
| | B | -0.551172 | 0.970207 |
| | C | -1.764874 | 0.346110 |
| three | A | -0.671747 | 0.000000 |
| | B | 0.000000 | -1.348864 |
| | C | 0.584560 | 0.000000 |
| two | A | 0.000000 | -1.205649 |
| | B | -0.174703 | 0.000000 |
| | C | 0.000000 | 0.906078 |
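The pivot-table parameters above can be exercised end to end on a tiny, illustrative dataset:

```python
import pandas as pd

df = pd.DataFrame({"A": ["one", "one", "two"],
                   "C": ["bar", "foo", "bar"],
                   "D": [1.0, 2.0, 3.0]})

# Rows indexed by A, columns split on C, D summed; absent combinations become 0.
pt = pd.pivot_table(df, values="D", index="A", columns="C",
                    aggfunc="sum", fill_value=0)
```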
Categorical type
Data broadly divides into quantitative values, whose range can vary, and qualitative values drawn from a fixed set of possibilities; categorical data is a kind of qualitative data.
df = pd.DataFrame({"id":[1,2,3,4,5,6],
"raw_grade":["a","b","b","a","a","e"]})
df["grade"] = df["raw_grade"].astype("category")
df
| | id | raw_grade | grade |
| --- | --- | --- | --- |
| 0 | 1 | a | a |
| 1 | 2 | b | b |
| 2 | 3 | b | b |
| 3 | 4 | a | a |
| 4 | 5 | a | a |
| 5 | 6 | e | e |
- Rename categorical values
df["grade"].cat.categories = ["very good","good","very bad"]
df
| | id | raw_grade | grade |
| --- | --- | --- | --- |
| 0 | 1 | a | very good |
| 1 | 2 | b | good |
| 2 | 3 | b | good |
| 3 | 4 | a | very good |
| 4 | 5 | a | very good |
| 5 | 6 | e | very bad |
- Reorder the categories and add the missing ones
df["grade"] = df["grade"].cat.set_categories(
["very bad", "bad", "medium", "good", "very good"])
df
| | id | raw_grade | grade |
| --- | --- | --- | --- |
| 0 | 1 | a | very good |
| 1 | 2 | b | good |
| 2 | 3 | b | good |
| 3 | 4 | a | very good |
| 4 | 5 | a | very good |
| 5 | 6 | e | very bad |
- Sort categorical data
df.sort_values(by = "grade")
| | id | raw_grade | grade |
| --- | --- | --- | --- |
| 5 | 6 | e | very bad |
| 1 | 2 | b | good |
| 2 | 3 | b | good |
| 0 | 1 | a | very good |
| 3 | 4 | a | very good |
| 4 | 5 | a | very good |
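The sort above follows the declared category order (very bad < ... < very good), not alphabetical order. The same idea in a minimal sketch, built with `pd.CategoricalDtype` (the data is illustrative):

```python
import pandas as pd

# An ordered categorical: comparisons and sorting follow the declared order,
# not alphabetical order.
dtype = pd.CategoricalDtype(["low", "medium", "high"], ordered=True)
s = pd.Series(["high", "low", "medium"]).astype(dtype)

ordered = s.sort_values()
```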
Data cleaning
The data we obtain often fails to meet our processing requirements: it contains missing values and bad records that must be cleaned first.
- Fill missing values by interpolation
FlightNumber has missing values; the numbers increase by 10, so fill in the missing entries accordingly and make the column int-typed.
df = pd.DataFrame({'From_To': ['LoNDon_paris', 'MAdrid_miLAN', 'londON_StockhOlm',
'Budapest_PaRis', 'Brussels_londOn'],
'FlightNumber': [10045, np.nan, 10065, np.nan, 10085],
'RecentDelays': [[23, 47], [], [24, 43, 87], [13], [67, 32]],
'Airline': ['KLM(!)', '<Air France> (12)', '(British Airways. )',
'12. Air France', '"Swiss Air"']})
df['FlightNumber'] = df['FlightNumber'].interpolate().astype(int)
df
| | From_To | FlightNumber | RecentDelays | Airline |
| --- | --- | --- | --- | --- |
| 0 | LoNDon_paris | 10045 | [23, 47] | KLM(!) |
| 1 | MAdrid_miLAN | 10055 | [] | <Air France> (12) |
| 2 | londON_StockhOlm | 10065 | [24, 43, 87] | (British Airways. ) |
| 3 | Budapest_PaRis | 10075 | [13] | 12. Air France |
| 4 | Brussels_londOn | 10085 | [67, 32] | "Swiss Air" |
- Split a data column
From_To should really be two independent columns, From and To; split From_To on the underscore into a new two-column table.
temp = df.From_To.str.split("_",expand = True)
temp.columns = ["From","To"]
temp
| | From | To |
| --- | --- | --- |
| 0 | LoNDon | paris |
| 1 | MAdrid | miLAN |
| 2 | londON | StockhOlm |
| 3 | Budapest | PaRis |
| 4 | Brussels | londOn |
- Standardize strings
Note that the place names are inconsistently cased (e.g. londON should be London); standardize them.
temp['From'] = temp['From'].str.capitalize()
temp['To'] = temp['To'].str.capitalize()
temp
| | From | To |
| --- | --- | --- |
| 0 | London | Paris |
| 1 | Madrid | Milan |
| 2 | London | Stockholm |
| 3 | Budapest | Paris |
| 4 | Brussels | London |
- Drop the bad column and join the cleaned data
Drop the original From_To column and join the cleaned From and To columns.
df = df.drop("From_To",axis=1)
df = df.join(temp)
df
| | FlightNumber | RecentDelays | Airline | From | To |
| --- | --- | --- | --- | --- | --- |
| 0 | 10045 | [23, 47] | KLM(!) | London | Paris |
| 1 | 10055 | [] | <Air France> (12) | Madrid | Milan |
| 2 | 10065 | [24, 43, 87] | (British Airways. ) | London | Stockholm |
| 3 | 10075 | [13] | 12. Air France | Budapest | Paris |
| 4 | 10085 | [67, 32] | "Swiss Air" | Brussels | London |
- Strip extraneous characters
Many entries in the Airline column contain extra characters that would significantly hamper later analysis; this kind of data needs to be cleaned up.
df['Airline'] = df['Airline'].str.extract(
'([a-zA-Z\s]+)', expand=False).str.strip()
df
| | FlightNumber | RecentDelays | Airline | From | To |
| --- | --- | --- | --- | --- | --- |
| 0 | 10045 | [23, 47] | KLM | London | Paris |
| 1 | 10055 | [] | Air France | Madrid | Milan |
| 2 | 10065 | [24, 43, 87] | British Airways | London | Stockholm |
| 3 | 10075 | [13] | Air France | Budapest | Paris |
| 4 | 10085 | [67, 32] | Swiss Air | Brussels | London |
- Normalize the layout
RecentDelays stores lists of varying length, which complicates later analysis. Expand the lists so that elements at the same position form a column, using NaN where a value is absent.
delays = df["RecentDelays"].apply(pd.Series)
delays.columns = ["delay_{}".format(n) for n in range(1,len(delays.columns)+1)]
delays
| | delay_1 | delay_2 | delay_3 |
| --- | --- | --- | --- |
| 0 | 23.0 | 47.0 | NaN |
| 1 | NaN | NaN | NaN |
| 2 | 24.0 | 43.0 | 87.0 |
| 3 | 13.0 | NaN | NaN |
| 4 | 67.0 | 32.0 | NaN |
df = df.drop("RecentDelays",axis =1).join(delays)
print(df)
FlightNumber Airline From To delay_1 delay_2 \
0 10045 KLM London Paris 23.0 47.0
1 10055 Air France Madrid Milan NaN NaN
2 10065 British Airways London Stockholm 24.0 43.0
3 10075 Air France Budapest Paris 13.0 NaN
4 10085 Swiss Air Brussels London 67.0 32.0
delay_3
0 NaN
1 NaN
2 87.0
3 NaN
4 NaN
Data preprocessing
- Binning values
Math grades for part of a class are shown in the table below.
df=pd.DataFrame({'name':['Alice','Bob','Candy','Dany','Ella','Frank','Grace','Jenny'],'grades':[58,83,79,65,93,45,61,88]})
df
| | name | grades |
| --- | --- | --- |
| 0 | Alice | 58 |
| 1 | Bob | 83 |
| 2 | Candy | 79 |
| 3 | Dany | 65 |
| 4 | Ella | 93 |
| 5 | Frank | 45 |
| 6 | Grace | 61 |
| 7 | Jenny | 88 |
What we care about more is whether each student passed, so recode the grades by whether they are above 60.
df = pd.DataFrame({'name': ['Alice', 'Bob', 'Candy', 'Dany', 'Ella',
'Frank', 'Grace', 'Jenny'], 'grades': [58, 83, 79, 65, 93, 45, 61, 88]})
def choice(x):
if x > 60:
return 1
else:
return 0
df.grades = pd.Series(map(lambda x: choice(x), df.grades))
df
| | name | grades |
| --- | --- | --- |
| 0 | Alice | 0 |
| 1 | Bob | 1 |
| 2 | Candy | 1 |
| 3 | Dany | 1 |
| 4 | Ella | 1 |
| 5 | Frank | 0 |
| 6 | Grace | 1 |
| 7 | Jenny | 1 |
- Deduplicate data
# given a DataFrame with a column A, shown below, remove the consecutively repeated values from column A
df = pd.DataFrame({'A': [1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7]})
df
| | A |
| --- | --- |
| 0 | 1 |
| 1 | 2 |
| 2 | 2 |
| 3 | 3 |
| 4 | 4 |
| 5 | 5 |
| 6 | 5 |
| 7 | 5 |
| 8 | 6 |
| 9 | 7 |
| 10 | 7 |
df = pd.DataFrame({'A': [1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7]})
df.loc[df['A'].shift() != df['A']]
| | A |
| --- | --- |
| 0 | 1 |
| 1 | 2 |
| 3 | 3 |
| 4 | 4 |
| 5 | 5 |
| 8 | 6 |
| 9 | 7 |
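How the `shift` trick works deserves a closer look; note it removes only consecutive duplicates, not all duplicates. A minimal sketch with illustrative data containing a repeated-but-not-adjacent value:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 2, 3, 3, 3, 1]})

# shift() moves each value down one row, so the comparison keeps only rows
# that differ from their immediate predecessor. The trailing 1 survives:
# it is a duplicate overall but not a consecutive one.
deduped = df.loc[df["A"].shift() != df["A"]]
```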
- Normalize data
Sometimes the value ranges of different DataFrame columns differ too much, and the data needs to be normalized. Max-Min normalization is a simple and common choice, with the formula:
Y = (X - Xmin) / (Xmax - Xmin)
def normalization(df):
numerator = df.sub(df.min())
denominator = (df.max()).sub(df.min())
Y = numerator.div(denominator)
return Y
df = pd.DataFrame(np.random.random(size=(5, 3)))
print(df)
normalization(df)
0 1 2
0 0.785455 0.116930 0.317938
1 0.723757 0.272358 0.961036
2 0.032597 0.493209 0.258063
3 0.515626 0.771451 0.379079
4 0.094327 0.641445 0.968139
| | 0 | 1 | 2 |
| --- | --- | --- | --- |
| 0 | 1.000000 | 0.000000 | 0.084322 |
| 1 | 0.918048 | 0.237469 | 0.989997 |
| 2 | 0.000000 | 0.574892 | 0.000000 |
| 3 | 0.641593 | 1.000000 | 0.170427 |
| 4 | 0.081994 | 0.801372 | 1.000000 |
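The normalization function can be verified on fixed data, where the expected result is known in advance (the helper name here is illustrative):

```python
import pandas as pd

def min_max(df):
    # Column-wise Max-Min scaling: (X - Xmin) / (Xmax - Xmin)
    return (df - df.min()) / (df.max() - df.min())

# Two columns with very different scales both map onto [0, 1]
df = pd.DataFrame({"a": [0.0, 5.0, 10.0], "b": [100.0, 150.0, 200.0]})
scaled = min_max(df)
```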
Plotting with Pandas
To better understand the information a dataset contains, the most intuitive approach is to plot it.
- Series line plot
%matplotlib inline
ts = pd.Series(np.random.randn(100), index=pd.date_range('today', periods=100))
ts = ts.cumsum()
ts.plot()
- DataFrame line plot
df = pd.DataFrame(np.random.randn(100, 4), index=ts.index,
columns=['A', 'B', 'C', 'D'])
df = df.cumsum()
df.plot()
- DataFrame scatter plot
df = pd.DataFrame({"xs": [1, 5, 2, 8, 1], "ys": [4, 2, 1, 9, 6]})
df = df.cumsum()
df.plot.scatter("xs", "ys", color='red', marker="*")
- DataFrame bar chart
df = pd.DataFrame({"revenue": [57, 68, 63, 71, 72, 90, 80, 62, 59, 51, 47, 52],
"advertising": [2.1, 1.9, 2.7, 3.0, 3.6, 3.2, 2.7, 2.4, 1.8, 1.6, 1.3, 1.9],
"month": range(12)
})
ax = df.plot.bar("month", "revenue", color="yellow")
df.plot("month", "advertising", secondary_y=True, ax=ax)
This article is mainly a record of my learning process. If it infringes any rights, please contact the author for removal. Original source: https://www.shiyanlou.com/courses/