Introduction
Pandas is a data-processing tool built on top of NumPy, created to support data-analysis tasks. It incorporates a large number of libraries and some standard data models, and provides the functions and methods needed to operate efficiently on large datasets.
Pandas data structures: Pandas mainly provides Series (1-D arrays), DataFrame (2-D arrays), Panel (3-D arrays), Panel4D (4-D arrays), PanelND (higher-dimensional arrays), and other structures. Of these, Series and DataFrame are by far the most widely used.
- A Series is a one-dimensional labeled array that can hold any data type: integers, strings, floating-point numbers, Python objects, and so on. Elements of a Series can be located by label.
- A DataFrame is a two-dimensional labeled data structure. Data can be located by label, which plain NumPy does not support.
Key Points
- Creating a Series
- Basic Series operations
- Creating a DataFrame
- Basic DataFrame operations
- DataFrame file operations
- Series, DataFrame, and multi-level indexes
- Pivot tables
- Data cleaning
- Data preprocessing
- Visualization
Environment
Basics
Importing the Pandas module
- Import Pandas
Before practicing with Pandas, first import the Pandas module, conventionally abbreviated as pd.
import pandas as pd
- Check the Pandas version
pd.__version__
'0.23.4'
Creating the Series data type
In Pandas, a Series can be viewed as a dataset consisting of a single column of data.
- Create a Series from a list
arr = [0,1,2,3,4]
s1 = pd.Series(arr)
s1
print("Note: the first column (0-4) is the index of this Series; the second column (0-4) holds its values.")
Note: the first column (0-4) is the index of this Series; the second column (0-4) holds its values.
- Create a Series from an ndarray
import numpy as np
n = np.random.randn(5)  # create a random ndarray
index = ["a","b","c","d","e"]
s2 = pd.Series(n,index)
s2
a -0.592792
b 2.177081
c 2.292780
d -0.571098
e -1.295833
dtype: float64
- Create a Series from a dict
d= {"a":1, "b":2, "c":3,"d":4,"e":5 }
s3 = pd.Series(d)
s3
a 1
b 2
c 3
d 4
e 5
dtype: int64
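The three creation paths above can be combined into one minimal, self-contained sketch (variable names here are illustrative, not from the original):

```python
import numpy as np
import pandas as pd

# From a list: a default integer index 0..n-1 is generated
s_list = pd.Series([0, 1, 2, 3, 4])

# From an ndarray, with an explicit label index
s_arr = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])

# From a dict: the keys become the index
s_dict = pd.Series({"a": 1, "b": 2, "c": 3})
```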
Basic Series operations
- Modify the Series index
s1
0 0
1 1
2 2
3 3
4 4
dtype: int64
s1.index = ["a","b","c","d","e"]
s1
a 0
b 1
c 2
d 3
e 4
dtype: int64
- Concatenate Series vertically
s2
a -0.592792
b 2.177081
c 2.292780
d -0.571098
e -1.295833
dtype: float64
s4 = s3.append(s2)  # append s2 to s3 (in newer pandas, prefer pd.concat([s3, s2]))
s4
a 1.000000
b 2.000000
c 3.000000
d 4.000000
e 5.000000
a -0.592792
b 2.177081
c 2.292780
d -0.571098
e -1.295833
dtype: float64
- Drop Series elements by index
s4
a 1.000000
b 2.000000
c 3.000000
d 4.000000
e 5.000000
a -0.592792
b 2.177081
c 2.292780
d -0.571098
e -1.295833
dtype: float64
s4.drop("e")  # drop the values at index e (returns a new Series; s4 is unchanged)
a 1.000000
b 2.000000
c 3.000000
d 4.000000
a -0.592792
b 2.177081
c 2.292780
d -0.571098
dtype: float64
- Modify Series elements by index
s4["a"] = 6
s4
a 6.000000
b 2.000000
c 3.000000
d 4.000000
e 5.000000
a 6.000000
b 2.177081
c 2.292780
d -0.571098
e -1.295833
dtype: float64
- Look up Series elements by index
s4["a"]
a 6.0
a 6.0
dtype: float64
- Series slicing
s4[:3]  # access the first 3 values of s4
a 6.0
b 2.0
c 3.0
dtype: float64
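One subtlety worth noting with slices like `s4[:3]`: plain integer slices are position-based and exclude the endpoint, while label slices via `.loc` include it. A small sketch (names here are illustrative):

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

head = s[:2]          # positions 0-1: values 10, 20
mid = s.loc["b":"d"]  # labels b through d, endpoint included: 20, 30, 40
```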
Series arithmetic
- Series addition
# Series addition aligns on the index; labels present on only one side yield NaN.
s4,s3
(a 6.000000
b 2.000000
c 3.000000
d 4.000000
e 5.000000
a 6.000000
b 2.177081
c 2.292780
d -0.571098
e -1.295833
dtype: float64, a 1
b 2
c 3
d 4
e 5
dtype: int64)
s4.add(s3)
a 7.000000
a 7.000000
b 4.000000
b 4.177081
c 6.000000
c 5.292780
d 8.000000
d 3.428902
e 10.000000
e 3.704167
dtype: float64
- Series subtraction
# Series subtraction aligns on the index; non-matching labels yield NaN.
s4.sub(s3)
a 5.000000
a 5.000000
b 0.000000
b 0.177081
c 0.000000
c -0.707220
d 0.000000
d -4.571098
e 0.000000
e -6.295833
dtype: float64
- Series multiplication
# Series multiplication aligns on the index; non-matching labels yield NaN.
s4.mul(s3)
a 6.000000
a 6.000000
b 4.000000
b 4.354163
c 9.000000
c 6.878340
d 16.000000
d -2.284393
e 25.000000
e -6.479165
dtype: float64
- Series division
# Series division aligns on the index; non-matching labels yield NaN.
s4.div(s3)
a 6.000000
a 6.000000
b 1.000000
b 1.088541
c 1.000000
c 0.764260
d 1.000000
d -0.142775
e 1.000000
e -0.259167
dtype: float64
- Series median
s4.median()
2.646389957854695
- Series sum
s4.sum()
28.60293011299616
- Series maximum
s4.max()
6.0
- Series minimum
s4.min()
-1.2958329899706196
Creating the DataFrame data type
Unlike a Series, a DataFrame can hold multiple columns. In practice, DataFrame is also the more commonly used of the two.
- Create a DataFrame from a NumPy array
dates = pd.date_range("today", periods=6, freq="D")  # a time series to use as the index
num_arr = np.random.randn(6, 4)  # a random NumPy array of values
columns = ["A", "B", "C", "D"]  # a list to use as column names
df1 = pd.DataFrame(num_arr,index = dates,columns=columns)
df1
| | A | B | C | D |
| --- | --- | --- | --- | --- |
| 2019-03-26 15:44:37.283176 | -0.379061 | -0.905665 | -0.241411 | -0.058220 |
| 2019-03-27 15:44:37.283176 | 0.399333 | -1.135517 | -1.946507 | -1.126837 |
| 2019-03-28 15:44:37.283176 | -1.127666 | 0.268617 | -0.974614 | 0.151761 |
| 2019-03-29 15:44:37.283176 | -0.521538 | -0.644654 | 0.045048 | -1.206884 |
| 2019-03-30 15:44:37.283176 | -0.419905 | -0.145953 | 1.989872 | -0.794930 |
| 2019-03-31 15:44:37.283176 | 1.703039 | 1.587834 | 0.867288 | -1.096028 |
- Create a DataFrame from a dict of arrays
data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df2 = pd.DataFrame(data, index=labels)
df2
| | animal | age | visits | priority |
| --- | --- | --- | --- | --- |
| a | cat | 2.5 | 1 | yes |
| b | cat | 3.0 | 3 | yes |
| c | snake | 0.5 | 2 | no |
| d | dog | NaN | 3 | yes |
| e | dog | 5.0 | 2 | no |
| f | cat | 2.0 | 3 | no |
| g | snake | 4.5 | 1 | no |
| h | cat | NaN | 1 | yes |
| i | dog | 7.0 | 2 | no |
| j | dog | 3.0 | 1 | no |
- View the DataFrame's data types
df2.dtypes
animal object
age float64
visits int64
priority object
dtype: object
Basic DataFrame operations
- Preview the first 5 rows of a DataFrame
df2.head()  # shows 5 rows by default; pass a number to preview a different count
| | animal | age | visits | priority |
| --- | --- | --- | --- | --- |
| a | cat | 2.5 | 1 | yes |
| b | cat | 3.0 | 3 | yes |
| c | snake | 0.5 | 2 | no |
| d | dog | NaN | 3 | yes |
| e | dog | 5.0 | 2 | no |
- View the last 3 rows of a DataFrame
df2.tail(3)
| | animal | age | visits | priority |
| --- | --- | --- | --- | --- |
| h | cat | NaN | 1 | yes |
| i | dog | 7.0 | 2 | no |
| j | dog | 3.0 | 1 | no |
- View the DataFrame index
df2.index
Index(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'], dtype='object')
- View the DataFrame column names
df2.columns
Index(['animal', 'age', 'visits', 'priority'], dtype='object')
- View the DataFrame values
df2.values
array([['cat', 2.5, 1, 'yes'],
['cat', 3.0, 3, 'yes'],
['snake', 0.5, 2, 'no'],
['dog', nan, 3, 'yes'],
['dog', 5.0, 2, 'no'],
['cat', 2.0, 3, 'no'],
['snake', 4.5, 1, 'no'],
['cat', nan, 1, 'yes'],
['dog', 7.0, 2, 'no'],
['dog', 3.0, 1, 'no']], dtype=object)
- View DataFrame summary statistics
df2.describe()
| | age | visits |
| --- | --- | --- |
| count | 8.000000 | 10.000000 |
| mean | 3.437500 | 1.900000 |
| std | 2.007797 | 0.875595 |
| min | 0.500000 | 1.000000 |
| 25% | 2.375000 | 1.000000 |
| 50% | 3.000000 | 2.000000 |
| 75% | 4.625000 | 2.750000 |
| max | 7.000000 | 3.000000 |
- Sort a DataFrame by column
df2.sort_values(by="age")  # sort by age, ascending by default
| | animal | age | visits | priority |
| --- | --- | --- | --- | --- |
| c | snake | 0.5 | 2 | no |
| f | cat | 2.0 | 3 | no |
| a | cat | 2.5 | 1 | yes |
| b | cat | 3.0 | 3 | yes |
| j | dog | 3.0 | 1 | no |
| g | snake | 4.5 | 1 | no |
| e | dog | 5.0 | 2 | no |
| i | dog | 7.0 | 2 | no |
| d | dog | NaN | 3 | yes |
| h | cat | NaN | 1 | yes |
- Slice DataFrame rows
df2[1:3]
| | animal | age | visits | priority |
| --- | --- | --- | --- | --- |
| b | cat | 3.0 | 3 | yes |
| c | snake | 0.5 | 2 | no |
- Query a DataFrame by label (single column)
df2.age
a 2.5
b 3.0
c 0.5
d NaN
e 5.0
f 2.0
g 4.5
h NaN
i 7.0
j 3.0
Name: age, dtype: float64
df2["age"]
a 2.5
b 3.0
c 0.5
d NaN
e 5.0
f 2.0
g 4.5
h NaN
i 7.0
j 3.0
Name: age, dtype: float64
- Query a DataFrame by label (multiple columns)
df2[["age","animal"]]  # pass a list of column names
| | age | animal |
| --- | --- | --- |
| a | 2.5 | cat |
| b | 3.0 | cat |
| c | 0.5 | snake |
| d | NaN | dog |
| e | 5.0 | dog |
| f | 2.0 | cat |
| g | 4.5 | snake |
| h | NaN | cat |
| i | 7.0 | dog |
| j | 3.0 | dog |
- Query a DataFrame by position
df2.iloc[1:3]  # query rows 2-3
| | animal | age | visits | priority |
| --- | --- | --- | --- | --- |
| b | cat | 3.0 | 3 | yes |
| c | snake | 0.5 | 2 | no |
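The difference between label-based and position-based row selection is easy to trip over: `.loc` slices by label and includes the endpoint, while `.iloc` slices by position and excludes it. A small sketch (data is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]}, index=["a", "b", "c"])

by_label = df.loc["a":"b"]  # label slice, endpoint included: rows a and b
by_pos = df.iloc[0:1]       # position slice, endpoint excluded: row a only
```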
- Copy a DataFrame
# make a copy of the DataFrame so the dataset can be used by several independent workflows
df5 = df2.copy()
df5
| | animal | age | visits | priority |
| --- | --- | --- | --- | --- |
| a | cat | 2.5 | 1 | yes |
| b | cat | 3.0 | 3 | yes |
| c | snake | 0.5 | 2 | no |
| d | dog | NaN | 3 | yes |
| e | dog | 5.0 | 2 | no |
| f | cat | 2.0 | 3 | no |
| g | snake | 4.5 | 1 | no |
| h | cat | NaN | 1 | yes |
| i | dog | 7.0 | 2 | no |
| j | dog | 3.0 | 1 | no |
- Test DataFrame elements for missing values
df5.isnull()  # returns True where a value is missing
| | animal | age | visits | priority |
| --- | --- | --- | --- | --- |
| a | False | False | False | False |
| b | False | False | False | False |
| c | False | False | False | False |
| d | False | True | False | False |
| e | False | False | False | False |
| f | False | False | False | False |
| g | False | False | False | False |
| h | False | True | False | False |
| i | False | False | False | False |
| j | False | False | False | False |
- Add a column
num = pd.Series([0,1,2,3,4,5,6,7,8,9],index = df5.index)
df5["No."] = num
df5
| | animal | age | visits | priority | No. |
| --- | --- | --- | --- | --- | --- |
| a | cat | 2.5 | 1 | yes | 0 |
| b | cat | 3.0 | 3 | yes | 1 |
| c | snake | 0.5 | 2 | no | 2 |
| d | dog | NaN | 3 | yes | 3 |
| e | dog | 5.0 | 2 | no | 4 |
| f | cat | 2.0 | 3 | no | 5 |
| g | snake | 4.5 | 1 | no | 6 |
| h | cat | NaN | 1 | yes | 7 |
| i | dog | 7.0 | 2 | no | 8 |
| j | dog | 3.0 | 1 | no | 9 |
- Modify a value by DataFrame position
# change the value at row 2, column 1 from 'cat' to 2
df5.iat[1,0] =2
df5
| | animal | age | visits | priority | No. |
| --- | --- | --- | --- | --- | --- |
| a | cat | 2.5 | 1 | yes | 0 |
| b | 2 | 3.0 | 3 | yes | 1 |
| c | snake | 0.5 | 2 | no | 2 |
| d | dog | NaN | 3 | yes | 3 |
| e | dog | 5.0 | 2 | no | 4 |
| f | cat | 2.0 | 3 | no | 5 |
| g | snake | 4.5 | 1 | no | 6 |
| h | cat | NaN | 1 | yes | 7 |
| i | dog | 7.0 | 2 | no | 8 |
| j | dog | 3.0 | 1 | no | 9 |
- Modify data by DataFrame label
df5.loc["f","age"] = 1.5
df5
| | animal | age | visits | priority | No. |
| --- | --- | --- | --- | --- | --- |
| a | cat | 2.5 | 1 | yes | 0 |
| b | 2 | 3.0 | 3 | yes | 1 |
| c | snake | 0.5 | 2 | no | 2 |
| d | dog | NaN | 3 | yes | 3 |
| e | dog | 5.0 | 2 | no | 4 |
| f | cat | 1.5 | 3 | no | 5 |
| g | snake | 4.5 | 1 | no | 6 |
| h | cat | NaN | 1 | yes | 7 |
| i | dog | 7.0 | 2 | no | 8 |
| j | dog | 3.0 | 1 | no | 9 |
- DataFrame mean
df5.mean()
age 3.375
visits 1.900
No. 4.500
dtype: float64
- Sum any given DataFrame column
df5["visits"].sum()
19
String operations
- Convert strings to lowercase
string = pd.Series(["A",'B','C','Aaba', 'Baca',np.nan, 'CABA', 'dog', 'cat'])
string
0 A
1 B
2 C
3 Aaba
4 Baca
5 NaN
6 CABA
7 dog
8 cat
dtype: object
string.str.lower()
0 a
1 b
2 c
3 aaba
4 baca
5 NaN
6 caba
7 dog
8 cat
dtype: object
- Convert strings to uppercase
string
0 A
1 B
2 C
3 Aaba
4 Baca
5 NaN
6 CABA
7 dog
8 cat
dtype: object
string.str.upper()
0 A
1 B
2 C
3 AABA
4 BACA
5 NaN
6 CABA
7 DOG
8 CAT
dtype: object
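As the outputs above show, the vectorized `.str` methods propagate NaN instead of raising on missing entries. A minimal sketch of this behavior (data is illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series(["Hello", "WORLD", np.nan])

lower = s.str.lower()  # NaN entries pass through unchanged
lens = s.str.len()     # element-wise string length; NaN stays NaN
```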
DataFrame missing-value operations
- Fill missing values
df4 = df5.copy()
df4
| | animal | age | visits | priority | No. |
| --- | --- | --- | --- | --- | --- |
| a | cat | 2.5 | 1 | yes | 0 |
| b | 2 | 3.0 | 3 | yes | 1 |
| c | snake | 0.5 | 2 | no | 2 |
| d | dog | NaN | 3 | yes | 3 |
| e | dog | 5.0 | 2 | no | 4 |
| f | cat | 1.5 | 3 | no | 5 |
| g | snake | 4.5 | 1 | no | 6 |
| h | cat | NaN | 1 | yes | 7 |
| i | dog | 7.0 | 2 | no | 8 |
| j | dog | 3.0 | 1 | no | 9 |
df4.fillna(value = 3)
| | animal | age | visits | priority | No. |
| --- | --- | --- | --- | --- | --- |
| a | cat | 2.5 | 1 | yes | 0 |
| b | 2 | 3.0 | 3 | yes | 1 |
| c | snake | 0.5 | 2 | no | 2 |
| d | dog | 3.0 | 3 | yes | 3 |
| e | dog | 5.0 | 2 | no | 4 |
| f | cat | 1.5 | 3 | no | 5 |
| g | snake | 4.5 | 1 | no | 6 |
| h | cat | 3.0 | 1 | yes | 7 |
| i | dog | 7.0 | 2 | no | 8 |
| j | dog | 3.0 | 1 | no | 9 |
- Drop rows containing missing values
df6 = df4.copy()
print(df5)
df5.dropna(how="any")  # drop every row that contains a NaN
animal age visits priority No.
a cat 2.5 1 yes 0
b 2 3.0 3 yes 1
c snake 0.5 2 no 2
d dog NaN 3 yes 3
e dog 5.0 2 no 4
f cat 1.5 3 no 5
g snake 4.5 1 no 6
h cat NaN 1 yes 7
i dog 7.0 2 no 8
j dog 3.0 1 no 9
| | animal | age | visits | priority | No. |
| --- | --- | --- | --- | --- | --- |
| a | cat | 2.5 | 1 | yes | 0 |
| b | 2 | 3.0 | 3 | yes | 1 |
| c | snake | 0.5 | 2 | no | 2 |
| e | dog | 5.0 | 2 | no | 4 |
| f | cat | 1.5 | 3 | no | 5 |
| g | snake | 4.5 | 1 | no | 6 |
| i | dog | 7.0 | 2 | no | 8 |
| j | dog | 3.0 | 1 | no | 9 |
- Align DataFrames on a specified column
left = pd.DataFrame({"key":["foo1","foo2"],"one":[1,2]})
right = pd.DataFrame({"key":["foo2","foo3"],"two":[4,5]})
print(right)
print(left)
key two
0 foo2 4
1 foo3 5
key one
0 foo1 1
1 foo2 2
pd.merge(left,right,on = "key")
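`pd.merge` defaults to an inner join, keeping only keys present in both frames; the `how` parameter selects other join types. A minimal sketch using the same data as above:

```python
import pandas as pd

left = pd.DataFrame({"key": ["foo1", "foo2"], "one": [1, 2]})
right = pd.DataFrame({"key": ["foo2", "foo3"], "two": [4, 5]})

inner = pd.merge(left, right, on="key")               # default how="inner": common keys only
outer = pd.merge(left, right, on="key", how="outer")  # union of keys; gaps become NaN
```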
DataFrame file operations
- Write to a CSV file
df5.to_csv('animal.csv')
print("Write succeeded.")
Write succeeded.
- Read from a CSV file
df_animal = pd.read_csv("animal.csv")
df_animal
| | Unnamed: 0 | animal | age | visits | priority | No. |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | a | cat | 2.5 | 1 | yes | 0 |
| 1 | b | 2 | 3.0 | 3 | yes | 1 |
| 2 | c | snake | 0.5 | 2 | no | 2 |
| 3 | d | dog | NaN | 3 | yes | 3 |
| 4 | e | dog | 5.0 | 2 | no | 4 |
| 5 | f | cat | 1.5 | 3 | no | 5 |
| 6 | g | snake | 4.5 | 1 | no | 6 |
| 7 | h | cat | NaN | 1 | yes | 7 |
| 8 | i | dog | 7.0 | 2 | no | 8 |
| 9 | j | dog | 3.0 | 1 | no | 9 |
- Write to an Excel file
df5.to_excel("animal.xlsx",sheet_name="Sheet1")
print("Write succeeded.")
Write succeeded.
- Read from an Excel file
pd.read_excel("animal.xlsx","Sheet1",index_col = None,na_values=["NA"])
| | animal | age | visits | priority | No. |
| --- | --- | --- | --- | --- | --- |
| a | cat | 2.5 | 1 | yes | 0 |
| b | 2 | 3.0 | 3 | yes | 1 |
| c | snake | 0.5 | 2 | no | 2 |
| d | dog | NaN | 3 | yes | 3 |
| e | dog | 5.0 | 2 | no | 4 |
| f | cat | 1.5 | 3 | no | 5 |
| g | snake | 4.5 | 1 | no | 6 |
| h | cat | NaN | 1 | yes | 7 |
| i | dog | 7.0 | 2 | no | 8 |
| j | dog | 3.0 | 1 | no | 9 |
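Note that the CSV round-trip above produced an extra "Unnamed: 0" column: `to_csv` writes the index as an unnamed first column, and `read_csv` does not restore it unless told to. A small sketch using an in-memory buffer (the buffer is illustrative; a filename works the same way):

```python
import io

import pandas as pd

df = pd.DataFrame({"a": [1, 2]}, index=["x", "y"])

buf = io.StringIO()
df.to_csv(buf)  # the index is written as the first, unnamed column
buf.seek(0)

# Without index_col, the saved index reappears as a column named "Unnamed: 0";
# index_col=0 restores it as the index instead.
restored = pd.read_csv(buf, index_col=0)
```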
Advanced
Time-series indexing
- Build a Series indexed by every day of January 2019, with random values
dti = pd.date_range(start = "2019-1-1",end = "2019-1-31",freq = "D")
s = pd.Series(np.random.rand(len(dti)),index =dti)
s
2019-01-01 0.867844
2019-01-02 0.255206
2019-01-03 0.520293
2019-01-04 0.256538
2019-01-05 0.071561
2019-01-06 0.037585
2019-01-07 0.111249
2019-01-08 0.656317
2019-01-09 0.094572
2019-01-10 0.807796
2019-01-11 0.546557
2019-01-12 0.229993
2019-01-13 0.872322
2019-01-14 0.205696
2019-01-15 0.602794
2019-01-16 0.043299
2019-01-17 0.589059
2019-01-18 0.616598
2019-01-19 0.661684
2019-01-20 0.899759
2019-01-21 0.358859
2019-01-22 0.573127
2019-01-23 0.336021
2019-01-24 0.061472
2019-01-25 0.667695
2019-01-26 0.874072
2019-01-27 0.402528
2019-01-28 0.868478
2019-01-29 0.903817
2019-01-30 0.020582
2019-01-31 0.999591
Freq: D, dtype: float64
- Sum the values of s that fall on Wednesdays
# Monday is weekday 0
s[s.index.weekday == 2].sum()
0.7496805102635341
- Compute the monthly mean of the values in s
s.resample("M").mean()
2019-01-31 0.484289
Freq: M, dtype: float64
- Resample times in a Series (seconds to minutes)
s = pd.date_range("today", periods=10, freq="S")  # 10 timestamps starting now, at 1-second frequency
ts = pd.Series(np.random.randint(0,500,len(s)),index =s)
ts.resample("Min").sum()
2019-03-27 09:18:00 3192
Freq: T, dtype: int32
- UTC, the world time standard
s = pd.date_range("today", periods=1, freq="D")  # get the current time
ts = pd.Series(np.random.randn(len(s)), s)  # random values
ts_utc = ts.tz_localize("UTC")  # localize to UTC
ts_utc
2019-03-27 09:23:38.287795+00:00 1.236968
Freq: D, dtype: float64
- Convert to the Shanghai time zone
ts_utc.tz_convert("Asia/Shanghai")
2019-03-27 17:23:38.287795+08:00 1.236968
Freq: D, dtype: float64
- Convert between time representations
rng = pd.date_range("2018-1-1",periods=5,freq="M")
ts = pd.Series(np.random.randn(len(rng)),index = rng)
print(ts)
ps = ts.to_period()
print(ps)
ps.to_timestamp()
2018-01-31 -0.616241
2018-02-28 2.232648
2018-03-31 0.712430
2018-04-30 -0.249444
2018-05-31 1.265058
Freq: M, dtype: float64
2018-01 -0.616241
2018-02 2.232648
2018-03 0.712430
2018-04 -0.249444
2018-05 1.265058
Freq: M, dtype: float64
2018-01-01 -0.616241
2018-02-01 2.232648
2018-03-01 0.712430
2018-04-01 -0.249444
2018-05-01 1.265058
Freq: MS, dtype: float64
Series with a MultiIndex
- Create a multi-index Series
Build a multi-index Series with letters = ['A', 'B', 'C'] and numbers = list(range(10)) as the index and random numbers as the values.
letter = ["A","B","C"]
numbers = list(range(10))
mi = pd.MultiIndex.from_product([letter,numbers])  # build the MultiIndex
s = pd.Series(np.random.rand(30),index=mi)  # random values
print(mi)
print(s)
MultiIndex(levels=[['A', 'B', 'C'], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]],
labels=[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
A 0 0.891213
1 0.879980
2 0.685745
3 0.546840
4 0.251824
5 0.932508
6 0.575709
7 0.422894
8 0.942871
9 0.850407
B 0 0.023095
1 0.853654
2 0.835034
3 0.191593
4 0.486485
5 0.695431
6 0.782958
7 0.938840
8 0.108407
9 0.686810
C 0 0.686884
1 0.512558
2 0.087879
3 0.303796
4 0.272153
5 0.433170
6 0.476725
7 0.602703
8 0.166886
9 0.929835
dtype: float64
- Query a multi-index Series
s.loc[:,[1,3,6]]  # select the values whose inner index is 1, 3, or 6
A 1 0.879980
3 0.546840
6 0.575709
B 1 0.853654
3 0.191593
6 0.782958
C 1 0.512558
3 0.303796
6 0.476725
dtype: float64
- Slice a multi-index Series
s.loc[pd.IndexSlice[:"B",5:]]
A 5 0.932508
6 0.575709
7 0.422894
8 0.942871
9 0.850407
B 5 0.695431
6 0.782958
7 0.938840
8 0.108407
9 0.686810
dtype: float64
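The slicing above combines both index levels; `pd.IndexSlice` can also target a single level. A minimal, self-contained sketch (data is illustrative):

```python
import numpy as np
import pandas as pd

mi = pd.MultiIndex.from_product([["A", "B"], [0, 1, 2]])
s = pd.Series(np.arange(6), index=mi)

outer = s.loc["A"]                  # everything under outer label A
inner = s.loc[pd.IndexSlice[:, 1]]  # inner label 1 across all outer labels
```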
DataFrame with a MultiIndex
- Create a DataFrame with a MultiIndex
Create a DataFrame whose index is built from the letters ['A', 'B'] and the numbers 1-3.
frame = pd.DataFrame(np.arange(12).reshape(6,2),
index = [list("AAABBB"),list("123123")],
columns = ["hello","heitao"])
print(frame)
hello heitao
A 1 0 1
2 2 3
3 4 5
B 1 6 7
2 8 9
3 10 11
- Name the levels of a MultiIndex
frame.index.names = ["first","second"]
frame
| first | second | hello | heitao |
| --- | --- | --- | --- |
| A | 1 | 0 | 1 |
| | 2 | 2 | 3 |
| | 3 | 4 | 5 |
| B | 1 | 6 | 7 |
| | 2 | 8 | 9 |
| | 3 | 10 | 11 |
- Group a multi-index DataFrame by level and sum
a = frame.groupby("second").sum()
b = frame.groupby('first').sum()
print(a)
print(b)
hello heitao
second
1 6 8
2 10 12
3 14 16
hello heitao
first
A 6 9
B 24 27
- Stack DataFrame columns into row labels
print(frame)
frame.stack()
hello heitao
first second
A 1 0 1
2 2 3
3 4 5
B 1 6 7
2 8 9
3 10 11
first second
A 1 hello 0
heitao 1
2 hello 2
heitao 3
3 hello 4
heitao 5
B 1 hello 6
heitao 7
2 hello 8
heitao 9
3 hello 10
heitao 11
dtype: int32
- Unstack row labels into columns
print(frame)
frame.unstack()
hello heitao
first second
A 1 0 1
2 2 3
3 4 5
B 1 6 7
2 8 9
3 10 11
| | hello | hello | hello | heitao | heitao | heitao |
| --- | --- | --- | --- | --- | --- | --- |
| second | 1 | 2 | 3 | 1 | 2 | 3 |
| A | 0 | 2 | 4 | 1 | 3 | 5 |
| B | 6 | 8 | 10 | 7 | 9 | 11 |
- Conditional DataFrame lookup
# sample data
data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(data, index=labels)
df
| | animal | age | visits | priority |
| --- | --- | --- | --- | --- |
| a | cat | 2.5 | 1 | yes |
| b | cat | 3.0 | 3 | yes |
| c | snake | 0.5 | 2 | no |
| d | dog | NaN | 3 | yes |
| e | dog | 5.0 | 2 | no |
| f | cat | 2.0 | 3 | no |
| g | snake | 4.5 | 1 | no |
| h | cat | NaN | 1 | yes |
| i | dog | 7.0 | 2 | no |
| j | dog | 3.0 | 1 | no |
# find all rows where age is greater than 3
df[df["age"]>3]
| | animal | age | visits | priority |
| --- | --- | --- | --- | --- |
| e | dog | 5.0 | 2 | no |
| g | snake | 4.5 | 1 | no |
| i | dog | 7.0 | 2 | no |
- Slice by row and column position
# extract the rows at positions 2-3 and the columns at positions 1-2
df.iloc[2:4,1:3]
| | age | visits |
| --- | --- | --- |
| c | 0.5 | 2 |
| d | NaN | 3 |
- Query a DataFrame with multiple conditions
# find all rows where age < 3 and the animal is a cat
df = pd.DataFrame(data, index=labels)
df[(df['animal'] == 'cat') & (df['age'] < 3)]
| | animal | age | visits | priority |
| --- | --- | --- | --- | --- |
| a | cat | 2.5 | 1 | yes |
| f | cat | 2.0 | 3 | no |
- Query a DataFrame by keyword
df5[df5["animal"].isin(["cat","dog"])]
| | animal | age | visits | priority | No. |
| --- | --- | --- | --- | --- | --- |
| a | cat | 2.5 | 1 | yes | 0 |
| d | dog | NaN | 3 | yes | 3 |
| e | dog | 5.0 | 2 | no | 4 |
| f | cat | 1.5 | 3 | no | 5 |
| h | cat | NaN | 1 | yes | 7 |
| i | dog | 7.0 | 2 | no | 8 |
| j | dog | 3.0 | 1 | no | 9 |
- Query a DataFrame by row label and column name
df.loc[df2.index[[1,4,8]],["animal","age"]]
| | animal | age |
| --- | --- | --- |
| b | cat | 3.0 |
| e | dog | 5.0 |
| i | dog | 7.0 |
- Sort a DataFrame on multiple keys
# sort by age descending, then visits ascending
df.sort_values(by = ["age","visits"],ascending=[False,True])
| | animal | age | visits | priority |
| --- | --- | --- | --- | --- |
| i | dog | 7.0 | 2 | no |
| e | dog | 5.0 | 2 | no |
| g | snake | 4.5 | 1 | no |
| j | dog | 3.0 | 1 | no |
| b | cat | 3.0 | 3 | yes |
| a | cat | 2.5 | 1 | yes |
| f | cat | 2.0 | 3 | no |
| c | snake | 0.5 | 2 | no |
| h | cat | NaN | 1 | yes |
| d | dog | NaN | 3 | yes |
- Replace multiple DataFrame values
# replace yes with True and no with False in the priority column
df["priority"].map({"yes":True,"no":False})
a True
b True
c False
d True
e False
f False
g False
h True
i False
j False
Name: priority, dtype: bool
- Group a DataFrame and sum
df4.groupby("animal").sum()
| animal | age | visits | No. |
| --- | --- | --- | --- |
| 2 | 3.0 | 3 | 1 |
| cat | 4.0 | 5 | 12 |
| dog | 15.0 | 8 | 24 |
| snake | 5.0 | 3 | 8 |
- Concatenate several DataFrames from a list
temp_df1 = pd.DataFrame(np.random.randn(5, 4))  # DataFrame 1 of random values
temp_df2 = pd.DataFrame(np.random.randn(5, 4))  # DataFrame 2 of random values
temp_df3 = pd.DataFrame(np.random.randn(5, 4))  # DataFrame 3 of random values
print(temp_df1)
print(temp_df2)
print(temp_df3)
pieces = [temp_df1, temp_df2, temp_df3]
pd.concat(pieces)  # stack the three DataFrames vertically
0 1 2 3
0 -0.253063 0.954525 -1.172300 0.606517
1 0.463186 -0.906526 0.825088 0.962917
2 -0.713786 0.882766 -0.420237 -1.459033
3 0.403616 1.151884 -0.531666 -0.520436
4 0.470516 2.051418 -0.016450 -1.705573
0 1 2 3
0 0.603258 2.760330 0.057452 0.485384
1 0.105471 -1.443187 -1.700213 -0.575068
2 -0.105902 -1.095674 1.195388 0.628840
3 -0.924175 0.222047 0.784529 -2.560038
4 0.195358 0.912300 0.449167 0.401706
0 1 2 3
0 1.033782 -2.558680 0.560865 0.544706
1 0.191179 -1.389802 0.764125 -0.389027
2 -0.478880 0.187512 2.066756 -0.299723
3 1.023034 -0.049412 0.512549 0.310045
4 0.673487 0.911935 1.784224 -0.303885
- Find the column with the smallest sum
df = pd.DataFrame(np.random.random(size= (5,10)),columns = list("abcdefghij"))
print(df)
df.sum().idxmin()  # idxmax() / idxmin() are Series methods returning the index of the maximum / minimum
a b c d e f g \
0 0.278099 0.959842 0.706874 0.961625 0.212731 0.708068 0.593584
1 0.318883 0.605056 0.866065 0.358940 0.456914 0.435066 0.085913
2 0.336288 0.999118 0.859492 0.066227 0.459371 0.357916 0.791723
3 0.054229 0.213986 0.320105 0.019798 0.940241 0.485156 0.908761
4 0.402781 0.169654 0.526144 0.003792 0.043333 0.143780 0.150949
h i j
0 0.648064 0.383351 0.038961
1 0.111805 0.653432 0.870372
2 0.097711 0.946468 0.545376
3 0.900600 0.130202 0.347659
4 0.468224 0.122481 0.914841
'a'
- Subtract the row mean from every element of a DataFrame
df = pd.DataFrame(np.random.random(size = (5,3)))
print(df)
df.sub(df.mean(axis =1),axis =0)
0 1 2
0 0.271421 0.734225 0.643852
1 0.947422 0.413795 0.300381
2 0.564317 0.080039 0.202109
3 0.265286 0.594217 0.935334
4 0.483712 0.419904 0.448885
| | 0 | 1 | 2 |
| --- | --- | --- | --- |
| 0 | -0.278412 | 0.184392 | 0.094019 |
| 1 | 0.393556 | -0.140071 | -0.253485 |
| 2 | 0.282162 | -0.202116 | -0.080046 |
| 3 | -0.332993 | -0.004062 | 0.337055 |
| 4 | 0.032878 | -0.030930 | -0.001948 |
- Group a DataFrame and sum the three largest values in each group
df = pd.DataFrame({'A': list('aaabbcaabcccbbc'),
'B': [12, 345, 3, 1, 45, 14, 4, 52, 54, 23, 235, 21, 57, 3, 87]})
print(df)
A B
0 a 12
1 a 345
2 a 3
3 b 1
4 b 45
5 c 14
6 a 4
7 a 52
8 b 54
9 c 23
10 c 235
11 c 21
12 b 57
13 b 3
14 c 87
df.groupby("A")["B"].nlargest(3).sum(level =0)
A
a 409
b 156
c 345
Name: B, dtype: int64
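Note that `.sum(level=0)` is an older API (removed in pandas 2.x); the modern equivalent is `groupby(level=0).sum()`. A minimal sketch of the same pattern on smaller, illustrative data:

```python
import pandas as pd

df = pd.DataFrame({"A": list("aabb"), "B": [1, 5, 2, 9]})

# nlargest within each group yields a MultiIndex of (group, original row);
# grouping on level 0 then totals the kept values per group.
top = df.groupby("A")["B"].nlargest(1)
per_group = top.groupby(level=0).sum()
```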
Pivot tables
When analyzing a large dataset, a pivot table (pivot_table) helps reveal relationships between data features without altering the original data.
- Create a pivot table
df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 3,
'B': ['A', 'B', 'C'] * 4,
'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
'D': np.random.randn(12),
'E': np.random.randn(12)})
df
| | A | B | C | D | E |
| --- | --- | --- | --- | --- | --- |
| 0 | one | A | foo | -1.249303 | -1.321149 |
| 1 | one | B | foo | 0.970207 | -0.194715 |
| 2 | two | C | foo | 0.906078 | 0.210847 |
| 3 | three | A | bar | -0.671747 | 0.680809 |
| 4 | one | B | bar | -0.551172 | 0.121524 |
| 5 | one | C | bar | -1.764874 | -0.047319 |
| 6 | two | A | foo | -1.205649 | 0.253456 |
| 7 | three | B | foo | -1.348864 | -0.478615 |
| 8 | one | C | foo | 0.346110 | 1.625945 |
| 9 | one | A | bar | 1.684301 | 0.328802 |
| 10 | two | B | bar | -0.174703 | -0.938865 |
| 11 | three | C | bar | 0.584560 | 0.304935 |
pd.pivot_table(df,index = ["A","B"])
| A | B | D | E |
| --- | --- | --- | --- |
| one | A | 0.217499 | -0.496173 |
| | B | 0.209517 | -0.036595 |
| | C | -0.709382 | 0.789313 |
| three | A | -0.671747 | 0.680809 |
| | B | -1.348864 | -0.478615 |
| | C | 0.584560 | 0.304935 |
| two | A | -1.205649 | 0.253456 |
| | B | -0.174703 | -0.938865 |
| | C | 0.906078 | 0.210847 |
- Aggregate specified columns in a pivot table
Aggregate column D of this DataFrame, indexed by columns A and B; the default aggregation is the mean.
pd.pivot_table(df, values=['D'], index=['A', 'B'])
| A | B | D |
| --- | --- | --- |
| one | A | 0.217499 |
| | B | 0.209517 |
| | C | -0.709382 |
| three | A | -0.671747 |
| | B | -1.348864 |
| | C | 0.584560 |
| two | A | -1.205649 |
| | B | -0.174703 |
| | C | 0.906078 |
- Choose the pivot-table aggregation function
Above, column D was aggregated with the default mean; other aggregations can be specified via aggfunc.
pd.pivot_table(df,values = ["D"],index = ["A","B"],aggfunc = [np.sum,len])
| A | B | sum (D) | len (D) |
| --- | --- | --- | --- |
| one | A | 0.434998 | 2.0 |
| | B | 0.419035 | 2.0 |
| | C | -1.418764 | 2.0 |
| three | A | -0.671747 | 1.0 |
| | B | -1.348864 | 1.0 |
| | C | 0.584560 | 1.0 |
| two | A | -1.205649 | 1.0 |
| | B | -0.174703 | 1.0 |
| | C | 0.906078 | 1.0 |
- Split a pivot table on an extra column
When aggregating D by A and B, the influence of column C on D can be examined by passing C as columns.
pd.pivot_table(df,values = ["D"],index = ["A","B"],
columns = ["C"],aggfunc = np.sum)
| A | B | D (C=bar) | D (C=foo) |
| --- | --- | --- | --- |
| one | A | 1.684301 | -1.249303 |
| | B | -0.551172 | 0.970207 |
| | C | -1.764874 | 0.346110 |
| three | A | -0.671747 | NaN |
| | B | NaN | -1.348864 |
| | C | 0.584560 | NaN |
| two | A | NaN | -1.205649 |
| | B | -0.174703 | NaN |
| | C | NaN | 0.906078 |
- Handle missing values in a pivot table
Depending on the aggregation, some combinations may be missing from the pivot table; fill_value replaces those missing entries.
pd.pivot_table(df, values=['D'], index=['A', 'B'],
columns=['C'], aggfunc=np.sum, fill_value=0)
| A | B | D (C=bar) | D (C=foo) |
| --- | --- | --- | --- |
| one | A | 1.684301 | -1.249303 |
| | B | -0.551172 | 0.970207 |
| | C | -1.764874 | 0.346110 |
| three | A | -0.671747 | 0.000000 |
| | B | 0.000000 | -1.348864 |
| | C | 0.584560 | 0.000000 |
| two | A | 0.000000 | -1.205649 |
| | B | -0.174703 | 0.000000 |
| | C | 0.000000 | 0.906078 |
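The pivot-table parameters above can be exercised end to end on a tiny, illustrative dataset:

```python
import pandas as pd

df = pd.DataFrame({"A": ["one", "one", "two"],
                   "C": ["bar", "foo", "bar"],
                   "D": [1.0, 2.0, 3.0]})

# Rows indexed by A, columns split on C, D summed; absent combinations become 0.
pt = pd.pivot_table(df, values="D", index="A", columns="C",
                    aggfunc="sum", fill_value=0)
```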
Categorical type
Data broadly divides into quantitative values, whose range can vary, and qualitative values drawn from a fixed set of possibilities; categorical data is a kind of qualitative data.
df = pd.DataFrame({"id":[1,2,3,4,5,6],
"raw_grade":["a","b","b","a","a","e"]})
df["grade"] = df["raw_grade"].astype("category")
df
| | id | raw_grade | grade |
| --- | --- | --- | --- |
| 0 | 1 | a | a |
| 1 | 2 | b | b |
| 2 | 3 | b | b |
| 3 | 4 | a | a |
| 4 | 5 | a | a |
| 5 | 6 | e | e |
- Rename categorical values
df["grade"].cat.categories = ["very good","good","very bad"]
df
| | id | raw_grade | grade |
| --- | --- | --- | --- |
| 0 | 1 | a | very good |
| 1 | 2 | b | good |
| 2 | 3 | b | good |
| 3 | 4 | a | very good |
| 4 | 5 | a | very good |
| 5 | 6 | e | very bad |
- Reorder the categories and add the missing ones
df["grade"] = df["grade"].cat.set_categories(
["very bad", "bad", "medium", "good", "very good"])
df
| | id | raw_grade | grade |
| --- | --- | --- | --- |
| 0 | 1 | a | very good |
| 1 | 2 | b | good |
| 2 | 3 | b | good |
| 3 | 4 | a | very good |
| 4 | 5 | a | very good |
| 5 | 6 | e | very bad |
- Sort categorical data
df.sort_values(by = "grade")
| | id | raw_grade | grade |
| --- | --- | --- | --- |
| 5 | 6 | e | very bad |
| 1 | 2 | b | good |
| 2 | 3 | b | good |
| 0 | 1 | a | very good |
| 3 | 4 | a | very good |
| 4 | 5 | a | very good |
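The sort above follows the declared category order (very bad < ... < very good), not alphabetical order. The same idea in a minimal sketch, built with `pd.CategoricalDtype` (the data is illustrative):

```python
import pandas as pd

# An ordered categorical: comparisons and sorting follow the declared order,
# not alphabetical order.
dtype = pd.CategoricalDtype(["low", "medium", "high"], ordered=True)
s = pd.Series(["high", "low", "medium"]).astype(dtype)

ordered = s.sort_values()
```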
Data cleaning
The data we obtain often fails to meet our processing requirements: it contains missing values and bad records that must be cleaned first.
- Fill missing values by interpolation
FlightNumber has missing values; the numbers increase by 10, so fill in the missing entries accordingly and make the column int-typed.
df = pd.DataFrame({'From_To': ['LoNDon_paris', 'MAdrid_miLAN', 'londON_StockhOlm',
'Budapest_PaRis', 'Brussels_londOn'],
'FlightNumber': [10045, np.nan, 10065, np.nan, 10085],
'RecentDelays': [[23, 47], [], [24, 43, 87], [13], [67, 32]],
'Airline': ['KLM(!)', '<Air France> (12)', '(British Airways. )',
'12. Air France', '"Swiss Air"']})
df['FlightNumber'] = df['FlightNumber'].interpolate().astype(int)
df
| | From_To | FlightNumber | RecentDelays | Airline |
| --- | --- | --- | --- | --- |
| 0 | LoNDon_paris | 10045 | [23, 47] | KLM(!) |
| 1 | MAdrid_miLAN | 10055 | [] | <Air France> (12) |
| 2 | londON_StockhOlm | 10065 | [24, 43, 87] | (British Airways. ) |
| 3 | Budapest_PaRis | 10075 | [13] | 12. Air France |
| 4 | Brussels_londOn | 10085 | [67, 32] | "Swiss Air" |
- Split a data column
From_To should really be two independent columns, From and To; split From_To on the underscore into a new two-column table.
temp = df.From_To.str.split("_",expand = True)
temp.columns = ["From","To"]
temp
| | From | To |
| --- | --- | --- |
| 0 | LoNDon | paris |
| 1 | MAdrid | miLAN |
| 2 | londON | StockhOlm |
| 3 | Budapest | PaRis |
| 4 | Brussels | londOn |
- Standardize strings
Note that the place names are inconsistently cased (e.g. londON should be London); standardize them.
temp['From'] = temp['From'].str.capitalize()
temp['To'] = temp['To'].str.capitalize()
temp
| | From | To |
| --- | --- | --- |
| 0 | London | Paris |
| 1 | Madrid | Milan |
| 2 | London | Stockholm |
| 3 | Budapest | Paris |
| 4 | Brussels | London |
- Drop the bad column and join the cleaned data
Drop the original From_To column and join the cleaned From and To columns.
df = df.drop("From_To",axis=1)
df = df.join(temp)
df
| | FlightNumber | RecentDelays | Airline | From | To |
| --- | --- | --- | --- | --- | --- |
| 0 | 10045 | [23, 47] | KLM(!) | London | Paris |
| 1 | 10055 | [] | <Air France> (12) | Madrid | Milan |
| 2 | 10065 | [24, 43, 87] | (British Airways. ) | London | Stockholm |
| 3 | 10075 | [13] | 12. Air France | Budapest | Paris |
| 4 | 10085 | [67, 32] | "Swiss Air" | Brussels | London |
- Strip extraneous characters
Many entries in the Airline column contain extra characters that would significantly hamper later analysis; this kind of data needs to be cleaned up.
df['Airline'] = df['Airline'].str.extract(
'([a-zA-Z\s]+)', expand=False).str.strip()
df
| | FlightNumber | RecentDelays | Airline | From | To |
| --- | --- | --- | --- | --- | --- |
| 0 | 10045 | [23, 47] | KLM | London | Paris |
| 1 | 10055 | [] | Air France | Madrid | Milan |
| 2 | 10065 | [24, 43, 87] | British Airways | London | Stockholm |
| 3 | 10075 | [13] | Air France | Budapest | Paris |
| 4 | 10085 | [67, 32] | Swiss Air | Brussels | London |
- Normalize the layout
RecentDelays stores lists of varying length, which complicates later analysis. Expand the lists so that elements at the same position form a column, using NaN where a value is absent.
delays = df["RecentDelays"].apply(pd.Series)
delays.columns = ["delay_{}".format(n) for n in range(1,len(delays.columns)+1)]
delays
| | delay_1 | delay_2 | delay_3 |
| --- | --- | --- | --- |
| 0 | 23.0 | 47.0 | NaN |
| 1 | NaN | NaN | NaN |
| 2 | 24.0 | 43.0 | 87.0 |
| 3 | 13.0 | NaN | NaN |
| 4 | 67.0 | 32.0 | NaN |
df = df.drop("RecentDelays",axis =1).join(delays)
print(df)
FlightNumber Airline From To delay_1 delay_2 \
0 10045 KLM London Paris 23.0 47.0
1 10055 Air France Madrid Milan NaN NaN
2 10065 British Airways London Stockholm 24.0 43.0
3 10075 Air France Budapest Paris 13.0 NaN
4 10085 Swiss Air Brussels London 67.0 32.0
delay_3
0 NaN
1 NaN
2 87.0
3 NaN
4 NaN
Data preprocessing
- Binning values
Math grades for part of a class are shown in the table below.
df=pd.DataFrame({'name':['Alice','Bob','Candy','Dany','Ella','Frank','Grace','Jenny'],'grades':[58,83,79,65,93,45,61,88]})
df
| | name | grades |
| --- | --- | --- |
| 0 | Alice | 58 |
| 1 | Bob | 83 |
| 2 | Candy | 79 |
| 3 | Dany | 65 |
| 4 | Ella | 93 |
| 5 | Frank | 45 |
| 6 | Grace | 61 |
| 7 | Jenny | 88 |
What we care about more is whether each student passed, so recode the grades by whether they are above 60.
df = pd.DataFrame({'name': ['Alice', 'Bob', 'Candy', 'Dany', 'Ella',
'Frank', 'Grace', 'Jenny'], 'grades': [58, 83, 79, 65, 93, 45, 61, 88]})
def choice(x):
if x > 60:
return 1
else:
return 0
df.grades = pd.Series(map(lambda x: choice(x), df.grades))
df
| | name | grades |
| --- | --- | --- |
| 0 | Alice | 0 |
| 1 | Bob | 1 |
| 2 | Candy | 1 |
| 3 | Dany | 1 |
| 4 | Ella | 1 |
| 5 | Frank | 0 |
| 6 | Grace | 1 |
| 7 | Jenny | 1 |
- Deduplicate data
# given a DataFrame with a column A, shown below, remove the consecutively repeated values from column A
df = pd.DataFrame({'A': [1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7]})
df
| | A |
| --- | --- |
| 0 | 1 |
| 1 | 2 |
| 2 | 2 |
| 3 | 3 |
| 4 | 4 |
| 5 | 5 |
| 6 | 5 |
| 7 | 5 |
| 8 | 6 |
| 9 | 7 |
| 10 | 7 |
df = pd.DataFrame({'A': [1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7]})
df.loc[df['A'].shift() != df['A']]
| | A |
| --- | --- |
| 0 | 1 |
| 1 | 2 |
| 3 | 3 |
| 4 | 4 |
| 5 | 5 |
| 8 | 6 |
| 9 | 7 |
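How the `shift` trick works deserves a closer look; note it removes only consecutive duplicates, not all duplicates. A minimal sketch with illustrative data containing a repeated-but-not-adjacent value:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 2, 3, 3, 3, 1]})

# shift() moves each value down one row, so the comparison keeps only rows
# that differ from their immediate predecessor. The trailing 1 survives:
# it is a duplicate overall but not a consecutive one.
deduped = df.loc[df["A"].shift() != df["A"]]
```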
- Normalize data
Sometimes the value ranges of different DataFrame columns differ too much, and the data needs to be normalized. Max-Min normalization is a simple and common choice, with the formula:
Y = (X - Xmin) / (Xmax - Xmin)
def normalization(df):
numerator = df.sub(df.min())
denominator = (df.max()).sub(df.min())
Y = numerator.div(denominator)
return Y
df = pd.DataFrame(np.random.random(size=(5, 3)))
print(df)
normalization(df)
0 1 2
0 0.785455 0.116930 0.317938
1 0.723757 0.272358 0.961036
2 0.032597 0.493209 0.258063
3 0.515626 0.771451 0.379079
4 0.094327 0.641445 0.968139
| | 0 | 1 | 2 |
| --- | --- | --- | --- |
| 0 | 1.000000 | 0.000000 | 0.084322 |
| 1 | 0.918048 | 0.237469 | 0.989997 |
| 2 | 0.000000 | 0.574892 | 0.000000 |
| 3 | 0.641593 | 1.000000 | 0.170427 |
| 4 | 0.081994 | 0.801372 | 1.000000 |
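The normalization function can be verified on fixed data, where the expected result is known in advance (the helper name here is illustrative):

```python
import pandas as pd

def min_max(df):
    # Column-wise Max-Min scaling: (X - Xmin) / (Xmax - Xmin)
    return (df - df.min()) / (df.max() - df.min())

# Two columns with very different scales both map onto [0, 1]
df = pd.DataFrame({"a": [0.0, 5.0, 10.0], "b": [100.0, 150.0, 200.0]})
scaled = min_max(df)
```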
Plotting with Pandas
To better understand the information a dataset contains, the most intuitive approach is to plot it.
- Series line plot
%matplotlib inline
ts = pd.Series(np.random.randn(100), index=pd.date_range('today', periods=100))
ts = ts.cumsum()
ts.plot()
- DataFrame line plot
df = pd.DataFrame(np.random.randn(100, 4), index=ts.index,
columns=['A', 'B', 'C', 'D'])
df = df.cumsum()
df.plot()
- DataFrame scatter plot
df = pd.DataFrame({"xs": [1, 5, 2, 8, 1], "ys": [4, 2, 1, 9, 6]})
df = df.cumsum()
df.plot.scatter("xs", "ys", color='red', marker="*")
- DataFrame bar chart
df = pd.DataFrame({"revenue": [57, 68, 63, 71, 72, 90, 80, 62, 59, 51, 47, 52],
"advertising": [2.1, 1.9, 2.7, 3.0, 3.6, 3.2, 2.7, 2.4, 1.8, 1.6, 1.3, 1.9],
"month": range(12)
})
ax = df.plot.bar("month", "revenue", color="yellow")
df.plot("month", "advertising", secondary_y=True, ax=ax)
This article is mainly a record of my learning process. If it infringes any rights, please contact the author for removal. Original source: https://www.shiyanlou.com/courses/