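- In many applications, data is spread across several files or databases, or is stored in a form that is inconvenient to analyze. This chapter covers methods for combining, joining, and reshaping data.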
import pandas as pd
import numpy as np
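8.1 Hierarchical Indexing
- Hierarchical indexing lets you have multiple (two or more) index levels on a single axis. Put more abstractly, it lets you work with higher-dimensional data in a lower-dimensional form.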
data = pd.Series(np.random.randn(9), index = [['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
[1, 2, 3, 1, 3, 1, 2, 2, 3]])
data
a 1 1.851062
2 0.498509
3 1.552038
b 1 0.839059
3 -0.765026
c 1 -1.431162
2 -1.587057
d 2 -1.012728
3 1.318710
dtype: float64
data.index
MultiIndex(levels=[['a', 'b', 'c', 'd'], [1, 2, 3]],
labels=[[0, 0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 2, 0, 1, 1, 2]])
data['a']
1 1.851062
2 0.498509
3 1.552038
dtype: float64
data['b':'c']
b 1 0.839059
3 -0.765026
c 1 -1.431162
2 -1.587057
dtype: float64
data.loc[['b','d']]
b 1 0.839059
3 -0.765026
d 2 -1.012728
3 1.318710
dtype: float64
data[:,2]
a 0.498509
c -1.587057
d -1.012728
dtype: float64
data['a'][2]
0.4985087352895496
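The selections above can also be written with explicit label-based accessors, which makes the targeted level unambiguous. This is a minimal sketch that assumes fixed values instead of the random data used above; the names `s`, `outer`, `inner`, `inner_xs`, and `single` are illustrative only.

```python
import numpy as np
import pandas as pd

# Same two-level index as `data`, but with fixed values so results are reproducible.
s = pd.Series(
    np.arange(9.0),
    index=[["a", "a", "a", "b", "b", "c", "c", "d", "d"],
           [1, 2, 3, 1, 3, 1, 2, 2, 3]],
)

outer = s.loc["a"]            # partial indexing on the outer level
inner = s.loc[:, 2]           # every outer label, inner label 2
inner_xs = s.xs(2, level=1)   # the same selection expressed with xs
single = s.loc[("a", 2)]      # one scalar, addressed by a full key tuple

print(inner)
```

Passing the full key as a tuple avoids the chained lookup `data['a'][2]`, which first builds an intermediate Series.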
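- Hierarchical indexing plays an important role in reshaping data and in group-based operations such as building pivot tables. For example, the unstack method rearranges this data into a DataFrame: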
data.unstack()
          1         2         3
a  1.851062  0.498509  1.552038
b  0.839059       NaN -0.765026
c -1.431162 -1.587057       NaN
d       NaN -1.012728  1.318710
data.unstack().stack()
a 1 1.851062
2 0.498509
3 1.552038
b 1 0.839059
3 -0.765026
c 1 -1.431162
2 -1.587057
d 2 -1.012728
3 1.318710
dtype: float64
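By default unstack moves the innermost index level into the columns, and a level can also be chosen explicitly. The sketch below assumes fixed values in place of the random data; `s`, `wide_inner`, and `wide_outer` are illustrative names.

```python
import numpy as np
import pandas as pd

s = pd.Series(
    np.arange(9.0),
    index=[["a", "a", "a", "b", "b", "c", "c", "d", "d"],
           [1, 2, 3, 1, 3, 1, 2, 2, 3]],
)

wide_inner = s.unstack()          # default: innermost level becomes the columns
wide_outer = s.unstack(level=0)   # outer level becomes the columns instead

# Label combinations that are absent from the Series show up as NaN.
missing = int(wide_inner.isna().sum().sum())

print(wide_inner.shape, wide_outer.shape, missing)
```

The three missing cells correspond to the combinations (b, 2), (c, 3), and (d, 1), which do not occur in the index.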
frame = pd.DataFrame(np.arange(12).reshape((4,3)),
index = [['a','a','b','b'],[1,2,1,2]],
columns = [['Ohio', 'Ohio', 'Colorado'],['Green', 'Red', 'Green']])
frame
     Ohio     Colorado
    Green Red    Green
a 1     0   1        2
  2     3   4        5
b 1     6   7        8
  2     9  10       11
- Each level can have a name (a string or any other Python object). If names are set, they appear in the console output:
frame.index.names = ['key1','key2']
frame.columns.names = ['state','color']
frame
state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11
frame['Ohio']
color      Green  Red
key1 key2
a    1         0    1
     2         3    4
b    1         6    7
     2         9   10
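Reordering and Sorting Levels
- Sometimes you need to rearrange the order of the levels on an axis, or sort the data by the values in one particular level. swaplevel takes two level numbers or names and returns a new object with those levels interchanged (the data itself is otherwise unchanged):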
frame
state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11
frame.swaplevel('key1', 'key2')
state      Ohio     Colorado
color     Green Red    Green
key2 key1
1    a        0   1        2
2    a        3   4        5
1    b        6   7        8
2    b        9  10       11
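- sort_index sorts the data using the values in a single level. When swapping levels, it is common to call sort_index as well, so that the result is lexicographically sorted by the indicated level: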
frame.sort_index(level=1)
state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
b    1        6   7        8
a    2        3   4        5
b    2        9  10       11
frame.swaplevel(0,1).sort_index(level=0)
state      Ohio     Colorado
color     Green Red    Green
key2 key1
1    a        0   1        2
     b        6   7        8
2    a        3   4        5
     b        9  10       11
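One way to see why sort_index usually follows swaplevel is to check whether the row index is in sorted order before and after. The sketch below rebuilds the same frame; `swapped` and `sorted_frame` are illustrative names.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=[["a", "a", "b", "b"], [1, 2, 1, 2]],
    columns=[["Ohio", "Ohio", "Colorado"], ["Green", "Red", "Green"]],
)
frame.index.names = ["key1", "key2"]

# swaplevel only relabels the rows; their physical order stays the same.
swapped = frame.swaplevel("key1", "key2")
print(swapped.index.is_monotonic_increasing)

# Sorting on the new outer level restores a lexicographically ordered index.
sorted_frame = swapped.sort_index(level=0)
print(sorted_frame.index.is_monotonic_increasing)
```

Selection on a hierarchical index tends to perform better when the index is sorted starting from the outermost level, which is one practical reason to sort after swapping.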
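Summary Statistics by Level
- Many descriptive and summary statistics on DataFrame and Series have a level option, which specifies the level on a given axis to aggregate by: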
frame.sum(level=1)
state  Ohio     Colorado
color Green Red    Green
key2
1         6   8       10
2        12  14       16
frame.sum(level=0)
state  Ohio     Colorado
color Green Red    Green
key1
a         3   5        7
b        15  17       19
frame.sum(level=1,axis=1)
color      Green  Red
key1 key2
a    1         2    1
     2         8    4
b    1        14    7
     2        20   10
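The level argument to sum shown above comes from older pandas releases; to the best of our knowledge it was deprecated and then removed in pandas 2.0. The sketch below shows an equivalent formulation with groupby that should give the same numbers on both old and new versions. Transposing before grouping the columns is a choice made here to avoid relying on groupby's axis argument.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=[["a", "a", "b", "b"], [1, 2, 1, 2]],
    columns=[["Ohio", "Ohio", "Colorado"], ["Green", "Red", "Green"]],
)
frame.index.names = ["key1", "key2"]
frame.columns.names = ["state", "color"]

# Row-wise aggregation by an index level (same numbers as frame.sum(level=1)).
by_key2 = frame.groupby(level="key2").sum()

# Column-wise aggregation by a column level (same numbers as
# frame.sum(level=1, axis=1)): transpose, group the rows, transpose back.
by_color = frame.T.groupby(level="color").sum().T

print(by_key2)
print(by_color)
```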
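Indexing with a DataFrame's Columns
- You may want to use one or more of a DataFrame's columns as the row index, or conversely move the row index into the DataFrame's columns: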
frame = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1),
'c': ['one', 'one', 'one', 'two', 'two','two', 'two'],
'd': [0, 1, 2, 0, 1, 2, 3]})
frame
   a  b    c  d
0  0  7  one  0
1  1  6  one  1
2  2  5  one  2
3  3  4  two  0
4  4  3  two  1
5  5  2  two  2
6  6  1  two  3
frame2 = frame.set_index(['c','d'])
frame2
       a  b
c   d
one 0  0  7
    1  1  6
    2  2  5
two 0  3  4
    1  4  3
    2  5  2
    3  6  1
frame.set_index(['c','d'],drop=False)
       a  b    c  d
c   d
one 0  0  7  one  0
    1  1  6  one  1
    2  2  5  one  2
two 0  3  4  two  0
    1  4  3  two  1
    2  5  2  two  2
    3  6  1  two  3
- reset_index() does the opposite of set_index(): the hierarchical index levels are moved into the columns:
frame2.reset_index()
     c  d  a  b
0  one  0  0  7
1  one  1  1  6
2  one  2  2  5
3  two  0  3  4
4  two  1  4  3
5  two  2  5  2
6  two  3  6  1
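reset_index does not have to move every level: its level argument selects which ones go back into the columns. A minimal sketch on the same data, with `indexed`, `partial`, and `restored` as illustrative names:

```python
import pandas as pd

frame = pd.DataFrame({
    "a": range(7),
    "b": range(7, 0, -1),
    "c": ["one", "one", "one", "two", "two", "two", "two"],
    "d": [0, 1, 2, 0, 1, 2, 3],
})

indexed = frame.set_index(["c", "d"])

# Move only level "d" back into the columns; "c" stays as the row index.
partial = indexed.reset_index(level="d")

# Move every level back; the former index levels become the leading columns.
restored = indexed.reset_index()

print(partial.columns.tolist(), partial.index.name)
print(restored.columns.tolist())
```

After a full round trip the values match the original frame, but the column order differs because the former index levels are placed first.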
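8.2 Combining and Merging Datasets
Data contained in pandas objects can be combined in a number of ways:
- pandas.merge connects rows in different DataFrames based on one or more keys. It implements database join operations.
- pandas.concat stacks multiple objects together along an axis.
- The combine_first instance method splices together overlapping data, filling missing values in one object with values from another.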
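The rest of this section concentrates on merge, so here is a brief sketch of the other two operations from the list. The Series and their values are made up for illustration.

```python
import numpy as np
import pandas as pd

# concat: glue objects together along an axis (axis=0 by default).
s1 = pd.Series([0, 1], index=["a", "b"])
s2 = pd.Series([2, 3, 4], index=["c", "d", "e"])
stacked = pd.concat([s1, s2])

# combine_first: keep the caller's values and fill its gaps from the argument.
a = pd.Series([np.nan, 2.5, np.nan], index=["x", "y", "z"])
b = pd.Series([1.0, 9.0, 3.0], index=["x", "y", "z"])
patched = a.combine_first(b)

print(stacked)
print(patched)
```

In `patched`, the value at "y" comes from `a` because it was not missing there; only the gaps at "x" and "z" are filled from `b`.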
pandas.merge()
- Merge or join operations combine datasets by linking rows using one or more keys. These operations are central to relational (SQL-based) databases.
df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],'data1': range(7)})
df2 = pd.DataFrame({'key':['a','b','c'],'data2':range(3)})
df1
  key  data1
0   b      0
1   b      1
2   a      2
3   c      3
4   a      4
5   a      5
6   b      6
df2
  key  data2
0   a      0
1   b      1
2   c      2
pd.merge(df1,df2)
  key  data1  data2
0   b      0      1
1   b      1      1
2   b      6      1
3   a      2      0
4   a      4      0
5   a      5      0
6   c      3      2
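- The call above does not specify which column to join on; by default merge uses the overlapping column names as the keys. It is good practice to specify the key explicitly: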
pd.merge(df1,df2,on='key')
  key  data1  data2
0   b      0      1
1   b      1      1
2   b      6      1
3   a      2      0
4   a      4      0
5   a      5      0
6   c      3      2
df3 = pd.DataFrame({'lkey': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
df4 = pd.DataFrame({'rkey': ['a', 'b', 'd'], 'data2': range(3)})
df3
  lkey  data1
0    b      0
1    b      1
2    a      2
3    c      3
4    a      4
5    a      5
6    b      6
df4
  rkey  data2
0    a      0
1    b      1
2    d      2
pd.merge(df3,df4,left_on = 'lkey', right_on = 'rkey')
  lkey  data1 rkey  data2
0    b      0    b      1
1    b      1    b      1
2    b      6    b      1
3    a      2    a      0
4    a      4    a      0
5    a      5    a      0
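- In the result above, the 'c' and 'd' values and their associated data are missing. By default merge performs an inner join; the keys in the result are the intersection of the keys found in both tables.
- The other options are 'left', 'right', and 'outer'. An outer join takes the union of the keys, combining the effect of applying both left and right joins: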
pd.merge(df1, df2, how='outer')
  key  data1  data2
0   b      0      1
1   b      1      1
2   b      6      1
3   a      2      0
4   a      4      0
5   a      5      0
6   c      3      2
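Because df1 and df2 both contain exactly the keys a, b, and c, the outer join above returns the same rows as the inner join. The difference between the join types only shows up when the key sets differ. The sketch below uses two small made-up frames (`left_df` and `right_df` are hypothetical names) and the indicator option of merge to label where each row came from.

```python
import pandas as pd

left_df = pd.DataFrame({"key": ["b", "a", "c"], "lval": [0, 1, 2]})
right_df = pd.DataFrame({"key": ["a", "b", "d"], "rval": [10, 11, 12]})

inner = pd.merge(left_df, right_df, on="key", how="inner")        # keys in both
left_join = pd.merge(left_df, right_df, on="key", how="left")     # all left keys
right_join = pd.merge(left_df, right_df, on="key", how="right")   # all right keys

# indicator=True adds a "_merge" column: "both", "left_only" or "right_only".
outer = pd.merge(left_df, right_df, on="key", how="outer", indicator=True)

print(outer)
```

Rows that exist on only one side get NaN in the columns contributed by the other side.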

- To merge on multiple keys, pass a list of column names:
left = pd.DataFrame({'key1': ['foo', 'foo', 'bar'],
'key2': ['one', 'two', 'one'],
'lval': [1, 2, 3]})
right = pd.DataFrame({'key1': ['foo', 'foo', 'bar', 'bar'],
'key2': ['one', 'one', 'one', 'two'],
'rval': [4, 5, 6, 7]})
pd.merge(left,right,on = ['key1','key2'],how='outer')
  key1 key2  lval  rval
0  foo  one   1.0   4.0
1  foo  one   1.0   5.0
2  foo  two   2.0   NaN
3  bar  one   3.0   6.0
4  bar  two   NaN   7.0
left
  key1 key2  lval
0  foo  one     1
1  foo  two     2
2  bar  one     3
right
  key1 key2  rval
0  foo  one     4
1  foo  one     5
2  bar  one     6
3  bar  two     7
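- A remaining issue in merge operations is how to treat overlapping column names. merge has a suffixes option for specifying the strings to append to overlapping column names in the left and right DataFrame objects: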
pd.merge(left, right, on = 'key1')
  key1 key2_x  lval key2_y  rval
0  foo    one     1    one     4
1  foo    one     1    one     5
2  foo    two     2    one     4
3  foo    two     2    one     5
4  bar    one     3    one     6
5  bar    one     3    two     7
pd.merge(left, right,on='key1',suffixes=('_left','_right'))
  key1 key2_left  lval key2_right  rval
0  foo       one     1        one     4
1  foo       one     1        one     5
2  foo       two     2        one     4
3  foo       two     2        one     5
4  bar       one     3        one     6
5  bar       one     3        two     7
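The six rows in the two results above follow from how merge handles repeated keys: for each key value, every matching left row is paired with every matching right row. The sketch below recomputes that count by hand; `expected_rows` is an illustrative helper variable, not part of pandas.

```python
import pandas as pd

left = pd.DataFrame({"key1": ["foo", "foo", "bar"],
                     "key2": ["one", "two", "one"],
                     "lval": [1, 2, 3]})
right = pd.DataFrame({"key1": ["foo", "foo", "bar", "bar"],
                      "key2": ["one", "one", "one", "two"],
                      "rval": [4, 5, 6, 7]})

merged = pd.merge(left, right, on="key1")

# For an inner join, each shared key contributes
# (number of left rows) * (number of right rows) result rows.
shared_keys = set(left["key1"]) & set(right["key1"])
expected_rows = int(sum(
    (left["key1"] == k).sum() * (right["key1"] == k).sum()
    for k in shared_keys
))

print(len(merged), expected_rows)
```

Here "foo" contributes 2 x 2 rows and "bar" contributes 1 x 2 rows, which is why joining on a non-unique key can grow the result quickly.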
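- The merge arguments used in this section are: how (the join type: 'inner', 'left', 'right', or 'outer'; the default is 'inner'), on (column names to join on, which must exist in both objects), left_on and right_on (the join-key columns in the left and right object respectively), left_index and right_index (use the row index of that object as its join key), and suffixes (strings appended to overlapping column names; ('_x', '_y') by default).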

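Merging on Index
- In some cases, the merge key(s) in a DataFrame are found in its index. In this case, you can pass left_index=True or right_index=True (or both) to indicate that the index should be used as the merge key: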
left1 = pd.DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'], 'value': range(6)})
right1 = pd.DataFrame({'group_val': [3.5, 7]}, index=['a', 'b'])
left1
  key  value
0   a      0
1   b      1
2   a      2
3   a      3
4   b      4
5   c      5
right1
   group_val
a        3.5
b        7.0
pd.merge(left1,right1,left_on='key',right_index=True)
  key  value  group_val
0   a      0        3.5
2   a      2        3.5
3   a      3        3.5
1   b      1        7.0
4   b      4        7.0
- Because the default merge method intersects the join keys, you can instead form their union with an outer join:
pd.merge(left1,right1,left_on='key',right_index=True,how='outer')
  key  value  group_val
0   a      0        3.5
2   a      2        3.5
3   a      3        3.5
1   b      1        7.0
4   b      4        7.0
5   c      5        NaN
lefth = pd.DataFrame({'key1': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
'key2': [2000, 2001, 2002, 2001, 2002],
'data': np.arange(5.)})
righth = pd.DataFrame(np.arange(12).reshape((6, 2)),
index=[['Nevada', 'Nevada', 'Ohio', 'Ohio', 'Ohio', 'Ohio'],
[2001, 2000, 2000, 2000, 2001, 2002]],
columns=['event1', 'event2'])
lefth
     key1  key2  data
0    Ohio  2000   0.0
1    Ohio  2001   1.0
2    Ohio  2002   2.0
3  Nevada  2001   3.0
4  Nevada  2002   4.0
righth
             event1  event2
Nevada 2001       0       1
       2000       2       3
Ohio   2000       4       5
       2000       6       7
       2001       8       9
       2002      10      11
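- With hierarchically indexed data like this, you have to indicate the multiple columns to merge on as a list (note how duplicate index values are handled with how='outer'):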
pd.merge(lefth, righth, left_on=['key1','key2'], right_index=True)
     key1  key2  data  event1  event2
0    Ohio  2000   0.0       4       5
0    Ohio  2000   0.0       6       7
1    Ohio  2001   1.0       8       9
2    Ohio  2002   2.0      10      11
3  Nevada  2001   3.0       0       1
pd.merge(lefth, righth, left_on=['key1','key2'], right_index=True,how='outer')
     key1  key2  data  event1  event2
0    Ohio  2000   0.0     4.0     5.0
0    Ohio  2000   0.0     6.0     7.0
1    Ohio  2001   1.0     8.0     9.0
2    Ohio  2002   2.0    10.0    11.0
3  Nevada  2001   3.0     0.0     1.0
4  Nevada  2002   4.0     NaN     NaN
4  Nevada  2000   NaN     2.0     3.0
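An alternative to right_index=True is to turn the hierarchical index into ordinary columns first and then merge on column names. This is a sketch under the assumption that the index levels are given the names key1 and key2 (the original righth has unnamed levels); `righth_cols` and `merged` are illustrative names.

```python
import numpy as np
import pandas as pd

lefth = pd.DataFrame({"key1": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada"],
                      "key2": [2000, 2001, 2002, 2001, 2002],
                      "data": np.arange(5.)})
righth = pd.DataFrame(np.arange(12).reshape((6, 2)),
                      index=[["Nevada", "Nevada", "Ohio", "Ohio", "Ohio", "Ohio"],
                             [2001, 2000, 2000, 2000, 2001, 2002]],
                      columns=["event1", "event2"])

# Assumption: name the levels so that reset_index yields columns that
# line up with the key columns of lefth.
righth.index.names = ["key1", "key2"]
righth_cols = righth.reset_index()

merged = pd.merge(lefth, righth_cols, on=["key1", "key2"])
print(merged)
```

The matched rows are the same as with left_on plus right_index=True, but the result gets a fresh default row index instead of repeating labels from lefth.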
left2 = pd.DataFrame([[1., 2.], [3., 4.], [5., 6.]],
index=['a', 'c', 'e'],
columns=['Ohio', 'Nevada'])
right2 = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [13, 14]],
index=['b', 'c', 'd', 'e'],
columns=['Missouri', 'Alabama'])
left2
   Ohio  Nevada
a   1.0     2.0
c   3.0     4.0
e   5.0     6.0
right2
   Missouri  Alabama
b       7.0      8.0
c       9.0     10.0
d      11.0     12.0
e      13.0     14.0
pd.merge(left2, right2, how='outer', left_index=True, right_index=True)
   Ohio  Nevada  Missouri  Alabama
a   1.0     2.0       NaN      NaN
b   NaN     NaN       7.0      8.0
c   3.0     4.0       9.0     10.0
d   NaN     NaN      11.0     12.0
e   5.0     6.0      13.0     14.0
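- DataFrame also has a convenient join instance method for merging by index. It can also be used to combine several DataFrame objects that have the same or similar indexes but non-overlapping columns. In the previous example, we could have written: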
left2.join(right2,how='outer')
   Ohio  Nevada  Missouri  Alabama
a   1.0     2.0       NaN      NaN
b   NaN     NaN       7.0      8.0
c   3.0     4.0       9.0     10.0
d   NaN     NaN      11.0     12.0
e   5.0     6.0      13.0     14.0
right2.join(left2,how='outer')
   Missouri  Alabama  Ohio  Nevada
a       NaN      NaN   1.0     2.0
b       7.0      8.0   NaN     NaN
c       9.0     10.0   3.0     4.0
d      11.0     12.0   NaN     NaN
e      13.0     14.0   5.0     6.0
left1.join(right1, on='key')
  key  value  group_val
0   a      0        3.5
1   b      1        7.0
2   a      2        3.5
3   a      3        3.5
4   b      4        7.0
5   c      5        NaN
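Note that left1.join(right1, on='key') performs a left join by default, which is why the row for key c is kept with NaN. The requirement that columns must not overlap can be relaxed with the lsuffix and rsuffix arguments of join. The sketch below uses two small made-up frames, `a` and `b`, that share a column name.

```python
import pandas as pd

a = pd.DataFrame({"v": [1, 2]}, index=["x", "y"])
b = pd.DataFrame({"v": [3, 4]}, index=["y", "z"])

# Without suffixes, join refuses to combine frames whose column names collide.
try:
    a.join(b)
    overlap_error = None
except ValueError as exc:
    overlap_error = str(exc)

# With suffixes, the colliding columns are renamed and the join succeeds.
joined = a.join(b, lsuffix="_l", rsuffix="_r", how="outer")

print(overlap_error)
print(joined)
```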