Learning pandas (Creating Data, Basic Operations)

Learning pandas (Part 1)

Basic pandas data structures

Series data

DataFrame type

Basic operations

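Basic pandas data structures

The two most commonly used data structures:
Series
A one-dimensional array, similar to a one-dimensional NumPy array; both resemble Python's built-in list. A Series can store strings, booleans, numbers, and other values; when the values are of mixed types, the Series falls back to the generic object dtype.

DataFrame
A two-dimensional, table-like data structure. Many of its features resemble R's data frame. A DataFrame can be thought of as a container of Series.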
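
To make the two descriptions above concrete, here is a minimal sketch. The variable names (mixed, df, col) and the sample values are my own and only serve as illustration.

```python
import pandas as pd

# A Series holding values of different Python types falls back to dtype "object".
mixed = pd.Series(["text", True, 3.14])

# A DataFrame behaves like a dict of Series that share one index.
df = pd.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})
col = df["x"]  # selecting a single column gives back a Series

print(mixed.dtype)
print(type(col).__name__)
```

Selecting one column returns a Series, which is what "a container of Series" means in practice.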

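Series data

By default the index consists of integers (a different index can be supplied with an extra argument), and all values share a single dtype.
The first argument is a one-dimensional array: you can pass a list or generate the values with NumPy.

Initializing a Series

The first way is to generate the data with NumPy.

import numpy as np
import pandas as pd
s = pd.Series(np.arange(0, 6))
print(s)

Note: no index was specified, so the default index starts at zero.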

The second way is to pass a list.

import numpy as np
import pandas as pd
s = pd.Series([1, 3, 6, np.nan, 44, 1])  # np.nan marks a missing value
print(s)
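
One side effect of np.nan is worth seeing: because NaN is a floating-point value, a list of integers that contains it is stored as float64. A small sketch, with variable names (ints, with_nan) that are mine:

```python
import numpy as np
import pandas as pd

ints = pd.Series([1, 3, 6])
with_nan = pd.Series([1, 3, 6, np.nan, 44, 1])

print(ints.dtype)              # an integer dtype
print(with_nan.dtype)          # float64, because NaN is a float
print(with_nan.isna().sum())   # number of missing values
```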

Adding row labels

import numpy as np
import pandas as pd
s = pd.Series([1, 3, 6, np.nan, 44, 1], index=['b', 'f', 'e', 'd', 'a', 'c'])
print(s)
print(s.index)      # row labels
print(s.values)     # all values
print(s.iloc[0])    # access by position
print(s.iloc[1:4])  # access by positional slice

b     1.0
f     3.0
e     6.0
d     NaN
a    44.0
c     1.0
dtype: float64

Index(['b', 'f', 'e', 'd', 'a', 'c'], dtype='object')

[ 1.  3.  6. nan 44.  1.]
1.0

f    3.0
e    6.0
d    NaN
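
Once a Series has labels, there are two ways to address an element: by label with loc and by position with iloc. The sketch below reuses the labels from the output above; the variable names are my own. Note that a label slice includes its end label, while a positional slice excludes its end position.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 3, 6, np.nan, 44, 1], index=list("bfedac"))

first_by_pos = s.iloc[0]        # position 0
by_label = s.loc["a"]           # label 'a'
label_slice = s.loc["f":"d"]    # label slice: both ends included
pos_slice = s.iloc[1:4]         # positional slice: end excluded

print(first_by_pos, by_label)
print(label_slice)
```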

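DataFrame type

First way to create a DataFrame
A DataFrame is a two-dimensional structure. We start by building a date range to use as the row index of our first DataFrame.

import pandas as pd
date1 = pd.date_range('20190114', periods=6)
print(date1)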

DatetimeIndex(['2019-01-14', '2019-01-15', '2019-01-16', '2019-01-17',
               '2019-01-18', '2019-01-19'],
              dtype='datetime64[ns]', freq='D')

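Next, create a DataFrame without specifying any index.

When index and columns are not specified, both default to integers starting at zero.
The first argument is a two-dimensional array, which can be generated with NumPy.

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(6, 4))  # 6 rows, 4 columns of random values
print(df)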

          0         1         2         3
0  0.019099  0.685904 -1.748481  0.944940
1 -1.754479 -0.229799 -0.581667  0.594955
2  0.302477  0.017760  0.747850 -0.516475
3  0.980783  0.215233  0.905535 -0.844875
4 -1.004730 -0.745205  0.409685  0.044063
5  1.302190 -0.355871 -0.009040  0.575193

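Now specify the index. Any one-dimensional list can be used for the row index; it does not have to be a date range.

import numpy as np
import pandas as pd
date1 = pd.date_range('20190114', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=date1, columns=['a', 'b', 'c', 'd'])  # index sets the row labels, columns sets the column labels
print(df)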

                   a         b         c         d
2019-01-14 -0.936922  0.727929 -0.021961 -1.042981
2019-01-15 -0.428813  0.304645  0.397171  1.601983
2019-01-16  1.529067 -0.745912 -0.028289 -1.497475
2019-01-17 -1.756153 -0.870725 -0.161252 -1.607953
2019-01-18 -1.474564  0.364908  1.038624  0.247872
2019-01-19 -0.071704  1.684375 -0.745785  0.301716

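Second way to create a DataFrame

Besides passing a two-dimensional array to DataFrame, we can also pass the data as a dictionary.
Each key of the dictionary becomes a column, and its value can be any object that can be converted to a Series.
Unlike a Series, where all values share one dtype, a DataFrame only requires the values within each column to share a dtype.

import numpy as np
import pandas as pd
df1 = pd.DataFrame({
    'A': 1.,
    'B': pd.Timestamp('20190114'),
    'C': np.array([3] * 4, dtype='int32'),
    'E': pd.Categorical(["test", "train", "test", "train"]),
    'F': 'foo',
    'G': pd.Series([1, 2, 3, 4]),
})
print(df1)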

     A          B  C      E    F  G
0  1.0 2019-01-14  3   test  foo  1
1  1.0 2019-01-14  3  train  foo  2
2  1.0 2019-01-14  3   test  foo  3
3  1.0 2019-01-14  3  train  foo  4

Simple operations

import numpy as np
import pandas as pd
df1 = pd.DataFrame({
    'A': 1.,
    'B': pd.Timestamp('20190114'),
    'C': np.array([3] * 4, dtype='int32'),
    'E': pd.Categorical(["test", "train", "test", "train"]),
    'F': 'foo',
    'G': pd.Series([1, 2, 3, 4]),
})

print(df1.dtypes)      # dtype of each column
print(df1.index)       # row labels
print(df1.columns)     # column labels
print(df1.values)      # all values as an array
print(df1.describe())  # summary statistics (count, mean, std, quartiles, ...) for the numeric columns

A           float64
B    datetime64[ns]
C             int32
E          category
F            object
G             int64
dtype: object
RangeIndex(start=0, stop=4, step=1)
Index(['A', 'B', 'C', 'E', 'F', 'G'], dtype='object')
[[1.0 Timestamp('2019-01-14 00:00:00') 3 'test' 'foo' 1]
 [1.0 Timestamp('2019-01-14 00:00:00') 3 'train' 'foo' 2]
 [1.0 Timestamp('2019-01-14 00:00:00') 3 'test' 'foo' 3]
 [1.0 Timestamp('2019-01-14 00:00:00') 3 'train' 'foo' 4]]
         A    C         G
count  4.0  4.0  4.000000
mean   1.0  3.0  2.500000
std    0.0  0.0  1.290994
min    1.0  3.0  1.000000
25%    1.0  3.0  1.750000
50%    1.0  3.0  2.500000
75%    1.0  3.0  3.250000
max    1.0  3.0  4.000000
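
The describe() output above only covers the numeric columns. If you want to pick those columns yourself first, select_dtypes can do it. This is a small sketch on a cut-down version of df1; the names numeric and stats are mine.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    "A": 1.0,
    "C": np.array([3] * 4, dtype="int32"),
    "F": "foo",
    "G": pd.Series([1, 2, 3, 4]),
})

numeric = df1.select_dtypes(include="number")  # keep only numeric columns
stats = numeric.describe()

print(list(numeric.columns))
print(stats.loc["mean"])
```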

Basic operations

Accessing data

Selecting data by label

df.loc[index, columns] selects data by row and column labels.

Using loc (select by label)

import numpy as np
import pandas as pd
date1 = pd.date_range('20190114', periods=6)
df = pd.DataFrame(np.arange(2, 26).reshape(6, 4), index=date1, columns=['a', 'b', 'c', 'd'])
print(df)
print(df.loc['2019-01-15', 'b'])  # a row label plus a column label picks out a single value
print(df.loc[:, 'b'])             # all rows, column 'b'
print(df.loc['2019-01-18', :])    # all columns, row '2019-01-18'

             a   b   c   d
2019-01-14   2   3   4   5
2019-01-15   6   7   8   9
2019-01-16  10  11  12  13
2019-01-17  14  15  16  17
2019-01-18  18  19  20  21
2019-01-19  22  23  24  25

7

2019-01-14     3
2019-01-15     7
2019-01-16    11
2019-01-17    15
2019-01-18    19
2019-01-19    23
Freq: D, Name: b, dtype: int32

a    18
b    19
c    20
d    21
Name: 2019-01-18 00:00:00, dtype: int32

Selecting by position

Using iloc (select by position)

import numpy as np
import pandas as pd
date1 = pd.date_range('20190114', periods=6)
df = pd.DataFrame(np.arange(2, 26).reshape(6, 4), index=date1, columns=['a', 'b', 'c', 'd'])
print(df)
print(df.iloc[0])    # first row
print(df.iloc[1])    # second row
print(df.iloc[0:4])  # first through fourth rows (positions 0 to 3)

             a   b   c   d
2019-01-14   2   3   4   5
2019-01-15   6   7   8   9
2019-01-16  10  11  12  13
2019-01-17  14  15  16  17
2019-01-18  18  19  20  21
2019-01-19  22  23  24  25

a    2
b    3
c    4
d    5
Name: 2019-01-14 00:00:00, dtype: int32

a    6
b    7
c    8
d    9
Name: 2019-01-15 00:00:00, dtype: int32

             a   b   c   d
2019-01-14   2   3   4   5
2019-01-15   6   7   8   9
2019-01-16  10  11  12  13
2019-01-17  14  15  16  17
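
One difference between loc and iloc that is easy to trip over: a label slice with loc includes its end label, while a positional slice with iloc excludes its end position. A sketch on the same data (by_label and by_pos are names I made up):

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20190114", periods=6)
df = pd.DataFrame(np.arange(2, 26).reshape(6, 4), index=dates, columns=list("abcd"))

# Label slices include both endpoints: 3 rows, 2 columns.
by_label = df.loc["2019-01-14":"2019-01-16", "a":"b"]

# Positional slices exclude the end: rows 0-2, columns 0-1.
by_pos = df.iloc[0:3, 0:2]

print(by_label)
print(by_label.equals(by_pos))
```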

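Combining the two approaches: selecting by position and by label

Mixed selection: older tutorials use df.ix for this, but ix was deprecated in pandas 0.20 and removed in pandas 1.0. Chaining iloc (rows by position) with a column list (columns by label) gives the same result.

import numpy as np
import pandas as pd
date1 = pd.date_range('20190114', periods=6)
df = pd.DataFrame(np.arange(2, 26).reshape(6, 4), index=date1, columns=['a', 'b', 'c', 'd'])
print(df)
print(df.iloc[0:3][['a', 'b']])  # first three rows by position, columns 'a' and 'b' by label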

             a   b   c   d
2019-01-14   2   3   4   5
2019-01-15   6   7   8   9
2019-01-16  10  11  12  13
2019-01-17  14  15  16  17
2019-01-18  18  19  20  21
2019-01-19  22  23  24  25

             a   b
2019-01-14   2   3
2019-01-15   6   7
2019-01-16  10  11

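Conditional selection

import numpy as np
import pandas as pd
date1 = pd.date_range('20190114', periods=6)
df = pd.DataFrame(np.arange(2, 26).reshape(6, 4), index=date1, columns=['a', 'b', 'c', 'd'])
print(df)
print(df[df.a > 8])                      # the outer df is what gets filtered; df.a > 8 builds the boolean mask from column 'a'
print(df[df['b'] == 11])                 # rows where column 'b' equals 11
print(df[(df.a == 10) & (df.c == 12)])   # both conditions must hold; each needs its own parentheses
print(df.a[df.b > 8])                    # values of column 'a' in the rows where 'b' is greater than 8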

             a   b   c   d
2019-01-14   2   3   4   5
2019-01-15   6   7   8   9
2019-01-16  10  11  12  13
2019-01-17  14  15  16  17
2019-01-18  18  19  20  21
2019-01-19  22  23  24  25

             a   b   c   d
2019-01-16  10  11  12  13
2019-01-17  14  15  16  17
2019-01-18  18  19  20  21
2019-01-19  22  23  24  25

             a   b   c   d
2019-01-16  10  11  12  13

             a   b   c   d
2019-01-16  10  11  12  13

2019-01-16    10
2019-01-17    14
2019-01-18    18
2019-01-19    22
Freq: D, Name: a, dtype: int32
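
The examples above only combine conditions with &. Here is a short sketch of the other common pieces: | for "or", ~ for "not", and isin for membership tests. The thresholds and variable names are my own choices for illustration.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20190114", periods=6)
df = pd.DataFrame(np.arange(2, 26).reshape(6, 4), index=dates, columns=list("abcd"))

either = df[(df.a < 6) | (df.d > 21)]   # rows matching at least one condition
not_small = df[~(df.a < 10)]            # rows where the condition is False
picked = df[df["b"].isin([7, 19])]      # rows whose 'b' is one of the listed values

print(either)
print(not_small)
print(picked)
```

As with &, each condition needs its own parentheses, because | and ~ bind more tightly than the comparison operators.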

Row operations

Basic operations

import numpy as np
import pandas as pd
date1 = pd.date_range('20190114', periods=6)
df = pd.DataFrame(np.arange(2, 26).reshape(6, 4), index=date1, columns=['a', 'b', 'c', 'd'])
print(df)
# View the first or last few rows; the default is five rows when no argument is given
print(df.head(3))
print(df.tail(2))
# View specific rows, either with a positional slice or with a label slice
print(df[0:1])
print(df['2019-01-14':'2019-01-17'])
# iloc can be used as well
# print(df.iloc[0])
# print(df.iloc[1])
# print(df.iloc[0:4])

             a   b   c   d
2019-01-14   2   3   4   5
2019-01-15   6   7   8   9
2019-01-16  10  11  12  13
2019-01-17  14  15  16  17
2019-01-18  18  19  20  21
2019-01-19  22  23  24  25

             a   b   c   d
2019-01-14   2   3   4   5
2019-01-15   6   7   8   9
2019-01-16  10  11  12  13

             a   b   c   d
2019-01-18  18  19  20  21
2019-01-19  22  23  24  25

            a  b  c  d
2019-01-14  2  3  4  5

             a   b   c   d
2019-01-14   2   3   4   5
2019-01-15   6   7   8   9
2019-01-16  10  11  12  13
2019-01-17  14  15  16  17

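Adding a row

import numpy as np
import pandas as pd
date1 = pd.date_range('20190114', periods=6)
df = pd.DataFrame(np.arange(2, 26).reshape(6, 4), index=date1, columns=['a', 'b', 'c', 'd'])
print(df)
data = {'a': 49, 'b': 34, 'c': 12, 'd': 98}  # the row to add
s = pd.Series(data)  # build a one-dimensional Series from the dict
# The name of the Series becomes the label of the new row
s.name = '2019-01-20 00:00:00'
# DataFrame.append was removed in pandas 2.0; pd.concat is its replacement
df = pd.concat([df, s.to_frame().T])
print(df)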

             a   b   c   d
2019-01-14   2   3   4   5
2019-01-15   6   7   8   9
2019-01-16  10  11  12  13
2019-01-17  14  15  16  17
2019-01-18  18  19  20  21
2019-01-19  22  23  24  25
                      a   b   c   d
2019-01-14 00:00:00   2   3   4   5
2019-01-15 00:00:00   6   7   8   9
2019-01-16 00:00:00  10  11  12  13
2019-01-17 00:00:00  14  15  16  17
2019-01-18 00:00:00  18  19  20  21
2019-01-19 00:00:00  22  23  24  25
2019-01-20 00:00:00  49  34  12  98
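
In the output above the index shows a time part because the new label was a plain string, so the index stopped being a pure datetime index. One alternative, sketched here as a suggestion rather than as the method used above, is to assign the new row through loc with a pd.Timestamp label, which keeps the DatetimeIndex:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20190114", periods=6)
df = pd.DataFrame(np.arange(2, 26).reshape(6, 4), index=dates, columns=list("abcd"))

# Assigning to a label that does not exist yet appends a new row.
df.loc[pd.Timestamp("2019-01-20")] = [49, 34, 12, 98]

print(df)
print(type(df.index).__name__)
```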

Deleting a row

Without labels, delete by the default integer index

import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(12).reshape(3, 4), columns=['A', 'B', 'C', 'D'])
print(df)
# Drop rows by index
print(df.drop([0]))     # drop the first row
print(df.drop([1, 2]))  # drop the second and third rows

With labels, delete by label

import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(12).reshape(3, 4), index=['a', 'b', 'c'], columns=['A', 'B', 'C', 'D'])
print(df)
# Drop rows by index
print(df.drop(['a']))       # drop the first row
print(df.drop(['b', 'c']))  # drop the second and third rows

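With a datetime index, passing a position or a date string straight to drop() caused errors when I tried it. drop() matches labels, never positions, so drop([1]) raises a KeyError: the integer 1 is not a label in the index. Depending on the pandas version, a plain date string may not be matched against the Timestamp labels either. Looking up the label by position with df.index[1], or passing a pd.Timestamp, avoids both problems.

import numpy as np
import pandas as pd
date1 = pd.date_range('20190114', periods=6)
df = pd.DataFrame(np.arange(2, 26).reshape(6, 4), index=date1, columns=['a', 'b', 'c', 'd'])

print(df.drop(df.index[1]))                # by position: look up the label of the second row first
print(df.drop(pd.Timestamp('2019-01-15')))  # by label: pass the label as a Timestamp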
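
To see the label-versus-position behaviour of drop() for yourself, here is a small sketch. It catches the KeyError, shows the errors="ignore" option, and uses a boolean mask on the index as another way to remove a row. The variable names are mine.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20190114", periods=6)
df = pd.DataFrame(np.arange(2, 26).reshape(6, 4), index=dates, columns=list("abcd"))

# drop() looks up labels; the integer 1 is not a label in a DatetimeIndex.
try:
    df.drop([1])
    raised = False
except KeyError:
    raised = True
print(raised)

# errors="ignore" skips missing labels instead of raising.
unchanged = df.drop([1], errors="ignore")
print(len(unchanged))

# A boolean mask on the index removes a row without calling drop() at all.
without_15 = df[df.index != pd.Timestamp("2019-01-15")]
print(without_15)
```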

Column operations

import numpy as np
import pandas as pd
date1 = pd.date_range('20190114', periods=6)
df = pd.DataFrame(np.arange(2, 26).reshape(6, 4), index=date1, columns=['a', 'b', 'c', 'd'])
print(df)
# print(df.columns)         # all column labels

# print(df['a'])            # column 'a'

# print(df['a'].iloc[1])    # second row of column 'a'
# print(df['a'].iloc[:3])   # first three rows of column 'a'
# print(df[['a', 'b']])     # columns 'a' and 'b'

Adding a column

import numpy as np
import pandas as pd
date1 = pd.date_range('20190114', periods=6)
df = pd.DataFrame(np.arange(2, 26).reshape(6, 4), index=date1, columns=['a', 'b', 'c', 'd'])
print(df)
print(df.columns)
df['序列'] = range(1, len(df) + 1)  # a new column needs one value per row, so generate len(df) values
print(df)
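
Plain assignment always puts the new column at the end and changes df in place. Two other options, sketched here with column names (total, row_no) that I made up: assign returns a new DataFrame and leaves the original alone, and insert places the column at a chosen position.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20190114", periods=6)
df = pd.DataFrame(np.arange(2, 26).reshape(6, 4), index=dates, columns=list("abcd"))

# assign() returns a copy with the extra column; df itself is untouched.
with_total = df.assign(total=df["a"] + df["b"])

# insert() modifies df in place and puts the column at position 0.
df.insert(0, "row_no", list(range(1, len(df) + 1)))

print(with_total)
print(df)
```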

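Deleting a column

import numpy as np
import pandas as pd
date1 = pd.date_range('20190114', periods=6)
df = pd.DataFrame(np.arange(2, 26).reshape(6, 4), index=date1, columns=['a', 'b', 'c', 'd'])
df['序列'] = range(1, len(df) + 1)  # add the column first so there is something to delete
print(df)
df = df.drop('序列', axis=1)  # axis=1 drops a column rather than a row
print(df)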

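The .drop() method

Things to watch out for

drop() can either leave the original DataFrame untouched or modify it directly, depending on the inplace argument.

import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(12).reshape(3, 4), index=['a', 'b', 'c'], columns=['A', 'B', 'C', 'D'])
print(df)
# Drop rows by index
date3 = df.drop(['a'])  # drop the first row
print(date3)
print(df)

By default (inplace=False), drop() returns a new DataFrame without the dropped row, and the original data is left unchanged.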

   A  B   C   D
a  0  1   2   3
b  4  5   6   7
c  8  9  10  11

   A  B   C   D
b  4  5   6   7
c  8  9  10  11

   A  B   C   D
a  0  1   2   3
b  4  5   6   7
c  8  9  10  11

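With inplace=True, drop() deletes in place: it returns nothing, and the original data is changed.

import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(12).reshape(3, 4), index=['a', 'b', 'c'], columns=['A', 'B', 'C', 'D'])
print(df)
# Drop rows by index
date3 = df.drop(['a'], inplace=True)  # drop the first row in place
print(date3)
print(df)

The output shows that the return value is None and that the row has been removed from the original data.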

   A  B   C   D
a  0  1   2   3
b  4  5   6   7
c  8  9  10  11
None
   A  B   C   D
b  4  5   6   7
c  8  9  10  11
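
Besides the axis argument, drop() also accepts the index= and columns= keywords, which I find easier to read. A short sketch with variable names of my own:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4), index=["a", "b", "c"], columns=["A", "B", "C", "D"])

no_a = df.drop(columns=["A"])   # same as df.drop("A", axis=1)
same = df.drop("A", axis=1)
no_row = df.drop(index=["a"])   # same as df.drop(["a"])

print(no_a)
print(no_row)
print(df.shape)  # the original is unchanged because inplace defaults to False
```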

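The .pop() method

pop() removes the selected column from the original DataFrame and returns it; the original no longer contains that column.

import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(12).reshape(3, 4), index=['a', 'b', 'c'], columns=['A', 'B', 'C', 'D'])
print(df)
date = df.pop('A')
print(date)
print(df)

The output shows that column 'A' is returned and removed from df.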

   A  B   C   D
a  0  1   2   3
b  4  5   6   7
c  8  9  10  11

a    0
b    4
c    8
Name: A, dtype: int32

   B   C   D
a  1   2   3
b  5   6   7
c  9  10  11
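
For comparison, here is a small sketch of pop() next to the del statement. Both change the DataFrame in place, but only pop() hands the removed column back. The variable name popped is mine.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4), index=["a", "b", "c"], columns=["A", "B", "C", "D"])

popped = df.pop("A")  # removes 'A' from df and returns it as a Series
del df["B"]           # removes 'B' from df and returns nothing

print(popped)
print(df)
```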