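Data transformation in pandas covers filtering, cleaning, and similar operations.
Removing duplicates
duplicated() reports, for each row, whether it duplicates an earlier row
drop_duplicates() removes duplicate rows (keeping the first occurrence)
There is not much to explain here, so let's go straight to an example (all sessions below assume import pandas as pd and import numpy as np):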
In [20]: s = pd.DataFrame({'key':['a']*4+['b']*3,'key0':[1,1,2,3,3,4,4]})
In [21]: s.duplicated()
Out[21]:
0 False
1 True
2 False
3 False
4 False
5 False
6 True
dtype: bool
In [22]: s.drop_duplicates()
Out[22]:
key key0
0 a 1
2 a 2
3 a 3
4 b 3
5 b 4
In [23]: s.drop_duplicates('key') # drop duplicate rows based on a single column
Out[23]:
key key0
0 a 1
4 b 3
In [24]: s.drop_duplicates(['key','key0']) # pass a list of columns to drop rows that repeat on that combination
Out[24]:
key key0
0 a 1
2 a 2
3 a 3
4 b 3
5 b 4
In [25]: s.key.drop_duplicates() # calling it on a single column (a Series) works too
Out[25]:
0 a
4 b
Name: key, dtype: object
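Both methods keep the first occurrence by default. As a minimal sketch of the alternatives, the keep parameter can be set to 'last' or False; the variable names dup_last, dup_all and last_per_key are only illustrative and do not come from the session above.

```python
import pandas as pd

s = pd.DataFrame({'key': ['a'] * 4 + ['b'] * 3,
                  'key0': [1, 1, 2, 3, 3, 4, 4]})

# keep='last': every occurrence except the last one is flagged as a duplicate
dup_last = s.duplicated(keep='last')

# keep=False: every member of a duplicated group is flagged
dup_all = s.duplicated(keep=False)

# Deduplicate on 'key' only, keeping the last row seen for each key
last_per_key = s.drop_duplicates('key', keep='last')

print(dup_last.tolist())
print(dup_all.tolist())
print(last_per_key)
```

With keep='last' the surviving rows are the final ones in each group, which is useful when later rows are considered more up to date.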
Transforming data with a function or mapping
In [61]: data = pd.DataFrame({'food':['bacon','pulled pork','bacon','Pastrami','corned beef','Bacon','pastrami','honey ham','nova lox'],
...: 'ounces':[4,3,12,6,7.5,8,3,5,6]})
In [62]: data
Out[62]:
food ounces
0 bacon 4.0
1 pulled pork 3.0
2 bacon 12.0
3 Pastrami 6.0
4 corned beef 7.5
5 Bacon 8.0
6 pastrami 3.0
7 honey ham 5.0
8 nova lox 6.0
Suppose you want to add a column giving the type of animal each food comes from. First, write a mapping from meat to animal.
In [63]: meat_to_animal = {
...: 'bacon':'pig',
...: 'pulled pork':'pig',
...: 'pastrami':'cow',
...: 'corned beef':'cow',
...: 'honey ham':'pig',
...: 'nova lox':'salmon'
...: }
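The map method of a Series accepts either a function or a dict-like object that holds the mapping.
There is one problem here, though: some food names are capitalized and others are not, so the case has to be normalized first.
In [64]: data['animal'] = data['food'].map(str.lower).map(meat_to_animal)
map can also be given a function. Here an anonymous (lambda) function is applied to every element of data['food']:
In [65]: data['food'].map(lambda x:meat_to_animal[x.lower()])
Out[65]:
0       pig
1       pig
2       pig
3       cow
4       cow
5       pig
6       cow
7       pig
8    salmon
Name: food, dtype: object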
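The two map styles behave differently when a key is missing. This is a small sketch of that difference, assuming a made-up food 'tofu' that has no entry in the mapping. Mapping with a dict yields NaN for unknown keys, while the lambda from the session would raise KeyError unless it uses dict.get.

```python
import pandas as pd

# 'tofu' is a hypothetical entry with no mapping, added only for illustration
food = pd.Series(['bacon', 'Pastrami', 'tofu'])
meat_to_animal = {'bacon': 'pig', 'pastrami': 'cow'}

# Vectorized lower-casing followed by a dict lookup: unknown keys become NaN
animal = food.str.lower().map(meat_to_animal)

# A function can supply an explicit default through dict.get instead
animal_default = food.map(lambda x: meat_to_animal.get(x.lower(), 'unknown'))

print(animal)
print(animal_default)
```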
Replacing values
replace()
In [26]: re = pd.Series([1,-9999,-9999,2,3,4,5,-1000,0])
In [27]: re
Out[27]:
0       1
1   -9999
2   -9999
3       2
4       3
5       4
6       5
7   -1000
8       0
dtype: int64
In [28]: re.replace(-9999,np.nan) # replace a single value
Out[28]:
0       1.0
1       NaN
2       NaN
3       2.0
4       3.0
5       4.0
6       5.0
7   -1000.0
8       0.0
dtype: float64
In [29]: re.replace([-9999,-1000],np.nan) # replace several values with one value
Out[29]:
0    1.0
1    NaN
2    NaN
3    2.0
4    3.0
5    4.0
6    5.0
7    NaN
8    0.0
dtype: float64
In [30]: re.replace([-9999,-1000],[np.nan,0]) # a list of values and a matching list of replacements
Out[30]:
0    1.0
1    NaN
2    NaN
3    2.0
4    3.0
5    4.0
6    5.0
7    0.0
8    0.0
dtype: float64
In [32]: re.replace({-9999:np.nan,-1000:0}) # the argument can also be a dict
Out[32]:
0    1.0
1    NaN
2    NaN
3    2.0
4    3.0
5    4.0
6    5.0
7    0.0
8    0.0
dtype: float64
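Every replace call above returns a new Series and leaves the original untouched. A minimal sketch to confirm this, using illustrative names (readings, cleaned) that are not part of the session:

```python
import numpy as np
import pandas as pd

readings = pd.Series([1, -9999, 2, -1000])

# The dict form maps each sentinel value to its own replacement
cleaned = readings.replace({-9999: np.nan, -1000: 0})

print(readings.tolist())  # the original still holds the sentinel values
print(cleaned)
```

Introducing NaN is also why the results in the session switch from int64 to float64.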
Renaming axis indexes
rename() returns a modified copy of the data; pass inplace=True to modify the object in place instead
In [41]: data = pd.DataFrame(np.arange(6).reshape((2, 3)),index=pd.Index(['Oh', 'Co'], name='state'),
...: columns=pd.Index(['one', 'two', 'three'], name='number'))
In [42]: data.rename(index=str.title,columns=str.upper)
Out[42]:
number ONE TWO THREE
state
Oh 0 1 2
Co 3 4 5
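In [43]: data.rename(index={'co':'sx'},columns={'one':'first'}) # passing dicts renames only the listed labels; keys are matched exactly, so 'co' does not match 'Co' and the index is left unchanged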
Out[43]:
number first two three
state
Oh 0 1 2
Co 3 4 5
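The output above still shows Co because dict keys passed to rename are matched against labels exactly, including case. A short sketch of the contrast, where no_match and matched are illustrative names:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(6).reshape((2, 3)),
                    index=pd.Index(['Oh', 'Co'], name='state'),
                    columns=pd.Index(['one', 'two', 'three'], name='number'))

# 'co' is not an existing label, so nothing is renamed and no error is raised
no_match = data.rename(index={'co': 'sx'})

# 'Co' matches the label exactly, so it is renamed
matched = data.rename(index={'Co': 'sx'})

print(no_match.index.tolist())
print(matched.index.tolist())
print(data.index.tolist())  # rename returned copies; data itself is unchanged
```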
Discretization and binning
cut()
In [45]: ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
In [47]: bins = [18, 25, 35, 60, 100]
In [48]: cats = pd.cut(ages, bins) # which side of each interval is closed can be changed; for left-closed, right-open bins use pd.cut(ages, bins, right=False)
In [49]: cats # the result is a special Categorical object
Out[49]:
[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]
Bins can also be given names:
In [56]: group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
In [57]: pd.cut(ages, bins, labels=group_names)
Out[57]:
[Youth, Youth, Youth, YoungAdult, Youth, ..., YoungAdult, Senior, MiddleAged, MiddleAged, YoungAdult]
Length: 12
Categories (4, object): [Youth < YoungAdult < MiddleAged < Senior]
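The right=False option mentioned in the cut call only matters for values that fall exactly on a bin edge. A minimal sketch using the age 25 from the list above, where right_closed and left_closed are illustrative names:

```python
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]

right_closed = pd.cut(ages, bins)              # intervals like (18, 25]
left_closed = pd.cut(ages, bins, right=False)  # intervals like [18, 25)

# ages[2] == 25 sits on an edge, so its bin depends on which side is closed
print(right_closed[2])
print(left_closed[2])

# codes gives the integer position of each value's bin
print(right_closed.codes[2], left_closed.codes[2])
```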
If you pass the number of bins instead of explicit bin edges, cut computes equal-width bins from the range of the data:
In [58]: data = np.random.rand(20)
In [59]: pd.cut(data, 4, precision=2) # split into 4 bins, with bin edges shown to 2 digits of precision
Out[59]:
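[(0.29, 0.52], (0.75, 0.98], (0.75, 0.98], (0.057, 0.29], (0.29, 0.52], ...,
(0.75, 0.98], (0.75, 0.98], (0.75, 0.98], (0.057, 0.29], (0.29, 0.52]]
Length: 20
Categories (4, interval[float64]): [(0.057, 0.29] < (0.29, 0.52] < (0.52, 0.75] < (0.75, 0.98]]
qcut is a function similar to cut that bins the data based on sample quantiles. Depending on how the data is distributed, cut will usually not put the same number of data points in each bin; because qcut uses sample quantiles, it produces bins of roughly equal size.
The session above does not include a qcut example.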
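For completeness, here is a minimal qcut sketch showing the roughly-equal bin sizes described above. The seeded generator, the sample size of 1000, and the names values, quartiles and counts are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)       # seeded so the run is repeatable
values = rng.standard_normal(1000)   # normally distributed sample data

# Cut at the sample quartiles rather than at equal-width edges
quartiles = pd.qcut(values, 4)

# Count how many observations landed in each bin
counts = pd.Series(quartiles).value_counts()
print(counts)
```

With equal-width bins from cut, the outer bins of normally distributed data would hold far fewer points than the inner ones.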
Permutation and random sampling
To select some rows of a DataFrame at random, generate the row positions randomly and then select those rows.
The numpy.random.permutation() function produces a random reordering.
In [67]: df = pd.DataFrame(np.arange(5 * 4).reshape(5, 4))
In [68]: ran = np.random.permutation(5)
In [70]: ran
Out[70]: array([2, 3, 0, 1, 4])
In [71]: df
Out[71]:
    0   1   2   3
0   0   1   2   3
1   4   5   6   7
2   8   9  10  11
3  12  13  14  15
4  16  17  18  19
In [72]: df.take(ran)
Out[72]:
    0   1   2   3
2   8   9  10  11
3  12  13  14  15
0   0   1   2   3
1   4   5   6   7
4  16  17  18  19
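The session reorders all five rows. To pick only a subset without replacement, one option is to slice the permutation before calling take. The sketch below assumes a subset size of 3 and a seeded generator, and also shows DataFrame.sample as a built-in alternative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(5 * 4).reshape(5, 4))

rng = np.random.default_rng(42)
order = rng.permutation(len(df))   # random ordering of the row positions

# The first 3 positions of a permutation are 3 distinct random rows
subset = df.take(order[:3])

# DataFrame.sample draws rows directly; random_state makes it repeatable
sampled = df.sample(n=3, random_state=42)

print(subset)
print(sampled)
```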
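Computing indicator/dummy variables
A categorical variable can be converted into a "dummy matrix" or "indicator matrix". If a column of a DataFrame has k distinct values, you can derive a matrix or DataFrame with k columns containing only 1s and 0s.
In [74]: df = pd.DataFrame({'key':['b','b','a','c','a','b'],'data1' : range(6)})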
In [75]: pd.get_dummies(df['key'])
Out[75]:
a b c
0 0 1 0
1 0 1 0
2 1 0 0
3 0 0 1
4 1 0 0
5 0 1 0
Adding a prefix to the columns of the indicator DataFrame:
In [76]: dummies = pd.get_dummies(df['key'],prefix = 'key')
In [77]: dummies
Out[77]:
   key_a  key_b  key_c
0      0      1      0
1      0      1      0
2      1      0      0
3      0      0      1
4      1      0      0
5      0      1      0
While we are here, take a look at the following example:
In [80]: df['data1']
Out[80]:
0    0
1    1
2    2
3    3
4    4
5    5
Name: data1, dtype: int64
In [81]: type(df['data1'])
Out[81]: pandas.core.series.Series
In [82]: df[['data1']]
Out[82]:
   data1
0      0
1      1
2      2
3      3
4      4
5      5
In [83]: type(df[['data1']])
Out[83]: pandas.core.frame.DataFrame
df['data1'] returns a Series, whereas df[['data1']] returns a DataFrame
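The Series-versus-DataFrame distinction matters when the dummy columns are combined with the rest of the data. This sketch shows one common way to do that; the name df_with_dummy is illustrative. Note that recent pandas versions return boolean dummy columns rather than 0/1 integers, so the sketch converts them explicitly.

```python
import pandas as pd

df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'], 'data1': range(6)})

# Build the indicator columns and force 0/1 integers regardless of version
dummies = pd.get_dummies(df['key'], prefix='key').astype(int)

# df[['data1']] is a DataFrame, so join returns a DataFrame with all columns
df_with_dummy = df[['data1']].join(dummies)
print(df_with_dummy)
```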
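String manipulation
Python has simple, easy-to-use facilities for string and text processing. Most text operations are available as built-in methods of string objects, and regular expressions cover the more complex cases. pandas builds on this by letting you apply string methods and regular expressions to whole arrays of data, while also handling the nuisance of missing values.
String object methods
A few simple examples:
In [87]: zifuchuan = ' i can be a can, i do not balabala' # note the leading space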
In [88]: sp = zifuchuan.split(',')
In [89]: sp
Out[89]: [' i can be a can', ' i do not balabala']
In [90]: ':::'.join(sp)
Out[90]: ' i can be a can::: i do not balabala'
In [91]: zifuchuan.index('can')
Out[91]: 3
In [92]: zifuchuan.index('i')
Out[92]: 1
In [94]: zifuchuan.count('can')
Out[94]: 2
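A few related built-in methods are worth knowing alongside the ones above. The sketch reuses the same sentence and shows strip for trimming the pieces, find as a non-raising counterpart of index, and replace. The substring 'xyz' is just an arbitrary value that does not occur in the text.

```python
text = ' i can be a can, i do not balabala'

# split leaves surrounding whitespace on each piece; strip removes it
pieces = [part.strip() for part in text.split(',')]

# find returns -1 for a missing substring, whereas index raises ValueError
pos = text.find('xyz')
try:
    text.index('xyz')
    missing = False
except ValueError:
    missing = True

# replace substitutes every occurrence of the substring
replaced = text.replace('can', 'CAN')

print(pieces, pos, missing)
print(replaced)
```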
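Regular expressions
Regular expressions (regex) provide a flexible way to search for and match string patterns in text.
In Python, regular expressions are provided by the re module. Its functions fall into three categories: pattern matching, substitution, and splitting.
A summary of regular expressions: http://www.cnblogs.com/huxi/archive/2010/07/04/1771073.html
Here are a few simple examples; a fuller write-up on regular expressions will have to wait until there is time.
In [119]: import re # first import Python's re module
In [120]: text = 'I love\t you' # note: no trailing space here
In [121]: re.split(r'\s+',text)
Out[121]: ['I', 'love', 'you']
In [122]: text = 'I love\t you ' # this time there is a trailing space, so the split yields an empty string at the end
In [123]: re.split(r'\s+',text)
Out[123]: ['I', 'love', 'you', '']
In the example above, the regular expression is first compiled and then its split method is called on text.
It can also be written like this:
In [125]: patten = re.compile(r'\s+') # compile first to get a reusable regex object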
In [126]: patten.split(text)
Out[126]: ['I', 'love', 'you', '']
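The trailing empty string in the result comes from the whitespace at the end of the text. A small sketch of two ways to avoid it, where whitespace, tokens and tokens_alt are illustrative names:

```python
import re

whitespace = re.compile(r'\s+')   # compiled once, reusable across calls
text = 'I love\t you '

# Stripping the text first means no separator is left at either end
tokens = whitespace.split(text.strip())

# str.split() with no argument also splits on whitespace runs and drops empties
tokens_alt = text.split()

print(tokens)
print(tokens_alt)
```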
findall, search, match, sub
In [132]: text = """Dave dave@google.com
...: Steve steve@gmail.com
...: Rob rob@gmail.com
...: Ryan ryan@yahoo.com
...: """
In [133]: patten = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}' # matches email addresses
In [134]: regex = re.compile(patten,flags=re.IGNORECASE) # compile first, ignoring case
In [135]: regex.findall(text) # returns all matches
Out[135]: ['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']
In [137]: m = regex.search(text) # returns the first match
In [138]: m
Out[138]: <_sre.SRE_Match object; span=(5, 20), match='dave@google.com'>
In [141]: text[m.start():m.end()]
Out[141]: 'dave@google.com'
In [144]: m.string # the original string that was searched
Out[144]: 'Dave dave@google.com\nSteve steve@gmail.com\nRob rob@gmail.com\nRyan ryan@yahoo.com\n'
In [143]: print(regex.match(text)) # match only matches at the start of the string; text starts with 'Dave', so nothing matches and None is returned
None
In [147]: print(regex.sub('replace',text)) # replace every match
Dave replace
Steve replace
Rob replace
Ryan replace
Splitting each match into groups:
In [148]: patten = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
In [149]: regex = re.compile(patten,flags=re.IGNORECASE)
In [153]: regex.findall(text)
Out[153]:
[('dave', 'google', 'com'),
('steve', 'gmail', 'com'),
('rob', 'gmail', 'com'),
('ryan', 'yahoo', 'com')]
In [155]: s = regex.search(text)
In [156]: s
Out[156]: <_sre.SRE_Match object; span=(5, 20), match='dave@google.com'>
In [157]: s.groups()
Out[157]: ('dave', 'google', 'com')
Naming the matched groups:
In [163]: regex = re.compile(r"""
...: (?P<username>[A-Z0-9._%+-]+)
...: @(?P<domain>[A-Z0-9.-]+)
...: \.(?P<suffix>[A-Z]{2,4})""",
...: flags=re.IGNORECASE|re.VERBOSE)
...:
In [164]: m = regex.match('wesm@bright.net')
In [165]: m.groupdict()
Out[165]: {'domain': 'bright', 'suffix': 'net', 'username': 'wesm'}
In [171]: f = regex.search(text)
In [172]: f.group('username')
Out[172]: 'dave'
In [173]: f = regex.findall(text)
In [174]: f
Out[174]:
[('dave', 'google', 'com'),
('steve', 'gmail', 'com'),
('rob', 'gmail', 'com'),
('ryan', 'yahoo', 'com')]
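As the last output shows, findall returns plain tuples even when the groups are named. One way to keep the names for every match is to iterate with finditer and call groupdict on each match object. The sketch below uses made-up example.com and example.org addresses and an illustrative records variable.

```python
import re

regex = re.compile(r"""
    (?P<username>[A-Z0-9._%+-]+)
    @(?P<domain>[A-Z0-9.-]+)
    \.(?P<suffix>[A-Z]{2,4})""",
    flags=re.IGNORECASE | re.VERBOSE)

# Hypothetical sample text, not the addresses from the session above
text = "Ann ann@example.com\nBob bob@example.org\n"

# finditer yields match objects, so the group names remain available
records = [m.groupdict() for m in regex.finditer(text)]
print(records)
```

A list of dicts like this can be passed straight to pd.DataFrame to get one column per named group.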
Vectorized string functions in pandas
In [176]: series = pd.Series({'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com','Rob': 'rob@gmail.com', 'Wes': np.nan})
In [177]: series
Out[177]:
Dave     dave@google.com
Rob        rob@gmail.com
Steve    steve@gmail.com
Wes                  NaN
dtype: object
The str attribute of a Series gives access to string methods that operate on its contents:
In [178]: series.str.contains('rob')
Out[178]:
Dave False
Rob True
Steve False
Wes NaN
dtype: object
In [179]: patten = '([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\\.([A-Z]{2,4})'
In [180]: series.str.findall(patten,re.IGNORECASE)
Out[180]:
Dave [(dave, google, com)]
Rob [(rob, gmail, com)]
...
map raises an error when it encounters an NA value, whereas the str methods pass it through:
In [199]: matches = series.str.upper()
In [200]: matches
Out[200]:
Dave     DAVE@GOOGLE.COM
Rob        ROB@GMAIL.COM
Steve    STEVE@GMAIL.COM
Wes                  NaN
dtype: object
In [202]: matches = series.map(str.upper)
---------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-202-029ad3593723> in <module>()
----> 1 matches = series.map(str.upper)
...
TypeError: descriptor 'upper' requires a 'str' object but received a 'float'
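If map is still preferred, its na_action='ignore' option propagates missing values instead of passing them to the function. A minimal sketch, assuming a shortened two-entry series with a made-up example.com address:

```python
import numpy as np
import pandas as pd

# Hypothetical data: one real string and one missing value
series = pd.Series({'Dave': 'dave@example.com', 'Wes': np.nan})

# na_action='ignore' skips NaN, so str.upper never receives a float
upper_map = series.map(str.upper, na_action='ignore')

# The vectorized str accessor handles NaN without any extra option
upper_str = series.str.upper()

print(upper_map)
print(upper_str)
```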