OliverQin

1.axis与skipna参数的使用

注意:np.nan表示空值

# 建立数据集
import numpy as np
import pandas as pd

df = pd.DataFrame({\'key1\':[4,5,6,np.nan,2], # np.nan表示空值
                   \'key2\':[1,2,np.nan,15,10],
                   \'key3\':[1,2,3,\'m\',\'n\'],
                   \'key4\':[\'a1\',\'a2\',\'a3\',\'a4\',\'a5\']},index = [\'a\',\'b\',\'c\',\'d\',\'e\']
)
print(df)
print(\'-----------------分割线-------------------\')
# 打印每列数据类型
print(df[\'key1\'].dtype,df[\'key2\'].dtype,df[\'key3\'].dtype,df[\'key4\'].dtype)

# 求整体平均值(只会对数字列进行统计)
print(\'----------------整体平均------------------\')
print(df.mean())

# 根据索引求平均值
print(\'key2列的平均值:\',df[\'key2\'].mean())

print(\'----------------axis参数------------------\')
# 单独统计一列的平均值,axis=0表示以列计算 axis=1表示以行计算,默认axis=0
m1 = df.mean(axis=1) 
print(m1)

print(\'----------------skipna参数------------------\')
# skipna参数:是否忽略NaN,默认True,如False,有NaN的列统计结果仍未NaN
m2 = df.mean(skipna=False) 
print(m2)

输出结果:

2.常用统计方法

其他随机方法列表如下:

  • count:统计非Na值的数量
  • min:统计最小值
  • max:统计最大值
  • quantile:统计分位数,参数q确定位置,例如:quantile(q=0.75)
  • sum:求和
  • median:求中位数
  • std:标准差
  • var:方差
  • skew:偏度
  • kurt:峰度

实战演练

import numpy as np
import pandas as pd

df = pd.DataFrame({\'key1\':[1,2,3,4,25,6,7,8,9,10],
                  \'key2\':[10,20,30,300,50,60,70,80,90,100]}) 
print(df)
print(\'-\' * 20)
# 求最大值
print("计数count:")
print(df.count())

print(\'-\' * 20)

print("最小值min:")
print(df.min())

print(\'-\' * 20)

print("最大值min:")
print(df.max())

print(\'-\' * 20)

print("第三分位数quantile:")
print(df.quantile(0.75))

print(\'-\' * 20)

print("求和sum:")
print(df.sum())

print(\'-\' * 20)

print("求中位数median:")
print(df.median())

print(\'-\' * 20)

print("求标准差std:")
print(df.std())

print(\'-\' * 20)

print("求方差var:")
print(df.var())

print(\'-\' * 20)

print("求偏度skew:")
print(df.skew())

print(\'-\' * 20)

print("求峰度kurt:")
print(df.kurt())


输出结果:

3.求累计值:cumsum(累计和),累计积

import numpy as np
import pandas as pd

df = pd.DataFrame({\'key1\':[1,2,3,4,25,6,7,8,9,10],
                  \'key2\':[10,20,30,300,50,60,70,80,90,100]}) 
print(df)
print(\'-------------累计和&累计积-----------------\')
df[\'key1累计和\'] = df[\'key1\'].cumsum()
df[\'key2累计和\'] = df[\'key2\'].cumsum()

df[\'key1累计积\'] = df[\'key1\'].cumsum()
df[\'key2累计积\'] = df[\'key1\'].cumsum()
print(df) 

输出结果:

4.唯一值:unique

import pandas as pd
s = pd.Series(list(\'abdsssdsd\'))
print(s)

# 求唯一值
sq = s.unique()
print(\'-\' * 50)
print(sq,type(sq))

输出结果:

5.值计数:value_counts()

值计数主要针对新的Series,计算出不同值出现的频率

import pandas as pd
s = pd.Series(list(\'abdsssdsd\'))
print(s)
print(\'-\' * 50)
print(s.value_counts(sort = False)) # sort参数:排序,默认为True

输出结果:

6.成员资格:isin()

判断某个数据集的元素是否在另外一个数据集中,返回结果为布尔值。

s = pd.Series(np.arange(10,13))
df = pd.DataFrame({\'key1\':list(\'asdc\'),
                  \'key2\':np.arange(4,8)})
print(s)
print(\'-\' * 50)
print(df)
print(\'-\' * 50)

print(s.isin([12,15]))
print(\'-\' * 50)
print(df.isin([\'a\',\'bc\',\'10\',8]))

输出结果:

分类:

技术点:

相关文章:

  • 2021-08-13
  • 2021-12-10
  • 2021-12-28
  • 2020-05-07
  • 2021-11-30
猜你喜欢
  • 2021-05-31
  • 2021-08-01
  • 2021-05-17
  • 2021-11-23
  • 2022-01-09
  • 2021-06-26
  • 2021-12-02
相关资源
相似解决方案