pandas: find percentile stats of a given column (answers)

【Question title】: pandas: find percentile stats of a given column
【Posted】: 2017-01-27 15:47:42
【Question】:

I have a pandas DataFrame my_df, and I can find the mean(), median(), and mode() of a given column:

my_df['field_A'].mean()
my_df['field_A'].median()
my_df['field_A'].mode()

Is it possible to get more detailed statistics, such as the 90th percentile? Thanks!

【Question comments】:

Do you mean this?
For example, suppose percentile() is the percentile function. If my_df['field_A'].percentile(90) returns x, the field_A values

Tags: python python-2.7 pandas statistics

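【Solution 1】:

You can even pass several columns that contain null values and get several quantiles at once (I use the 95th percentile for outlier treatment):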

my_df[['field_A','field_B']].dropna().quantile([0.0, .5, .90, .95])
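One thing worth checking with this approach is what dropna() does when only one of the columns has a missing value. The sketch below uses made-up data (the values in my_df are assumptions for illustration, not from the question) to compare the result with and without dropna(). As far as I know, quantile() already skips NaN within each column, so dropna() mainly matters because it removes the whole row.

```python
import numpy as np
import pandas as pd

# Hypothetical data: field_A is missing in the last row, field_B is not.
my_df = pd.DataFrame({
    "field_A": [1.0, 2.0, 3.0, 4.0, np.nan],
    "field_B": [10.0, 20.0, 30.0, 40.0, 100.0],
})

qs = [0.0, 0.5, 0.90, 0.95]

# dropna() removes every row with a NaN in any selected column,
# so field_B loses its value 100.0 as well.
with_dropna = my_df[["field_A", "field_B"]].dropna().quantile(qs)

# Without dropna(), NaN is skipped separately within each column,
# so field_B keeps all five values.
without_dropna = my_df[["field_A", "field_B"]].quantile(qs)

print(with_dropna)
print(without_dropna)
```

Here the median of field_B is computed from four values in the first case and from five in the second, so the two results differ. Which one you want depends on whether incomplete rows should count.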

【Comments】:

【Solution 2】:

You can use the pandas.DataFrame.quantile() function, as shown below.

import pandas as pd
import random

A = [ random.randint(0,100) for i in range(10) ]
B = [ random.randint(0,100) for i in range(10) ]

df = pd.DataFrame({ 'field_A': A, 'field_B': B })
df
#    field_A  field_B
# 0       90       72
# 1       63       84
# 2       11       74
# 3       61       66
# 4       78       80
# 5       67       75
# 6       89       47
# 7       12       22
# 8       43        5
# 9       30       64

df.field_A.mean()   # Same as df['field_A'].mean()
# 54.399999999999999

df.field_A.median() 
# 62.0

# Call `quantile(q)` to get the q-th quantile,
# where `q` is a fraction between 0 and 1.

df.field_A.quantile(0.1) # 10th percentile
# 11.9

df.field_A.quantile(0.5) # same as median
# 62.0

df.field_A.quantile(0.9) # 90th percentile
# 89.10000000000001

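【Comments】:

The output is not always equal to one of the cell values. Does it do any interpolation?
Yes. If you look at the API for quantile(), you'll see it takes an argument that controls how to interpolate when the requested quantile falls between two positions in your data: 'linear', 'lower', 'higher', 'midpoint', or 'nearest'. By default it performs linear interpolation. These interpolation methods are discussed in the Wikipedia article on percentiles: en.wikipedia.org/wiki/Percentile
@stackoverflowuser2010 How do you get quantile(i) within a "Groupby"? For example, if I added a column named 'Category' with values 'a', 'b' and 'c' to the df above, would the code look like this? I tried df1= df['Category', 'field_A'].quantile(0.99,interpolation='higher') but it doesn't work. Cheers
@jwlon81: Are you trying to compute the quantile of the numbers within each group? If so, try this: df.groupby('Category').field_A.quantile(0.1). It returns the 10th percentile for each Category group.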
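To make the groupby suggestion above concrete, here is a minimal sketch. The data is made up for illustration; only the column names 'Category' and 'field_A' come from the thread. The attempt in the comment probably fails because df['Category', 'field_A'] is looked up as a single column keyed by a tuple, which does not exist in this frame.

```python
import pandas as pd

# Hypothetical data; 'Category' is the extra column proposed in the comment.
df = pd.DataFrame({
    "Category": ["a", "a", "a", "b", "b", "b", "c", "c", "c"],
    "field_A": [10, 20, 30, 40, 50, 60, 70, 80, 90],
})

# One quantile per group: the result is a Series indexed by Category.
p10 = df.groupby("Category")["field_A"].quantile(0.1)

# The interpolation keyword is accepted here as well;
# 'higher' returns a value that actually occurs in the group.
p99_higher = df.groupby("Category")["field_A"].quantile(0.99, interpolation="higher")

# Several quantiles at once: the result has a (Category, quantile) MultiIndex.
several = df.groupby("Category")["field_A"].quantile([0.1, 0.9])

print(p10)
print(p99_higher)
print(several)
```

Selecting the column with ["field_A"] instead of attribute access is just a style choice; both forms should give the same result here.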

【Solution 3】:

Assume a Series s

import numpy as np
import pandas as pd

s = pd.Series(np.arange(100))

Get the quantiles for [.1, .2, .3, .4, .5, .6, .7, .8, .9]

s.quantile(np.linspace(.1, 1, 9, endpoint=False))

0.1     9.9
0.2    19.8
0.3    29.7
0.4    39.6
0.5    49.5
0.6    59.4
0.7    69.3
0.8    79.2
0.9    89.1
dtype: float64

or

s.quantile(np.linspace(.1, 1, 9, endpoint=False), interpolation='lower')

0.1     9
0.2    19
0.3    29
0.4    39
0.5    49
0.6    59
0.7    69
0.8    79
0.9    89
dtype: int32

【Comments】:

Love the 'lower' keyword
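For anyone curious how 'lower' compares with the other interpolation options, here is a small sketch on a made-up four-element Series (the data and the choice of q=0.4 are my own assumptions, picked so the quantile falls between two values without an exact tie).

```python
import pandas as pd

# Small hypothetical Series so the results can be checked by hand.
s = pd.Series([10, 20, 30, 40])

# The 0.4 quantile sits at position 0.4 * (4 - 1) = 1.2 in the sorted data,
# i.e. between the values 20 (position 1) and 30 (position 2).
methods = ["linear", "lower", "higher", "midpoint", "nearest"]
results = {m: s.quantile(0.4, interpolation=m) for m in methods}

for method, value in results.items():
    print(method, value)
```

'lower', 'higher' and 'nearest' always return a value that is present in the data (20, 30 and 20 here), while 'linear' and 'midpoint' can return values in between (about 22 and 25 here).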

【Solution 4】:

I found that the following works:

my_df.dropna().quantile([0.0, .9])

【Comments】: