Percentile Distribution Graph: Answers

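【Question Title】: Percentile Distribution Graph
【Posted】: 2017-02-06 16:33:05
【Question Description】:

Does anyone know how to change the X-axis scale and ticks to display a percentile distribution like the graph below? The image comes from MATLAB, but I would like to generate it with Python (via Matplotlib or Seaborn).

Following the pointer from @paulh, I am now much closer. This code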

import matplotlib
matplotlib.use('Agg')

import numpy as np
import matplotlib.pyplot as plt
import probscale
import seaborn as sns

clear_bkgd = {'axes.facecolor':'none', 'figure.facecolor':'none'}
sns.set(style='ticks', context='notebook', palette="muted", rc=clear_bkgd)

fig, ax = plt.subplots(figsize=(8, 4))

x = [30, 60, 80, 90, 95, 97, 98, 98.5, 98.9, 99.1, 99.2, 99.3, 99.4]
y = np.arange(0, 12.1, 1)

ax.set_xlim(40, 99.5)
ax.set_xscale('prob')

ax.plot(x, y)
sns.despine(fig=fig)

produces the following plot (note the redistributed X axis).

I find it more useful than the standard scale.
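
For intuition about why the axis is redistributed: a probability scale places each percentile at the matching quantile of a reference distribution. The sketch below assumes that reference is the standard normal distribution (understood to be the mpl-probscale default, but treat that as an assumption) and uses only the standard library. `prob_axis_position` is a hypothetical helper name, not part of any library.

```python
from statistics import NormalDist


def prob_axis_position(percentile):
    """Position of a percentile (0 < percentile < 100) on a normal probability axis.

    Assumption: the axis uses the standard normal distribution as its reference.
    """
    return NormalDist().inv_cdf(percentile / 100.0)


# Equal steps in percentile are not equally spaced on such an axis:
# the tails are stretched relative to the middle.
for pct in (50, 90, 99, 99.4):
    print(f"{pct:>5}% -> {prob_axis_position(pct):.3f}")
```

The step from 99% to 99.4% takes up far more room on this axis than the step from 50% to 50.4%, which is why the upper percentiles become readable.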

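I contacted the author of the original graph, who gave me some pointers. It is actually a log-scale plot with the x-axis reversed, plotting the values [100-val], with the x-axis ticks labeled manually. The code below recreates the original image using the same sample data as the other plots here.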

import matplotlib
matplotlib.use('Agg')

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

clear_bkgd = {'axes.facecolor':'none', 'figure.facecolor':'none'}
sns.set(style='ticks', context='notebook', palette="muted", rc=clear_bkgd)

x = [30, 60, 80, 90, 95, 97, 98, 98.5, 98.9, 99.1, 99.2, 99.3, 99.4]
y = np.arange(0, 12.1, 1)

# Number of intervals to display.
# Later calculations add 2 to this number to pad it to align with the reversed axis
num_intervals = 3
x_values = 1.0 - 1.0/10**np.arange(0,num_intervals+2)

# Start with hard-coded lengths for 0,90,99
# Rest of array generated to display correct number of decimal places as precision increases
lengths = [1,2,2] + [int(v)+1 for v in list(np.arange(3,num_intervals+2))]

# Build the label string by trimming on the calculated lengths and appending %
labels = [str(100*v)[0:l] + "%" for v,l in zip(x_values, lengths)]


fig, ax = plt.subplots(figsize=(8, 4))

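ax.set_xscale('log')
ax.invert_xaxis()

ax.plot([100.0 - v for v in x], y)

# Pin the ticks at 100, 10, 1, ... (that is, 100 - percentile) so that each
# label stays attached to its own position regardless of the default locator
ax.set_xticks(100.0 / 10**np.arange(0, num_intervals+2))
ax.set_xticklabels(labels)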

ax.grid(True, linewidth=0.5, zorder=5)
ax.grid(True, which='minor', linewidth=0.5, linestyle=':')

sns.despine(fig=fig)

plt.savefig("test.png", dpi=300, format='png')

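Here is the resulting plot:

【Comments】:

Have you written any code yourself or made any attempt? If so, please post it here.
I don't understand at all why this question was closed as too broad. It lacks a good problem description, but the question itself is obvious from the graph. If there is a way to generate this kind of plot, it surely takes only a few lines of code, so an answer would not be too long, nor would many possible answers be expected.
@Chris Osterwood Please provide the matlab command that generates this kind of graph, and a clear problem description in text form rather than just a picture. You can post them as comments so that more experienced users can merge them into the question.
I think you want to use my library: phobson.github.io/mpl-probscale
@PaulH - Thanks very much! I have edited my question with code that uses mpl-probscale, and it gets closer to what I want.

Tags: matplotlib seaborn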

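【Solution 1】:

These kinds of graphs are popular in the low-latency community for plotting latency distributions. When dealing with latency, most of the interesting information tends to sit in the higher percentiles, so a logarithmic view tends to work better. I first saw these graphs used in https://github.com/giltene/jHiccup and https://github.com/HdrHistogram/.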
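
To make the "logarithmic view" concrete, here is a minimal sketch of the mapping these charts appear to use: a percentile p is placed at 1/(1 - p/100) on a log axis, so every extra "nine" moves one decade to the right. `tail_position` is an illustrative name, not something taken from jHiccup or HdrHistogram.

```python
import math


def tail_position(percentile):
    """Map a percentile in [0, 100) to 1 / (1 - p), the x coordinate on the log axis."""
    return 1.0 / (1.0 - percentile / 100.0)


# Each additional "nine" adds roughly one unit of log10(x).
for pct in (0, 50, 90, 99, 99.9, 99.99):
    x = tail_position(pct)
    print(f"{pct:>6}% -> x = {x:10.1f}, log10(x) = {math.log10(x):.2f}")
```

On a linear axis the range from 99% to 99.99% would occupy under 1% of the width; with this mapping it occupies as much room as the range from 0% to 99%.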

The referenced graph was generated by the following code

n = ceil(log10(length(values)));          
p = 1 - 1./10.^(0:0.01:n);
percentiles = prctile(values, p * 100);
semilogx(1./(1-p), percentiles);

The x-axis was labeled with the code below

labels = cell(n+1, 1);
for i = 1:n+1
  labels{i} = getPercentileLabel(i-1);
end
set(gca, 'XTick', 10.^(0:n));
set(gca, 'XTickLabel', labels);

% {'0%' '90%' '99%' '99.9%' '99.99%' '99.999%' '99.9999%'}
function label = getPercentileLabel(i)
    switch(i)
        case 0
            label = '0%';
        case 1
            label = '90%';
        case 2
            label = '99%';
        otherwise
            label = '99.';
            for k = 1:i-2
                label = [label '9'];
            end
            label = [label '%'];
    end
end
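
For readers who want the same thing in Python, here is a rough standard-library port of the MATLAB snippets above. It is a sketch under stated assumptions rather than an exact translation: `percentile_label`, `interpolated_percentile` and `percentile_curve` are hypothetical names, and the simple linear interpolation used here may differ slightly from MATLAB's `prctile` at the edges.

```python
import math


def percentile_label(i):
    """Label for the tick at x = 10**i: 0%, 90%, 99%, 99.9%, ..."""
    if i == 0:
        return "0%"
    if i == 1:
        return "90%"
    if i == 2:
        return "99%"
    return "99." + "9" * (i - 2) + "%"


def interpolated_percentile(sorted_values, fraction):
    """Linearly interpolated percentile of pre-sorted data; fraction is in [0, 1]."""
    pos = fraction * (len(sorted_values) - 1)
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (pos - lo)


def percentile_curve(values, steps_per_decade=100):
    """Return (n, xs, ys): decade count, 1/(1-p) positions, percentile values."""
    data = sorted(values)
    n = math.ceil(math.log10(len(data)))  # same n as in the MATLAB code
    xs, ys = [], []
    for k in range(n * steps_per_decade + 1):
        p = 1.0 - 10.0 ** (-k / steps_per_decade)
        xs.append(10.0 ** (k / steps_per_decade))  # equals 1 / (1 - p)
        ys.append(interpolated_percentile(data, p))
    return n, xs, ys
```

With Matplotlib the curve could then be drawn with `ax.semilogx(xs, ys)`, followed by `ax.set_xticks([10**i for i in range(n + 1)])` and `ax.set_xticklabels([percentile_label(i) for i in range(n + 1)])`.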

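【Discussion】:

Florian - thanks for posting the MATLAB code, I'm sure it will be useful to others in the future. I agree that this scale is easier to understand for data with "high-tail" distributions.

【Solution 2】:

The following Python code uses Pandas to read a csv file containing a list of recorded latency values (described as milliseconds, although the code treats the `response-time` column as seconds), records those values (in microseconds) in an HdrHistogram, saves the HdrHistogram to an hgrm file, and then uses Seaborn to plot a latency distribution graph from that file.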

import pandas as pd
from hdrh.histogram import HdrHistogram
from hdrh.dump import dump
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
import sys
import argparse

# Parse the command line arguments.

parser = argparse.ArgumentParser()
parser.add_argument('csv_file')
parser.add_argument('hgrm_file')
parser.add_argument('png_file')
args = parser.parse_args()

csv_file = args.csv_file
hgrm_file = args.hgrm_file
png_file = args.png_file

# Read the csv file into a Pandas data frame and generate an hgrm file.

csv_df = pd.read_csv(csv_file, index_col=False)

USECS_PER_SEC=1000000
MIN_LATENCY_USECS = 1
MAX_LATENCY_USECS = 24 * 60 * 60 * USECS_PER_SEC # 24 hours
# MAX_LATENCY_USECS = int(csv_df['response-time'].max()) * USECS_PER_SEC # 1 hour
LATENCY_SIGNIFICANT_DIGITS = 5
histogram = HdrHistogram(MIN_LATENCY_USECS, MAX_LATENCY_USECS, LATENCY_SIGNIFICANT_DIGITS)
for latency_sec in csv_df['response-time'].tolist():
    histogram.record_value(latency_sec*USECS_PER_SEC)
    # histogram.record_corrected_value(latency_sec*USECS_PER_SEC, 10)
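TICKS_PER_HALF_DISTANCE = 5
# Close the file before reading it back so that all output is flushed to disk.
with open(hgrm_file, 'wb') as hgrm_out:
    histogram.output_percentile_distribution(hgrm_out, USECS_PER_SEC, TICKS_PER_HALF_DISTANCE)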

# Read the generated hgrm file into a Pandas data frame.

hgrm_df = pd.read_csv(hgrm_file, comment='#', skip_blank_lines=True, sep=r"\s+", engine='python', header=0, names=['Latency', 'Percentile'], usecols=[0, 3])

# Plot the latency distribution using Seaborn and save it as a png file.

sns.set_theme()
sns.set_style("dark")
sns.set_context("paper")
sns.set_color_codes("pastel")

fig, ax = plt.subplots(1,1,figsize=(20,15))
fig.suptitle('Latency Results')

sns.lineplot(x='Percentile', y='Latency', data=hgrm_df, ax=ax)
ax.set_title('Latency Distribution')
ax.set_xlabel('Percentile (%)')
ax.set_ylabel('Latency (seconds)')
ax.set_xscale('log')
ax.set_xticks([1, 10, 100, 1000, 10000, 100000, 1000000, 10000000])
ax.set_xticklabels(['0', '90', '99', '99.9', '99.99', '99.999', '99.9999', '99.99999'])

fig.tight_layout()
fig.savefig(png_file)
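
The `usecols=[0, 3]` selection above keeps the first and fourth columns, so the column named `Percentile` in the data frame actually holds 1/(1-Percentile). That is why the ticks at 1, 10, 100, ... are labeled 0, 90, 99, and so on. The sketch below parses such text with the standard library only. The sample rows are made up for illustration, the column layout (Value, Percentile, TotalCount, 1/(1-Percentile)) is an assumption about the hgrm format, and `parse_hgrm` is a hypothetical helper.

```python
def parse_hgrm(text):
    """Return (latency, inverse_tail) pairs from percentile-distribution text.

    inverse_tail is the assumed 1/(1-Percentile) column used for the log x axis.
    """
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if not fields or line.lstrip().startswith("#"):
            continue  # blank line or footer comment
        try:
            numbers = [float(f) for f in fields]
        except ValueError:
            continue  # header row with column names
        if len(numbers) >= 4:  # assumed: the 100% row may lack the last column
            rows.append((numbers[0], numbers[3]))
    return rows


# Made-up sample in the assumed layout, for illustration only.
SAMPLE = """\
       Value     Percentile TotalCount 1/(1-Percentile)

       0.010 0.000000000000          1           1.00
       0.020 0.900000000000         90          10.00
       0.050 0.990000000000         99         100.00
       0.090 1.000000000000        100
#[Mean    =        0.015, StdDeviation   =        0.008]
"""

rows = parse_hgrm(SAMPLE)
```

Each pair in `rows` corresponds to one point of the plotted line: the second element is the x position on the log axis and the first is the latency.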

【Discussion】: