[Posted]: 2016-05-02 07:03:24
[Question]:
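I have a set of log-normally distributed samples. I can visualize the samples using a histogram with either a linear or a logarithmic x-axis. I can perform a fit to the histogram to get the PDF and then scale it to the histogram in the plot with the linear x-axis; see also this previously posted question.
However, I am not able to properly plot the PDF into the plot with the logarithmic x-axis.
Unfortunately, it is not only a problem of scaling the area of the PDF to the histogram: the PDF is also shifted to the left, as can be seen in the following plot.
My question now is: what am I doing wrong here? Using the CDF to plot the expected histogram, as suggested in this answer, works. I would just like to know what I am doing wrong in this code, since as far as I understand it should work too.
This is the Python code (sorry that it is rather long, but I wanted to post a complete, stand-alone version):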
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats
# generate log-normal distributed set of samples
np.random.seed(42)
samples = np.random.lognormal( mean=1, sigma=.4, size=10000 )
# make a fit to the samples
shape, loc, scale = scipy.stats.lognorm.fit( samples, floc=0 )
x_fit = np.linspace( samples.min(), samples.max(), 100 )
samples_fit = scipy.stats.lognorm.pdf( x_fit, shape, loc=loc, scale=scale )
# plot a histogram with linear x-axis
plt.subplot( 1, 2, 1 )
N_bins = 50
counts, bin_edges, ignored = plt.hist( samples, N_bins, histtype='stepfilled', label='histogram' )
# calculate area of histogram (area under PDF should be 1)
area_hist = .0
for ii in range( counts.size ):
    area_hist += (bin_edges[ii+1]-bin_edges[ii]) * counts[ii]
# overplot fit onto histogram
plt.plot( x_fit, samples_fit*area_hist, label='fitted and area-scaled PDF', linewidth=2)
plt.legend()
# make a histogram with a log10 x-axis
plt.subplot( 1, 2, 2 )
# equally sized bins (in log10-scale)
bins_log10 = np.logspace( np.log10( samples.min() ), np.log10( samples.max() ), N_bins )
counts, bin_edges, ignored = plt.hist( samples, bins_log10, histtype='stepfilled', label='histogram' )
# calculate area of histogram
area_hist_log = .0
for ii in range( counts.size ):
    area_hist_log += (bin_edges[ii+1]-bin_edges[ii]) * counts[ii]
# get pdf-values for log10 - spaced intervals
x_fit_log = np.logspace( np.log10( samples.min()), np.log10( samples.max()), 100 )
samples_fit_log = scipy.stats.lognorm.pdf( x_fit_log, shape, loc=loc, scale=scale )
# overplot fit onto histogram
plt.plot( x_fit_log, samples_fit_log*area_hist_log, label='fitted and area-scaled PDF', linewidth=2 )
plt.xscale( 'log' )
plt.xlim( bin_edges.min(), bin_edges.max() )
plt.legend()
plt.show()
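For comparison, the CDF-based approach mentioned above (the one that does work) can be sketched as follows. This is only a minimal illustration under the same data and fit as in the code above; the names `cdf_at_edges` and `expected_counts` are introduced here for illustration and do not come from the linked answer.

```python
import numpy as np
import scipy.stats

# same synthetic data and fit as in the question
np.random.seed(42)
samples = np.random.lognormal(mean=1, sigma=.4, size=10000)
shape, loc, scale = scipy.stats.lognorm.fit(samples, floc=0)

# log-spaced bin edges, as used for the right-hand plot
N_bins = 50
bin_edges = np.logspace(np.log10(samples.min()), np.log10(samples.max()), N_bins)

# expected count per bin = N * (CDF(right edge) - CDF(left edge))
cdf_at_edges = scipy.stats.lognorm.cdf(bin_edges, shape, loc=loc, scale=scale)
expected_counts = samples.size * np.diff(cdf_at_edges)

# observed counts in the same bins, for comparison
counts, _ = np.histogram(samples, bins=bin_edges)
```

To overlay the result, `expected_counts` could be plotted against the geometric bin centres, `np.sqrt(bin_edges[:-1] * bin_edges[1:])`, on top of the histogram. Because each bin gets its own probability mass, no separate area scaling is needed.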
Update 1:
I forgot to mention the versions I am using:
python 2.7.6
numpy 1.8.2
matplotlib 1.3.1
scipy 0.13.3
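Update 2:
As pointed out by @Christoph and @zaxliu (thanks to both), the problem lies in the scaling of the PDF. It works when I use the same bins as for the histogram, as in @zaxliu's solution, but I still have problems when I use a higher resolution for the PDF (as in my example above). This is shown in the following figure:
The code for the right-hand plot is (I left out the imports and the generation of the data samples, both of which you can find in the example above):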
# equally sized bins in log10-scale
bins_log10 = np.logspace( np.log10( samples.min() ), np.log10( samples.max() ), N_bins )
counts, bin_edges, ignored = plt.hist( samples, bins_log10, histtype='stepfilled', label='histogram' )
# calculate length of each bin (required for scaling PDF to histogram)
bins_log_len = np.zeros( bins_log10.size )
for ii in range( counts.size ):
    bins_log_len[ii] = bin_edges[ii+1]-bin_edges[ii]
# get pdf-values for same intervals as histogram
samples_fit_log = scipy.stats.lognorm.pdf( bins_log10, shape, loc=loc, scale=scale )
# overplot fitted and scaled PDF onto histogram
plt.plot( bins_log10, np.multiply(samples_fit_log,bins_log_len)*sum(counts), label='PDF using histogram bins', linewidth=2 )
# make another pdf with a finer resolution
x_fit_log = np.logspace( np.log10( samples.min()), np.log10( samples.max()), 100 )
samples_fit_log = scipy.stats.lognorm.pdf( x_fit_log, shape, loc=loc, scale=scale )
# calculate length of each bin (required for scaling PDF to histogram)
# in addition, estimate middle point for more accuracy (should in principle also be done for the other PDF)
bins_log_len = np.diff( x_fit_log )
samples_log_center = np.zeros( x_fit_log.size-1 )
for ii in range( x_fit_log.size-1 ):
    samples_log_center[ii] = .5*(samples_fit_log[ii] + samples_fit_log[ii+1] )
# scale PDF to histogram
# NOTE: THIS IS NOT WORKING PROPERLY (SEE FIGURE)
pdf_scaled2hist = np.multiply(samples_log_center,bins_log_len)*sum(counts)
# overplot fit onto histogram
plt.plot( .5*(x_fit_log[:-1]+x_fit_log[1:]), pdf_scaled2hist, label='PDF using own bins', linewidth=2 )
plt.xscale( 'log' )
plt.xlim( bin_edges.min(), bin_edges.max() )
plt.legend(loc=3)
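One possible explanation for the mismatch of the "own bins" curve is that the PDF is multiplied by the spacing of the fine grid, whereas the histogram counts depend on the width of the histogram bins. The sketch below assumes this is the cause and scales a finely sampled PDF by the width a histogram bin would have at each x; `ratio`, `x_fine` and `hist_bin_width_at_x` are names made up for this illustration.

```python
import numpy as np
import scipy.stats

# same synthetic data and fit as in the question
np.random.seed(42)
samples = np.random.lognormal(mean=1, sigma=.4, size=10000)
shape, loc, scale = scipy.stats.lognorm.fit(samples, floc=0)

# histogram with log-spaced bins, as in the right-hand plot
N_bins = 50
bin_edges = np.logspace(np.log10(samples.min()), np.log10(samples.max()), N_bins)
counts, _ = np.histogram(samples, bins=bin_edges)

# constant ratio between neighbouring edges of the log-spaced histogram bins
ratio = bin_edges[1] / bin_edges[0]

# fine grid for a smooth curve; its own spacing plays no role in the scaling
x_fine = np.logspace(np.log10(samples.min()), np.log10(samples.max()), 500)
pdf_fine = scipy.stats.lognorm.pdf(x_fine, shape, loc=loc, scale=scale)

# width (in linear x) of a histogram bin whose geometric centre sits at x
hist_bin_width_at_x = x_fine * (np.sqrt(ratio) - 1.0 / np.sqrt(ratio))

# approximate expected counts per histogram bin, evaluated on the fine grid
curve = samples.size * pdf_fine * hist_bin_width_at_x
```

Plotting `curve` against `x_fine` with `plt.xscale('log')` should then follow the histogram regardless of how many points the fine grid has, since the grid resolution no longer enters the scaling.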
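[Discussion]:
-
Why don't you use the CDF to create the expected histogram, as I suggested in my answer to your other question (stackoverflow.com/questions/34893615/…)?
-
I should add that when I follow your suggestion and use the CDF to plot the expected histogram, it works. I would just like to know what I am doing wrong in the example above, since as far as I understand it should work too...
-
I may be wrong here, but it looks like you are using the usual PDF while creating a histogram with variable-sized bins (so that they have equal width in the log plot). There is no reason to assume that the PDF and the histogram should look the same, right?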
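The point raised in the last comment can be made explicit with a change of variables. The following is a sketch of the standard argument, where $f_X$ is the fitted PDF, $N$ the number of samples, and $\Delta y$ the (constant) bin width in $\log_{10}$ units; these symbols are introduced here and are not part of the original post.

```latex
% density of y = log10(x), obtained from the density of x
y = \log_{10} x
\quad\Longrightarrow\quad
f_Y(y) = f_X(x)\,\left|\frac{dx}{dy}\right| = f_X(x)\, x \ln 10

% expected count in a histogram bin of constant width \Delta y on the log10 axis
n(x) \approx N \, f_X(x)\, x \,\ln(10)\, \Delta y

% for a log-normal with parameters \mu, \sigma:
x\, f_X(x) = \frac{1}{\sigma\sqrt{2\pi}}
             \exp\!\left(-\frac{(\ln x - \mu)^2}{2\sigma^2}\right)
\quad\text{peaks at } x = e^{\mu},
\qquad
f_X(x) \text{ peaks at } x = e^{\mu - \sigma^2}
```

With $\mu = 1$ and $\sigma = 0.4$ as in the example, the histogram with log-spaced bins peaks near $e^{1} \approx 2.72$, while the plain PDF peaks near $e^{0.84} \approx 2.32$. This is consistent with the PDF appearing shifted to the left when it is only multiplied by a constant area factor.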
Tags: python matplotlib scipy statistics