【Question title】:Finding areas of non-overlapping edges of two distributions
【Posted】:2021-10-25 20:51:06
【Question】:

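I am trying to find a way to calculate the overlapping (intersection) and non-overlapping areas of two distributions. Based on some posts (e.g. FINDING AREA), I was able to work out how to calculate the intersection of the two plots, but I could not find a way to get the non-overlapping parts.
Given this example distribution: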

import pandas as pd
import numpy as np
import seaborn as sns

data1=np.random.normal(0, 0.1, 100)
data2=np.random.normal(0, 0.3, 100)

x0=pd.Series(data1)
x1=pd.Series(data2)
kwargs = dict(hist_kws={'alpha':.01}, kde_kws={'linewidth':2})
sns.distplot(x0,bins=3, color="dodgerblue", **kwargs)
sns.distplot(x1,bins=3, color="dodgerblue", **kwargs)

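I would like to know how to calculate the areas at the edges of the two distributions.
I should mention that I also have the underlying data (in case this is something I can do with the data itself).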

【Comments】:

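  • My initial thought would be to find the intersection points of the two distributions and then break the problem into sections. Say the distribution with the smaller peak is A and the one with the larger peak is B. For the purple section, integrate A - B from the start of A to the left intersection point. For the red section, take the right intersection point of A and B, then integrate A - B up to the end of A. Data would definitely help, though, and would help distract me from actual work :)
  • Yes, please provide the data.
  • The area under a KDE is 1, so the non-overlapping part is 1 minus the overlapping part. In general, the curves will not divide neatly into 4 separate regions.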

Tags: python numpy matplotlib scipy seaborn


【Solution 1】:

Taking inspiration from the answer you linked, you should first compute the KDE curves from the data with scipy.stats.gaussian_kde:

data1 = np.random.normal(0, 0.1, 100)
data2 = np.random.normal(0, 0.3, 100)

kde1 = gaussian_kde(data1, bw_method = 0.5)
kde2 = gaussian_kde(data2, bw_method = 0.5)

xmin = min(data1.min(), data2.min())
xmax = max(data1.max(), data2.max())
dx = 0.2*(xmax - xmin)
xmin -= dx
xmax += dx

x = np.linspace(xmin, xmax, 1000)
kde1_x = kde1(x)
kde2_x = kde2(x)

Then you need to find the intersection points between the KDE curves:

idx = np.argwhere(np.diff(np.sign(kde1_x - kde2_x))).flatten()

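where idx is the array of grid indices at which the difference kde1_x - kde2_x changes sign, i.e. the (approximate) intersection points.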
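To make the sign-change trick concrete, here is a small toy sketch (the arrays below are made up for illustration and are not the KDE curves from the answer). It shows which index the expression reports, and one possible refinement, linear interpolation between the two neighbouring samples, to estimate the crossing position more precisely than the grid spacing.

```python
import numpy as np

# Toy curves on a coarse grid (illustrative values only)
x = np.array([0.0, 1.0, 2.0, 3.0])
y1 = np.array([0.0, 1.0, 2.0, 3.0])  # rising curve
y2 = np.array([3.0, 2.0, 1.0, 0.0])  # falling curve

d = y1 - y2  # [-3, -1, 1, 3]: the sign flips between x[1] and x[2]
idx = np.argwhere(np.diff(np.sign(d))).flatten()
print(idx)  # [1]

# Linear interpolation of the zero of d between samples idx and idx + 1
x_cross = x[idx] - d[idx] * (x[idx + 1] - x[idx]) / (d[idx + 1] - d[idx])
print(x_cross)  # [1.5]
```

In this toy case the index returned is the last sample before the sign change, so the crossing lies between x[idx] and x[idx + 1].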
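Finally, you can compute the areas with numpy.trapz (available as numpy.trapezoid from NumPy 2.0 onward), slicing x, kde1_x and kde2_x with the indices computed above: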

area1 = np.trapz(kde2_x[:idx[0]] - kde1_x[:idx[0]], x[:idx[0]]) # area between the two KDEs (kde2 above kde1), from the leftmost grid point to the first intersection point
area2 = np.trapz(kde2_x[idx[1]:] - kde1_x[idx[1]:], x[idx[1]:]) # area between the two KDEs (kde2 above kde1), from the second intersection point to the rightmost grid point
area3 = np.trapz(np.minimum(kde1_x, kde2_x), x) # overlap area: area under the lower of the two curves, over the whole x range
area4 = np.trapz(kde1_x[idx[0]:idx[1]] - kde2_x[idx[0]:idx[1]], x[idx[0]:idx[1]]) # area under the higher KDE (kde1) excluding the lower one, between the two intersection points
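The slicing above relies on there being exactly two crossings. As a rough alternative sketch, the same four quantities can be obtained without idx by clipping the difference of the curves at zero. Two things here are assumptions rather than part of the answer: the seeded generator (for reproducibility) and the choice to separate the left and right tails at the peak of the narrower KDE (the hypothetical variable `split`). scipy.integrate.trapezoid is used in place of numpy.trapz; it applies the same trapezoidal rule.

```python
import numpy as np
from scipy.integrate import trapezoid
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)  # seeded only so the sketch is reproducible
data1 = rng.normal(0, 0.1, 100)
data2 = rng.normal(0, 0.3, 100)

kde1 = gaussian_kde(data1, bw_method=0.5)
kde2 = gaussian_kde(data2, bw_method=0.5)

lo = min(data1.min(), data2.min())
hi = max(data1.max(), data2.max())
pad = 0.2 * (hi - lo)
x = np.linspace(lo - pad, hi + pad, 1000)
y1, y2 = kde1(x), kde2(x)

overlap = np.minimum(y1, y2)         # lower envelope of the two curves
excess1 = np.clip(y1 - y2, 0, None)  # part of kde1 above kde2, zero elsewhere
excess2 = np.clip(y2 - y1, 0, None)  # part of kde2 above kde1, zero elsewhere

# Assumption: the two tails are separated at the peak of the narrower KDE
split = x[np.argmax(y1)]
left = x <= split

area_left = trapezoid(excess2[left], x[left])      # counterpart of area1
area_right = trapezoid(excess2[~left], x[~left])   # counterpart of area2
area_overlap = trapezoid(overlap, x)               # counterpart of area3
area_peak = trapezoid(excess1, x)                  # counterpart of area4

print(area_left, area_right, area_overlap, area_peak)
```

A quick consistency check is that area_overlap + area_peak should match the integral of kde1 over the grid, and area_left + area_right + area_overlap should match the integral of kde2.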

Complete code

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde


data1 = np.random.normal(0, 0.1, 100)
data2 = np.random.normal(0, 0.3, 100)

kde1 = gaussian_kde(data1, bw_method = 0.5)
kde2 = gaussian_kde(data2, bw_method = 0.5)

xmin = min(data1.min(), data2.min())
xmax = max(data1.max(), data2.max())
dx = 0.2*(xmax - xmin)
xmin -= dx
xmax += dx

x = np.linspace(xmin, xmax, 1000)
kde1_x = kde1(x)
kde2_x = kde2(x)


idx = np.argwhere(np.diff(np.sign(kde1_x - kde2_x))).flatten()

area1 = np.trapz(kde2_x[:idx[0]] - kde1_x[:idx[0]], x[:idx[0]]) # area between the two KDEs (kde2 above kde1), from the leftmost grid point to the first intersection point
area2 = np.trapz(kde2_x[idx[1]:] - kde1_x[idx[1]:], x[idx[1]:]) # area between the two KDEs (kde2 above kde1), from the second intersection point to the rightmost grid point
area3 = np.trapz(np.minimum(kde1_x, kde2_x), x) # overlap area: area under the lower of the two curves, over the whole x range
area4 = np.trapz(kde1_x[idx[0]:idx[1]] - kde2_x[idx[0]:idx[1]], x[idx[0]:idx[1]]) # area under the higher KDE (kde1) excluding the lower one, between the two intersection points


fig, ax = plt.subplots()

ax.plot(x, kde1_x, color = 'dodgerblue', linewidth = 2)
ax.plot(x, kde2_x, color = 'orangered', linewidth = 2)

ax.fill_between(x[:idx[0]], kde2_x[:idx[0]], kde1_x[:idx[0]], color = 'dodgerblue', alpha = 0.3, label = 'area1')
ax.fill_between(x[idx[1]:], kde2_x[idx[1]:], kde1_x[idx[1]:], color = 'orangered', alpha = 0.3, label = 'area2')
ax.fill_between(x, np.minimum(kde1_x, kde2_x), 0, color = 'lime', alpha = 0.3, label = 'area3')
ax.fill_between(x[idx[0]:idx[1]], kde1_x[idx[0]:idx[1]], kde2_x[idx[0]:idx[1]], color = 'gold', alpha = 0.3, label = 'area4')

ax.plot(x[idx], kde2_x[idx], 'ko')

handles, labels = ax.get_legend_handles_labels()
labels[0] += f': {area1 * 100:.1f}%'
labels[1] += f': {area2 * 100:.1f}%'
labels[2] += f': {area3 * 100:.1f}%'
labels[3] += f': {area4 * 100:.1f}%'
ax.legend(handles, labels)

plt.show()

【Comments】:

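  • Nice. Some remarks: I think the regions need to extend up to the intersection point, otherwise there will be a small gap in between (visible when zooming in). So e.g. area1 = np.trapz(kde2_x[:idx[0]+1] - kde1_x[:idx[0]+1], x[:idx[0]+1]). Apart from that, np.diff causes a shift of one position (I think the real intersection lies between idx[0]+1 and idx[0]+2). See also How to find the exact intersection of a curve?
  • My idea for area1 would be np.trapz(kde2_x[x<0] - np.minimum(kde1_x, kde2_x)[x<0], x[x<0]), assuming we know the blue region lies entirely to the left of zero. Similarly for red. Yellow is simply kde1 minus green. Of course, either approach assumes there are exactly 2 intersection points.
  • Nice... but note that the gaussian_kde object has methods for integration, which would be easier and more accurate.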
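
As a sketch of what the last comment may have in mind: gaussian_kde provides integrate_box_1d(low, high), which integrates a one-dimensional KDE analytically over an interval. Combined with scipy.optimize.brentq to refine each crossing, this avoids trapezoidal integration on a grid. The approach below is one possible reading, not the commenter's code. It assumes every crossing lies inside the sampled grid, and the names `diff`, `edges`, `mass1` and `mass2` are illustrative.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)  # seeded only so the sketch is reproducible
data1 = rng.normal(0, 0.1, 100)
data2 = rng.normal(0, 0.3, 100)

kde1 = gaussian_kde(data1, bw_method=0.5)
kde2 = gaussian_kde(data2, bw_method=0.5)

def diff(t):
    """Difference of the two KDEs at a scalar position t."""
    return kde1(t)[0] - kde2(t)[0]

# Coarse grid, used only to bracket the crossings
lo = min(data1.min(), data2.min())
hi = max(data1.max(), data2.max())
pad = 0.2 * (hi - lo)
x = np.linspace(lo - pad, hi + pad, 1000)
brackets = np.flatnonzero(np.diff(np.sign(kde1(x) - kde2(x))))

# Refine each crossing inside its bracket [x[i], x[i + 1]]
roots = [brentq(diff, x[i], x[i + 1]) for i in brackets]

# Between consecutive crossings one curve stays above the other,
# so the overlap on each segment is the smaller of the two masses.
edges = [-np.inf, *roots, np.inf]
segments = list(zip(edges[:-1], edges[1:]))
mass1 = np.array([kde1.integrate_box_1d(a, b) for a, b in segments])
mass2 = np.array([kde2.integrate_box_1d(a, b) for a, b in segments])

overlap = np.minimum(mass1, mass2).sum()
only1 = np.clip(mass1 - mass2, 0, None)  # per-segment area where kde1 is on top
only2 = np.clip(mass2 - mass1, 0, None)  # per-segment area where kde2 is on top

print(roots)
print(overlap, only1, only2)
```

With exactly two crossings, only2[0] and only2[-1] would correspond to the left and right edge areas and only1[1] to the central peak area. Because each KDE integrates to 1 over the whole real line, overlap + only1.sum() and overlap + only2.sum() should both equal 1.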