How to pass a weights argument to seaborn's jointplot() or the underlying kdeplot?

【Question title】: How to pass weights argument to seaborn's jointplot() or the underlying kdeplot?
【Posted】: 2015-05-20 08:44:02
【Question】:

I tried to create a joint plot with seaborn using the following code:

import seaborn as sns 
import pandas as pd
import numpy as np
import matplotlib.pylab as plt

testdata = pd.DataFrame(np.array([[100, 1, 3], [5, 2, 6], [25, 3, -4]]), index=['A', 'B', 'C'], columns=['counts', 'X', 'Y'])
counts = testdata['counts'].values
sns.jointplot('X', 'Y', data=testdata, kind='kde', joint_kws={'weights':counts})
plt.savefig('test.png')

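Passing joint_kws does not raise an error, but the resulting plot shows that the weights are clearly not taken into account.

I also tried to do this with JointGrid, passing the weights to the marginal distributions: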

g = sns.JointGrid('X', 'Y', data=testdata)
x = testdata['X'].values
y = testdata['Y'].values
g.ax_marg_x.hist(x, bins=np.arange(-10,10), weights=counts)
g.ax_marg_y.hist(y, bins=np.arange(-10,10), weights=counts, orientation='horizontal')
g.plot_marginals(sns.distplot)
g.plot_joint(sns.kdeplot, joint_kws={'weights':counts})
plt.savefig('test.png')

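But this only works for the marginal distributions; the joint plot is still unweighted.

Does anyone know how to do this?

【Comments】:

I may be missing something here, but what exactly are you expecting to see?
Sorry for being unclear. I want to weight the data points. A, B, and C have weights of 100, 5, and 25 respectively, so data point "A" should matter more than "B" and contribute more to the distribution. The marginal distributions in the lower figure show this weighted distribution, in contrast to the marginal distributions in the upper figure.
Here is a way to do it without seaborn: gist.github.com/tillahoffmann/…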

Tags: python seaborn

【Answer 1】:

Unfortunately, this does not seem to be possible.

A feature request was filed in December 2015, but it was closed as will-not-fix.

This is also discussed in another StackOverflow question: weights option for seaborn distplot?

【Comments】:

【Answer 2】:

I know this question is from a while ago, but I have been able to use weights in a joint plot with the following:

p = sns.jointplot(data=v, x="x", y="y",  kind="hist", weights=v.weights, bins=50)

Here v is a DataFrame with the columns [x, y, weights].

【Comments】:

【Answer 3】:

You are really close.

Note that jointplot does the following (paraphrased):

def jointplot(x, y, data=None, ..., joint_kws):
    g = sns.JointGrid(...)
    g.plot_joint(..., **joint_kws)

So when you call g.plot_joint yourself, just pass it ordinary keyword arguments:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

testdata = pd.DataFrame(
    np.array([[100, 1, 3], [5, 2, 6], [25, 3, -4]]), 
    index=['A', 'B', 'C'], 
    columns=['counts', 'X', 'Y']
)
counts = testdata['counts'].values

g = sns.JointGrid(x='X', y='Y', data=testdata)
g.plot_marginals(sns.distplot)
g.plot_joint(sns.kdeplot, weights=counts)

I am not sure whether the result looks right, but it did not raise an error, so that is worth something.

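【Comments】:

This sounds reasonable, but the plot is still not weighted. Take point A (x=1, y=3). Its count is 100. The sum of all counts is 130 (100 + 5 + 25). So A's weight should be 100/130, i.e. 0.77, and it should certainly produce the largest peak in the whole distribution.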
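The arithmetic in this comment can be checked without seaborn. The sketch below normalises the counts and evaluates a weighted sum of Gaussian kernels on a grid. The helper name weighted_kde, the fixed bandwidth of 1.0, and the grid limits are assumptions chosen for illustration; this is not how seaborn picks its bandwidth.

```python
import numpy as np

# Data from the question: the counts act as weights for points A, B, C.
points = np.array([[1.0, 3.0], [2.0, 6.0], [3.0, -4.0]])
counts = np.array([100.0, 5.0, 25.0])

# Normalise the counts so that the weights sum to 1 (A gets 100/130).
weights = counts / counts.sum()

def weighted_kde(grid_xy, points, weights, bandwidth=1.0):
    """Weighted sum of isotropic 2-D Gaussian kernels.

    Hypothetical helper; the bandwidth is an assumed, fixed value.
    """
    diff = grid_xy[:, None, :] - points[None, :, :]   # (n_grid, n_points, 2)
    sq_dist = (diff ** 2).sum(axis=-1)                # squared distances
    kernels = np.exp(-0.5 * sq_dist / bandwidth**2) / (2 * np.pi * bandwidth**2)
    return kernels @ weights                          # weighted mixture

# Evaluate the density on a regular grid covering all three points.
xs = np.linspace(-2, 6, 81)
ys = np.linspace(-8, 10, 181)
gx, gy = np.meshgrid(xs, ys)
grid_xy = np.column_stack([gx.ravel(), gy.ravel()])
density = weighted_kde(grid_xy, points, weights).reshape(gx.shape)

# Grid location of the highest density value.
peak = grid_xy[density.argmax()]
print("normalised weights:", weights.round(2))
print("peak location:", peak)
```

Under these assumptions the highest density lies at (or right next to) point A, which matches the expectation stated in the comment.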