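For CIs across categories, building on what @omer sagi suggested: say we have a pandas DataFrame with one column that contains categories (like category 1, category 2, and category 3) and another column with continuous data (like some kind of rating). Here is a function that uses `DataFrame.groupby()` and scipy.stats to plot the difference in means across groups with confidence intervals: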
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.stats as st

def plot_diff_in_means(data: pd.DataFrame, col1: str, col2: str):
    """
    Given data, plots each group's mean with its 95% confidence interval,
    so that differences in means across groups can be compared.
    col1: categorical data with groups
    col2: continuous data for the means
    """
    grouped = data.groupby(col1)[col2]
    n = grouped.count()
    # n is a pd.Series with the sample size for each category
    cat = list(n.index)
    # cat has the names of the categories, like 'category 1', 'category 2'
    mean = grouped.mean()
    # the average value of col2 within each category
    std = grouped.std()
    se = std / np.sqrt(n)
    # sample standard deviation (ddof=1) and standard error
    lower, upper = st.t.interval(0.95, df=n - 1, loc=mean, scale=se)
    # lower and upper bounds from scipy; the confidence level is passed
    # positionally because the keyword was renamed from `alpha` to
    # `confidence` in newer SciPy releases
    for y, (lo, m, hi) in enumerate(zip(lower, mean, upper)):
        plt.plot((lo, m, hi), (y, y, y), 'b.-')
        # for 'b.-': 'b' means 'blue', '.' means dot, '-' means solid line
    plt.yticks(range(len(n)), cat)
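For reference, the bounds that `st.t.interval` returns for each group correspond to the usual t-based interval for a mean. The notation is mine, not the original function's: $\bar{x}$, $s$ and $n$ are the group's mean, sample standard deviation and sample size, and $t_{0.975,\,n-1}$ is the 97.5th percentile of Student's t with $n-1$ degrees of freedom.

```latex
\bar{x} \;\pm\; t_{0.975,\, n-1} \cdot \frac{s}{\sqrt{n}}
```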
Given some hypothetical data:
cat = ['a'] * 10 + ['b'] * 10 + ['c'] * 10
a = np.linspace(0.1, 5.0, 10)
b = np.linspace(0.5, 7.0, 10)
c = np.linspace(7.5, 20.0, 10)
rating = np.concatenate([a, b, c])
dat_dict = dict()
dat_dict['cat'] = cat
dat_dict['rating'] = rating
test_dat = pd.DataFrame(dat_dict)
which looks like this (with more rows, of course):
| cat | rating  |
|-----|---------|
| a   | 0.10000 |
| a   | 0.64444 |
| b   | 0.50000 |
| b   | 1.22222 |
| c   | 7.50000 |
| c   | 8.88889 |
We can use the function to plot the difference in means with CIs:
plot_diff_in_means(data = test_dat, col1 = 'cat', col2 = 'rating')
This gives us the following plot:
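If you want the numbers behind the plot, here is a minimal sketch that tabulates the same per-group intervals. The helper name `ci_table` and its `confidence` parameter are my own, purely for illustration. It uses `st.t.ppf` rather than `st.t.interval`, which should give the same bounds.

```python
import numpy as np
import pandas as pd
import scipy.stats as st


def ci_table(data: pd.DataFrame, col1: str, col2: str,
             confidence: float = 0.95) -> pd.DataFrame:
    """Hypothetical helper: per-group n, mean, std, se and t-based CI bounds."""
    summary = data.groupby(col1)[col2].agg(["count", "mean", "std"])
    n = summary["count"].to_numpy()
    # standard error of the mean, using the sample standard deviation (ddof=1)
    se = summary["std"].to_numpy() / np.sqrt(n)
    # two-sided critical value of Student's t with n - 1 degrees of freedom
    t_crit = st.t.ppf((1 + confidence) / 2, df=n - 1)
    summary["se"] = se
    summary["lower"] = summary["mean"] - t_crit * se
    summary["upper"] = summary["mean"] + t_crit * se
    return summary.rename(columns={"count": "n"})


# the same hypothetical data as above, rebuilt so the snippet runs on its own
test_dat = pd.DataFrame({
    "cat": ["a"] * 10 + ["b"] * 10 + ["c"] * 10,
    "rating": np.concatenate([
        np.linspace(0.1, 5.0, 10),
        np.linspace(0.5, 7.0, 10),
        np.linspace(7.5, 20.0, 10),
    ]),
})
print(ci_table(test_dat, "cat", "rating"))
```

Keep in mind that whether two intervals overlap is only a rough visual guide; it is not a formal test of the difference between two group means.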