Chi-square test for multiple features in Pandas: answers

[Question title]: Chi-square test for multiple features in Pandas
[Posted]: 2019-12-04 04:22:18
[Question description]:

I have a sample data frame like this:

import pandas as pd

m_list = ['male','male','female','female']
whiskey_list = ['alcohol','no_alcohol','alcohol','no_alcohol']
f1 = [273,62,60,7]
f2 = [276,61,57,8]
l = [m_list,whiskey_list,f1,f2]
test_df = pd.DataFrame(l).T
test_df.columns = ['gender','drink_category','f1','f2']


    gender  drink_category  f1  f2
0   male    alcohol         273 276
1   male    no_alcohol      62  61
2   female  alcohol         60  57
3   female  no_alcohol      7   8

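I want to use a chi-square test to check whether there is any relationship between the two categories gender and drink_category. To do that, I want to build a contingency table for each feature in the range f1, f2, ..., fn, and then compute the p-value for each feature.

The example here has only two features, f1 and f2, but in practice I have many more.

When I process f1, my contingency table looks like this: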

gender   alcohol   no_alcohol
male      273        62
female    60         7

Then I would compute the p-value for f1.

When I process f2, my contingency table looks like this:

gender   alcohol   no_alcohol
male      276        61
female    57         8

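How can I compute this using the pandas and scipy libraries?

In the end, I want a data frame that holds the p-value for each feature, f1 through fn.

[Question comments]:

Tags: python pandas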

[Solution 1]:

We can use chi2_contingency from scipy.stats to get the p-value of a contingency table built with the pandas pivot function.

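import pandas as pd
from scipy.stats import chi2_contingency

test_df = pd.DataFrame({'gender': ['male','male','female','female'],
                        'drink_category': ['alcohol','no_alcohol','alcohol','no_alcohol'],
                        'f1': [273,62,60,7],
                        'f2': [276,61,57,8]})

# One p-value per feature column (f1, f2, ..., fn)
p = pd.Series(dtype=float)
for feature in [c for c in test_df.columns if c.startswith('f')]:
    # Build the gender x drink_category contingency table for this feature.
    # pandas 2.0+ requires keyword arguments for DataFrame.pivot.
    table = test_df.pivot(index='gender', columns='drink_category', values=feature)
    _, p[feature], _, _ = chi2_contingency(table)

print(p)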

Output:

f1    0.155699
f2    0.339842
dtype: float64
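
Since the question asks for a data frame at the end, the loop above could be wrapped in a small helper that returns one row per feature. This is only a sketch under a few assumptions: `chi2_p_values` and its parameter names are hypothetical (they are not part of pandas or scipy), and the `pd.to_numeric` step is there because a frame built with `pd.DataFrame(l).T`, as in the question, ends up with object-dtype columns. Note also that `chi2_contingency` applies Yates' continuity correction by default when the table has one degree of freedom (as these 2x2 tables do); passing `correction=False` turns it off.

```python
import pandas as pd
from scipy.stats import chi2_contingency


def chi2_p_values(df, feature_cols, correction=True):
    """Return a DataFrame with one chi-square result row per feature.

    Hypothetical helper for illustration; assumes the columns
    'gender' and 'drink_category' from the question.
    """
    rows = []
    for feature in feature_cols:
        # Coerce counts to numbers in case the column has object dtype.
        counts = df.assign(**{feature: pd.to_numeric(df[feature])})
        # gender x drink_category contingency table for this feature
        table = counts.pivot(index='gender', columns='drink_category',
                             values=feature)
        chi2, p_value, dof, _ = chi2_contingency(table, correction=correction)
        rows.append({'feature': feature, 'chi2': chi2,
                     'dof': dof, 'p_value': p_value})
    return pd.DataFrame(rows).set_index('feature')


test_df = pd.DataFrame({'gender': ['male', 'male', 'female', 'female'],
                        'drink_category': ['alcohol', 'no_alcohol',
                                           'alcohol', 'no_alcohol'],
                        'f1': [273, 62, 60, 7],
                        'f2': [276, 61, 57, 8]})

feature_cols = [c for c in test_df.columns if c.startswith('f')]
result = chi2_p_values(test_df, feature_cols)                     # Yates-corrected (default)
uncorrected = chi2_p_values(test_df, feature_cols, correction=False)
print(result)
```

With the default settings, the `p_value` column should match the Series printed above; the uncorrected variant gives a larger test statistic for these tables, so which one to report is a choice worth stating explicitly.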

[Comments]: