Is there a way to analyze the impact/correlation of categorical variables to the label in Python? Answers

【Question title】: Is there a way to analyze the impact/correlation of categorical variables to the label in python?
【Posted】: 2021-08-11 16:52:50
【Question description】:

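My dataset is much larger than this (roughly 3000 rows × 50 columns); I am only showing a sample here. It is a dataframe with one record per row. Basically, I want to analyze the attributes associated with each label: for example, Level 3 might have a higher annual income, or which factors contribute to a higher level. Which statistical functions would be suitable for this? I am trying to use sklearn.preprocessing.OrdinalEncoder() to encode each categorical variable, and then something like stats.chi2.ppf() or a correlation matrix. I am not sure whether they apply to my case.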

import pandas as pd

example = pd.DataFrame(
    {
        "Degree": ['Graduate', 'Undergraduate', 'Undergraduate', 'Graduate', 'Undergraduate', 'Doctorate'],
        "Age": ['Age 26-35','Age 18-25','Age 18-25','Age 18-25', 'Age 26-35', 'Older than 35'],
        "Location": ['VA','DC','DC','CA','DC','MA'],
        "Gender": ['male','male','female','male','male','female'],
        "Annual Income": ['$5,001 - $10,000','<$5,000','$15,001 - $25,000','>$50,000','<$5,000','$15,001 - $25,000'],
        "Level": [0,1,2,0,0,3],
    }
)

          Degree            Age  Location  Gender      Annual Income  Level
0       Graduate      Age 26-35        VA    male   $5,001 - $10,000      0
1  Undergraduate      Age 18-25        DC    male            <$5,000      1
2  Undergraduate      Age 18-25        DC  female  $15,001 - $25,000      2
3       Graduate      Age 18-25        CA    male           >$50,000      0
4  Undergraduate      Age 26-35        DC    male            <$5,000      0
5      Doctorate  Older than 35        MA  female  $15,001 - $25,000      3

Any ideas and comments are welcome.

【Discussion】:

Tags: python pandas dataframe data-analysis stat

【Solution 1】:

import pandas as pd
example = pd.DataFrame(
    {
        "Degree": ['Graduate', 'Undergraduate', 'Undergraduate', 'Graduate', 'Undergraduate', 'Doctorate'],
        "Age": ['Age 26-35','Age 18-25','Age 18-25','Age 18-25', 'Age 26-35', 'Older than 35'],
        "Location": ['VA','DC','DC','CA','DC','MA'],
        "Gender": ['male','male','female','male','male','female'],
        "Annual Income": ['$5,001 - $10,000','<$5,000','$15,001 - $25,000','>$50,000','<$5,000','$15,001 - $25,000'],
        "Level": [0,1,2,0,0,3],
    }
)
# Collect and print the distinct values of every column
unique_items = []
for key in example:
    unique_items.append(example[key].unique())
for item in unique_items:
    print(item)
# Decide how to order the unique values of each column, for example:
#   Degree: by amount of education
#   Annual Income: ascending
#   Level: ascending, etc.
# Then use each value's position in that order as its numeric code,
# so you can start doing math and plotting.
# What "analyze" means to me:
# 1. Pick any two columns and scatterplot them to see whether there is a relationship.
# 2. For any one column, plot it against every other column and arrange the
#    results as a collage of thumbnail scatterplots.
# 3. Measure correlation (or other mathematical properties) and group the pairs accordingly.
#    Think of it like classifying galaxies: straight lines sloping upward are one
#    type that would rank high on the list, pure randomness is another type,
#    and some plots might look like butterflies.
# 4. Sort by correlation and grouping to show a top-100 list of the strongest relationships.
# good luck ;)
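
The ordering step described in the comments above can be sketched with ordered categoricals. This is only an illustration under assumptions: the category orderings below cover just the values present in the sample, and the use of Spearman rank correlation (rather than another measure) is a choice made here, not something the answer specifies.

```python
import pandas as pd

example = pd.DataFrame(
    {
        "Degree": ["Graduate", "Undergraduate", "Undergraduate", "Graduate", "Undergraduate", "Doctorate"],
        "Annual Income": ["$5,001 - $10,000", "<$5,000", "$15,001 - $25,000", ">$50,000", "<$5,000", "$15,001 - $25,000"],
        "Level": [0, 1, 2, 0, 0, 3],
    }
)

# Assumed orderings (lowest first); extend them to match the real data.
degree_order = ["Undergraduate", "Graduate", "Doctorate"]
income_order = ["<$5,000", "$5,001 - $10,000", "$15,001 - $25,000", ">$50,000"]

# .codes gives each value's position in the declared order (0, 1, 2, ...)
encoded = pd.DataFrame(
    {
        "Degree": pd.Categorical(example["Degree"], categories=degree_order, ordered=True).codes,
        "Annual Income": pd.Categorical(example["Annual Income"], categories=income_order, ordered=True).codes,
        "Level": example["Level"],
    }
)

# Rank correlation only relies on the order of the codes, not their spacing
rank_corr = encoded.corr(method="spearman")
print(encoded)
print(rank_corr["Level"].sort_values(ascending=False))
```

A value missing from the declared order is encoded as -1, so it is worth checking for negative codes before reading anything into the correlations.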

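【Discussion】:

【Solution 2】:

For correlation, I suggest using seaborn.

A heatmap shows the relationship between two variables, one plotted on each axis. By observing how the cell colors change across each axis, you can see whether there are any patterns in the values of one or both variables.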

https://chartio.com/learn/charts/heatmap-complete-guide/

import seaborn as sns
sns.heatmap(example[['Level']])

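A heatmap needs numeric values, though, so Annual Income and Age have to be converted to integers first.

Some approximation is needed (here the first number of each income and age label is taken rather than the whole range; the mean of the range could also be used):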

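# Approximation: keep only the first run of digits in each label
# ('$15,001 - $25,000' -> 15, i.e. thousands; 'Age 26-35' -> 26)
example['Income'] = example['Annual Income'].str.extract(r'(\d+)', expand=False)
example['Age'] = example['Age'].str.extract(r'(\d+)', expand=False)
example['Income'] = pd.to_numeric(example['Income'])
example['Age'] = pd.to_numeric(example['Age'])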
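
The answer notes that the mean of a range could be used instead of its first number. A minimal sketch of that variant follows; the helper name `range_midpoint` is hypothetical, and it assumes every label contains plain numbers with optional thousands separators.

```python
import re

import pandas as pd


def range_midpoint(label: str) -> float:
    """Average every number found in a label such as '$5,001 - $10,000'."""
    # Match digit groups that may contain thousands separators, e.g. '10,000'
    numbers = [float(n.replace(",", "")) for n in re.findall(r"\d[\d,]*", label)]
    if not numbers:
        return float("nan")  # label without any number
    return sum(numbers) / len(numbers)


income = pd.Series(["$5,001 - $10,000", "<$5,000", "$15,001 - $25,000", ">$50,000"])
age = pd.Series(["Age 26-35", "Age 18-25", "Older than 35"])

income_mid = income.apply(range_midpoint)
age_mid = age.apply(range_midpoint)
print(income_mid.tolist())
print(age_mid.tolist())
```

Open-ended labels such as '<$5,000' or 'Older than 35' contain a single number, so this sketch simply returns that bound for them.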


import seaborn as sns
sns.heatmap(example[['Level', 'Income', 'Age']])

Once annual income and age are integers, the .corr() function is also an option:

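# Restrict to the numeric columns; recent pandas versions raise an error on string columns
example[['Level', 'Income', 'Age']].corr()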
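
Columns such as Location and Gender have no natural order, so .corr() does not apply to them directly. One possible approach, sketched here under assumptions, is to build a contingency table with pd.crosstab and summarize it with Cramér's V, which is derived from the chi-square statistic the question mentions. The helper `cramers_v` is hypothetical (it is not a pandas function), and with only six sample rows the resulting number only demonstrates the mechanics.

```python
import numpy as np
import pandas as pd

example = pd.DataFrame(
    {
        "Gender": ["male", "male", "female", "male", "male", "female"],
        "Level": [0, 1, 2, 0, 0, 3],
    }
)


def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Cramér's V between two categorical series (0 = no association, 1 = perfect)."""
    observed = pd.crosstab(x, y).to_numpy(dtype=float)
    n = observed.sum()
    # Expected counts if the two variables were independent
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    min_dim = min(observed.shape) - 1
    if min_dim == 0:
        return float("nan")  # one variable has a single category
    return float(np.sqrt(chi2 / (n * min_dim)))


# Counts of each Level within each Gender
print(pd.crosstab(example["Gender"], example["Level"]))
print(cramers_v(example["Gender"], example["Level"]))
```

If a p-value is wanted as well, scipy.stats.chi2_contingency accepts the same contingency table.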

【Discussion】: