【Question title】: Is there a way to analyze the impact/correlation of categorical variables to the label in python?
【Posted】: 2021-08-11 16:52:50
【Question】:

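My dataset is much larger than this (about 3000 rows * 50 columns); I am only putting a sample here. It is a dataframe with one record per row. Basically, I want to analyze the attributes of each label, for example whether Level 3 tends to have a higher annual income, or what contributes to a higher level. Which statistical functions would be suitable for this? I am trying to use sklearn.preprocessing.OrdinalEncoder() to encode each categorical variable, and then something like stats.chi2.ppf() or a correlation matrix. I am not sure whether they apply to my case.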

import pandas as pd

example = pd.DataFrame(
    {
        "Degree": ['Graduate', 'Undergraduate', 'Undergraduate', 'Graduate', 'Undergraduate', 'Doctorate'],
        "Age": ['Age 26-35','Age 18-25','Age 18-25','Age 18-25', 'Age 26-35', 'Older than 35'],
        "Location": ['VA','DC','DC','CA','DC','MA'],
        "Gender": ['male','male','female','male','male','female'],
        "Annual Income": ['$5,001 - $10,000','<$5,000','$15,001 - $25,000','>$50,000','<$5,000','$15,001 - $25,000'],
        "Level": [0,1,2,0,0,3],
    }
)

          Degree            Age Location  Gender      Annual Income  Level
0       Graduate      Age 26-35       VA    male   $5,001 - $10,000      0
1  Undergraduate      Age 18-25       DC    male            <$5,000      1
2  Undergraduate      Age 18-25       DC  female  $15,001 - $25,000      2
3       Graduate      Age 18-25       CA    male           >$50,000      0
4  Undergraduate      Age 26-35       DC    male            <$5,000      0
5      Doctorate  Older than 35       MA  female  $15,001 - $25,000      3

Any thoughts and comments are welcome.

【Comments】:

    Tags: python pandas dataframe data-analysis stat


    【Solution 1】:
    import pandas as pd
    example = pd.DataFrame(
        {
            "Degree": ['Graduate', 'Undergraduate', 'Undergraduate', 'Graduate', 'Undergraduate', 'Doctorate'],
            "Age": ['Age 26-35','Age 18-25','Age 18-25','Age 18-25', 'Age 26-35', 'Older than 35'],
            "Location": ['VA','DC','DC','CA','DC','MA'],
            "Gender": ['male','male','female','male','male','female'],
            "Annual Income": ['$5,001 - $10,000','<$5,000','$15,001 - $25,000','>$50,000','<$5,000','$15,001 - $25,000'],
            "Level": [0,1,2,0,0,3],
        }
    )
    unique_items = []
    for key in example:
        unique_items.append(example[key].unique())
    for item in unique_items:
        print(item)
    # Decide how to order the unique values of each column, for example:
    #   Degree: by amount of education
    #   Annual Income: ascending
    #   Level: ascending, etc.
    # Then use each value's position in that order as its numeric value,
    # and you can start doing math and plots.
    # What "analyze" means to me:
    # pick any two columns and scatterplot them to see if there is a relationship,
    # then for any one column, plot it against every other column and make a
    # collage of thumbnail scatterplots,
    # then measure correlation or other properties to put the pairs into groups.
    # Think of this like categorizing galaxies:
    # a straight line sloping up is one type that would be high on the list,
    # randomness is another type, and some might look like butterflies.
    # Finally, sort by correlation and grouping to list the strongest pairs (e.g. a top 100).
    # good luck ;)
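
The "order the categories, then use the position as the value" step could be sketched roughly as follows. The orderings in `degree_order` and `income_order` are my own assumptions about what is sensible for the sample, not something given in the question, and Spearman rank correlation is swapped in because the encoded values are ranks rather than measurements.

```python
import pandas as pd

example = pd.DataFrame(
    {
        "Degree": ['Graduate', 'Undergraduate', 'Undergraduate', 'Graduate', 'Undergraduate', 'Doctorate'],
        "Annual Income": ['$5,001 - $10,000', '<$5,000', '$15,001 - $25,000', '>$50,000', '<$5,000', '$15,001 - $25,000'],
        "Level": [0, 1, 2, 0, 0, 3],
    }
)

# Assumed orderings (hypothetical): lowest category first.
degree_order = ['Undergraduate', 'Graduate', 'Doctorate']
income_order = ['<$5,000', '$5,001 - $10,000', '$15,001 - $25,000', '>$50,000']

# .codes gives each value's position in the declared order (0, 1, 2, ...).
encoded = pd.DataFrame(
    {
        "Degree": pd.Categorical(example["Degree"], categories=degree_order, ordered=True).codes,
        "Annual Income": pd.Categorical(example["Annual Income"], categories=income_order, ordered=True).codes,
        "Level": example["Level"],
    }
)

# Rank correlation of every encoded column against every other one.
rank_corr = encoded.corr(method="spearman")

# Columns sorted by how strongly they move together with Level.
print(rank_corr["Level"].drop("Level").sort_values(ascending=False))
```

This only makes sense for columns that have a natural order; something like Location has none, so a rank-based number for it would be arbitrary.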
    

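    【Comments】:

      【Solution 2】:

      For correlation, I suggest using seaborn.

      A heatmap shows the relationship between two variables, one plotted on each axis. By observing how the cell colors change along each axis, you can see whether there is any pattern in the values of one or both variables.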

      https://chartio.com/learn/charts/heatmap-complete-guide/

      import seaborn as sns
      sns.heatmap(example[['Level']])
      

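      However, a heatmap needs numeric values, so Annual Income and Age can be converted to integers.

      Some approximation is involved (this takes the first number from each income and age range rather than the range itself; the mean of the range could also be used, via .mean()):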

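      # Take the first number in each range as a rough numeric stand-in.
      # Commas are removed first so that '$5,001' yields 5001 rather than 5.
      example['Income'] = example['Annual Income'].str.replace(',', '', regex=False).str.extract(r'(\d+)', expand=False)
      example['Age'] = example['Age'].str.extract(r'(\d+)', expand=False)
      example['Income'] = pd.to_numeric(example['Income'])
      example['Age'] = pd.to_numeric(example['Age'])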
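
The .mean() variant mentioned above might look something like this: pull every number out of each range and average them per row. This is only a sketch on a standalone `income` Series (a name I made up for the illustration). Open-ended ranges such as '<$5,000' contain a single number, so they simply keep that bound.

```python
import pandas as pd

income = pd.Series(['$5,001 - $10,000', '<$5,000', '$15,001 - $25,000', '>$50,000'])

# extractall returns one row per match, indexed by (original row, match number).
numbers = (
    income.str.replace(',', '', regex=False)
          .str.extractall(r'(\d+)')[0]
          .astype(int)
)

# Average the matches that belong to the same original row.
income_mid = numbers.groupby(level=0).mean()
print(income_mid)
```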
      
      
      import seaborn as sns
      sns.heatmap(example[['Level', 'Income', 'Age']])
      

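      Once Annual Income and Age are integers, the .corr() function is also an option:

      # corr() only applies to numeric columns, so select them explicitly
      example[['Level', 'Income', 'Age']].corr()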
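
For columns with no natural order (Gender, Location), a correlation coefficient is not really meaningful. Since the question mentions chi-square, one possible approach is a chi-square test of independence on a contingency table, with Cramér's V as a 0-to-1 measure of association. A minimal sketch, assuming scipy is installed and using scipy.stats.chi2_contingency rather than stats.chi2.ppf (which only returns critical values of the distribution):

```python
import pandas as pd
from scipy.stats import chi2_contingency

example = pd.DataFrame(
    {
        "Gender": ['male', 'male', 'female', 'male', 'male', 'female'],
        "Level": [0, 1, 2, 0, 0, 3],
    }
)

# Contingency table: how often each Gender occurs together with each Level.
table = pd.crosstab(example["Gender"], example["Level"])

# Chi-square test of independence between the two columns.
chi2, p_value, dof, expected = chi2_contingency(table)

# Cramér's V rescales the statistic to the range 0 (no association) to 1.
n = table.to_numpy().sum()
cramers_v = (chi2 / (n * (min(table.shape) - 1))) ** 0.5

print(table)
print(f"chi2={chi2:.3f}, p={p_value:.3f}, dof={dof}, Cramér's V={cramers_v:.3f}")
```

With only six rows the expected counts are far too small for the p-value to be trusted; on the full dataset of about 3000 rows the same loop over each categorical column against Level should be more informative.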
      

      【Comments】:
