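【Posted】: 2021-08-11 16:52:50
【Question】:
My dataset is much larger than this (about 3,000 rows by 50 columns); I'm only including a sample here. Each row of the DataFrame holds one record. Basically, I want to analyze the attributes associated with each label, for example whether Level 3 tends to have a higher annual income, or which factors contribute to a higher level. Which statistical functions would be suitable for this? I'm trying to use sklearn.preprocessing.OrdinalEncoder() to encode each categorical variable, and then something like stats.chi2.ppf() or a correlation matrix. I'm not sure whether they apply to my case.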
import pandas as pd

# Small sample of the real data; every column except "Level" is categorical.
example = pd.DataFrame(
    {
        "Degree": ['Graduate', 'Undergraduate', 'Undergraduate', 'Graduate', 'Undergraduate', 'Doctorate'],
        "Age": ['Age 26-35', 'Age 18-25', 'Age 18-25', 'Age 18-25', 'Age 26-35', 'Older than 35'],
        "Location": ['VA', 'DC', 'DC', 'CA', 'DC', 'MA'],
        "Gender": ['male', 'male', 'female', 'male', 'male', 'female'],
        "Annual Income": ['$5,001 - $10,000', '<$5,000', '$15,001 - $25,000', '>$50,000', '<$5,000', '$15,001 - $25,000'],
        "Level": [0, 1, 2, 0, 0, 3],
    }
)
          Degree            Age  Location  Gender      Annual Income  Level
0       Graduate      Age 26-35        VA    male   $5,001 - $10,000      0
1  Undergraduate      Age 18-25        DC    male            <$5,000      1
2  Undergraduate      Age 18-25        DC  female  $15,001 - $25,000      2
3       Graduate      Age 18-25        CA    male           >$50,000      0
4  Undergraduate      Age 26-35        DC    male            <$5,000      0
5      Doctorate  Older than 35        MA  female  $15,001 - $25,000      3
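To make the question concrete, here is a minimal sketch of how the pieces mentioned above could fit together. It is an illustration under assumptions, not a settled method: it runs a chi-square test of independence between one categorical column and Level using scipy.stats.chi2_contingency on a pd.crosstab table, compares the statistic against a critical value from stats.chi2.ppf(), and then ordinal-encodes Annual Income to compute a Spearman rank correlation with Level. The income ordering in income_order and the 0.05 significance level are assumptions; the ordering lists only the brackets that appear in the sample.

```python
import pandas as pd
from scipy import stats
from sklearn.preprocessing import OrdinalEncoder

example = pd.DataFrame(
    {
        "Degree": ['Graduate', 'Undergraduate', 'Undergraduate', 'Graduate', 'Undergraduate', 'Doctorate'],
        "Annual Income": ['$5,001 - $10,000', '<$5,000', '$15,001 - $25,000', '>$50,000', '<$5,000', '$15,001 - $25,000'],
        "Level": [0, 1, 2, 0, 0, 3],
    }
)

# 1) Chi-square test of independence: is Degree associated with Level?
contingency = pd.crosstab(example["Degree"], example["Level"])
chi2_stat, p_value, dof, expected = stats.chi2_contingency(contingency)

# Critical value at an assumed 5% significance level; the null hypothesis of
# independence is rejected when chi2_stat exceeds it (equivalently p_value < 0.05).
critical_value = stats.chi2.ppf(0.95, dof)
reject_independence = chi2_stat > critical_value

# 2) Ordinal encoding with an explicit order (assumed; only brackets seen in
# the sample are listed), followed by a rank correlation with Level.
income_order = ['<$5,000', '$5,001 - $10,000', '$15,001 - $25,000', '>$50,000']
encoder = OrdinalEncoder(categories=[income_order])
income_rank = encoder.fit_transform(example[["Annual Income"]]).ravel()
rho, rho_p_value = stats.spearmanr(income_rank, example["Level"])

print(contingency)
print(f"chi2={chi2_stat:.3f}, dof={dof}, p={p_value:.3f}, critical={critical_value:.3f}")
print(f"Spearman rho={rho:.3f}, p={rho_p_value:.3f}")
```

With only six rows the expected cell counts are far below the usual rule of thumb of 5, so the chi-square result on this sample is not meaningful; the sketch is only meant to show the shape of the computation for the full dataset. Passing an explicit order to OrdinalEncoder matters because its default ordering is alphabetical, which would not reflect the income brackets.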
Any ideas and comments are welcome.
【Discussion】:
Tags: python pandas dataframe data-analysis stat