[Tianchi] Newcomer Competition: Let's Mine Happiness Together!

step_1: Define the goal

　　Using questionnaire survey data, select several groups of variables to predict each respondent's own rating of their happiness.

step_2: Get the data

　　Link:

　　　　https://tianchi.aliyun.com/competition/entrance/231702/information

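　　Download:

　　　　train_set: happiness_train_complete.csv

　　　　test_set: happiness_test_complete.csv

　　　　index: lists the questionnaire item behind each variable and the meaning of each variable's values

　　　　survey: the original questionnaire the data comes from, provided as a supplement to help understand the background of the questions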

step_3: Clean and prepare train_set

　　Use matplotlib.pyplot to draw a scatter plot of id against each of the other columns in turn.
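The plotting step can be sketched roughly as follows. This is only an illustration: the helper name `scatter_against_id`, its parameters, and the toy DataFrame are assumptions made up for this example, not the author's code or the competition data.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def scatter_against_id(df, id_col="id", ncols=3):
    """Draw one scatter plot of `id_col` versus every other numeric column."""
    cols = [c for c in df.select_dtypes(include="number").columns if c != id_col]
    nrows = max(1, -(-len(cols) // ncols))  # ceiling division, at least one row
    fig, axes = plt.subplots(nrows, ncols, figsize=(4 * ncols, 3 * nrows), squeeze=False)
    for ax, col in zip(axes.ravel(), cols):
        ax.scatter(df[id_col], df[col], s=4)
        ax.set_title(col)
    # Hide the unused panels in the last row
    for ax in axes.ravel()[len(cols):]:
        ax.set_visible(False)
    fig.tight_layout()
    return fig, cols


# Toy stand-in for train_data (made-up values)
toy = pd.DataFrame({
    "id": np.arange(1, 9),
    "happiness": [4, 5, -8, 3, 4, 2, 5, 1],
    "income": [20000, -1, 35000, 8000, -2, 50000, 12000, 9000],
})
fig, plotted = scatter_against_id(toy)
```

With the real data one would pass `train_data` instead of `toy`; invalid codes such as -8 then show up as isolated points below the valid range of each panel.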


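　　Clean the data based on what the plots show:

happiness is the sample label (the ground truth for the prediction model). According to the questionnaire its only valid values are 1, 2, 3, 4 and 5, but the plots show some rows with -8; these noisy rows should be removed.
Drop the id, survey_time, edu_other, join_party, property_other and invest_other columns.
In the remaining columns, set every value below 0 and every missing value to -8.
Standardize the features (zero mean, unit variance).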
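The last two items can be illustrated on a small array. The values below are made up for the example, and the variable names are not from the original code; it is a sketch of the idea rather than the exact pipeline.

```python
import numpy as np

# Toy feature matrix (made-up values): negatives and NaN stand for invalid or missing answers
X_toy = np.array([[1.0, 200.0],
                  [-3.0, np.nan],
                  [3.0, 600.0],
                  [np.nan, -1.0]])

# Map every negative value and every NaN to the single code -8
X_clean = np.where(np.isnan(X_toy) | (X_toy < 0), -8.0, X_toy)

# Standardize each column: z = (x - mean) / std, using the population std (ddof=0)
mu = X_clean.mean(axis=0)
sigma = X_clean.std(axis=0)
X_scaled = (X_clean - mu) / sigma
```

After this, each column of `X_scaled` has mean 0 and standard deviation 1, which is the same convention `StandardScaler` follows.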

# Run in a Jupyter notebook

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Load the training and test sets
# The files must be read with encoding='gbk'; utf-8 does not work
train_data = pd.read_csv('happiness_train_complete.csv', encoding='gbk')
test_data = pd.read_csv('happiness_test_complete.csv', encoding='gbk')

# Training set: 8000 samples, 140 columns (including the happiness label)
# Test set: 2968 samples, 139 columns (no label)
print(train_data.shape)
print(test_data.shape)

# Drop the rows whose happiness label is -8
train_data = train_data[train_data.happiness>0]
print(train_data.shape)

# Training labels
y = train_data.happiness

ind1 = ['id','happiness','survey_time','edu_other','join_party','property_other','invest_other']
# Drop these columns from the training samples (happiness is the label, not a feature)
X = train_data.drop(ind1, axis=1)

# Drop the corresponding columns from the test set (it has no happiness column)
ind2 = ['id','survey_time','edu_other','join_party','property_other','invest_other']
X_test_data = test_data.drop(ind2, axis=1)

# Convert the DataFrame objects to np.array
y = np.array(y, dtype=int)
X = np.array(X, dtype=float)
X_test_data = np.array(X_test_data, dtype=float)

# Set every value below 0 to -8
X[X<0]=-8
X_test_data[X_test_data<0]=-8

from sklearn.impute import SimpleImputer

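# Set missing feature values to -8
# strategy='constant' is required; with the default strategy ('mean') fill_value is ignored
X = SimpleImputer(strategy='constant', fill_value=-8).fit_transform(X)
X_test_data = SimpleImputer(strategy='constant', fill_value=-8).fit_transform(X_test_data)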

from sklearn.model_selection import train_test_split

# The competition test set has no labels, so split the training set to get a held-out set
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state=666)

# Standardization (zero mean, unit variance)
from sklearn.preprocessing import StandardScaler

std = StandardScaler().fit(X_train)
X_train_std = std.transform(X_train)
X_test_std = std.transform(X_test)
# Fit a second scaler on the full training set and use it to scale the competition test set
std_1 = StandardScaler().fit(X)
X_std = std_1.transform(X)
X_test_data = std_1.transform(X_test_data)
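One detail is worth checking when filling missing values with a constant: `SimpleImputer` only uses `fill_value` when `strategy='constant'`. With the default strategy it fills in column means instead. The toy array below is made up purely to show the difference.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy matrix with one missing value per column (made-up values)
X_toy = np.array([[1.0, np.nan],
                  [3.0, 4.0],
                  [np.nan, 8.0]])

# Default strategy is "mean": fill_value is ignored and NaNs become column means
mean_filled = SimpleImputer(fill_value=-8).fit_transform(X_toy)

# strategy="constant" is what makes fill_value take effect
const_filled = SimpleImputer(strategy="constant", fill_value=-8).fit_transform(X_toy)

print(mean_filled)
print(const_filled)
```

Here the first call fills the gaps with 2 and 6 (the column means), while the second fills them with -8.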
