Testing Your Attraction Score with Machine Learning
Downloading the Dataset
[1]:
!wget https://pai-public-data.oss-cn-beijing.aliyuncs.com/speed_dating/Speed%20Dating%20Data%20Key.doc
--2020-08-12 11:06:06--  https://pai-public-data.oss-cn-beijing.aliyuncs.com/speed_dating/Speed%20Dating%20Data%20Key.doc
Resolving pai-public-data.oss-cn-beijing.aliyuncs.com (pai-public-data.oss-cn-beijing.aliyuncs.com)... 47.95.85.22
Connecting to pai-public-data.oss-cn-beijing.aliyuncs.com (pai-public-data.oss-cn-beijing.aliyuncs.com)|47.95.85.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 161792 (158K) [application/msword]
Saving to: 'Speed Dating Data Key.doc'

100%[======================================>] 161,792     --.-K/s   in 0.07s

2020-08-12 11:06:06 (2.08 MB/s) - 'Speed Dating Data Key.doc' saved [161792/161792]
[2]:
!wget https://pai-public-data.oss-cn-beijing.aliyuncs.com/speed_dating/Speed%20Dating%20Data.csv
--2020-08-12 11:06:07--  https://pai-public-data.oss-cn-beijing.aliyuncs.com/speed_dating/Speed%20Dating%20Data.csv
Resolving pai-public-data.oss-cn-beijing.aliyuncs.com (pai-public-data.oss-cn-beijing.aliyuncs.com)... 47.95.85.22
Connecting to pai-public-data.oss-cn-beijing.aliyuncs.com (pai-public-data.oss-cn-beijing.aliyuncs.com)|47.95.85.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5192296 (5.0M) [text/csv]
Saving to: 'Speed Dating Data.csv'

100%[======================================>] 5,192,296   18.7MB/s   in 0.3s

2020-08-12 11:06:08 (18.7 MB/s) - 'Speed Dating Data.csv' saved [5192296/5192296]
Installing and Importing the Required Packages
[3]:
!pip install palettable --user
Collecting palettable
  Downloading https://mirrors.aliyun.com/pypi/packages/ca/46/5198aa24e61bb7eef28d06cb69e56bfa1942f4b6807d95a0b5ce361fe09b/palettable-3.3.0-py2.py3-none-any.whl (111kB)
    100% |################################| 112kB 1.1MB/s
Installing collected packages: palettable
Successfully installed palettable-3.3.0
You are using pip version 9.0.1, however version 20.2.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
[4]:
!pip install imbalanced-learn --user
Collecting imbalanced-learn
  Downloading https://mirrors.aliyun.com/pypi/packages/c8/81/8db4d87b03b998fda7c6f835d807c9ae4e3b141f978597b8d7f31600be15/imbalanced_learn-0.7.0-py3-none-any.whl (167kB)
    100% |################################| 174kB 11.0MB/s
Requirement already satisfied: scikit-learn>=0.23 in /home/admin/.local/lib/python3.6/site-packages (from imbalanced-learn)
Requirement already satisfied: scipy>=0.19.1 in /opt/conda/lib/python3.6/site-packages (from imbalanced-learn)
Requirement already satisfied: joblib>=0.11 in /home/admin/.local/lib/python3.6/site-packages (from imbalanced-learn)
Requirement already satisfied: numpy>=1.13.3 in /home/admin/.local/lib/python3.6/site-packages (from imbalanced-learn)
Requirement already satisfied: threadpoolctl>=2.0.0 in /home/admin/.local/lib/python3.6/site-packages (from scikit-learn>=0.23->imbalanced-learn)
Installing collected packages: imbalanced-learn
Successfully installed imbalanced-learn-0.7.0
You are using pip version 9.0.1, however version 20.2.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
[6]:
!pip install seaborn --user
Collecting seaborn
  Downloading https://mirrors.aliyun.com/pypi/packages/c7/e6/54aaaafd0b87f51dfba92ba73da94151aa3bc179e5fe88fc5dfb3038e860/seaborn-0.10.1-py3-none-any.whl (215kB)
    100% |################################| 225kB 13.2MB/s
Requirement already satisfied: matplotlib>=2.1.2 in /opt/conda/lib/python3.6/site-packages (from seaborn)
Requirement already satisfied: numpy>=1.13.3 in /home/admin/.local/lib/python3.6/site-packages (from seaborn)
Requirement already satisfied: pandas>=0.22.0 in /opt/conda/lib/python3.6/site-packages (from seaborn)
Requirement already satisfied: scipy>=1.0.1 in /opt/conda/lib/python3.6/site-packages (from seaborn)
Requirement already satisfied: python-dateutil>=2.1 in /opt/conda/lib/python3.6/site-packages (from matplotlib>=2.1.2->seaborn)
Requirement already satisfied: cycler>=0.10 in /opt/conda/lib/python3.6/site-packages (from matplotlib>=2.1.2->seaborn)
Requirement already satisfied: kiwisolver>=1.0.1 in /opt/conda/lib/python3.6/site-packages (from matplotlib>=2.1.2->seaborn)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /opt/conda/lib/python3.6/site-packages (from matplotlib>=2.1.2->seaborn)
Requirement already satisfied: pytz>=2011k in /opt/conda/lib/python3.6/site-packages (from pandas>=0.22.0->seaborn)
Requirement already satisfied: six>=1.5 in /opt/conda/lib/python3.6/site-packages (from python-dateutil>=2.1->matplotlib>=2.1.2->seaborn)
Requirement already satisfied: setuptools in /opt/conda/lib/python3.6/site-packages (from kiwisolver>=1.0.1->matplotlib>=2.1.2->seaborn)
Installing collected packages: seaborn
Successfully installed seaborn-0.10.1
You are using pip version 9.0.1, however version 20.2.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
[8]:
# importing packages
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn import metrics
import imblearn
from palettable.colorbrewer.qualitative import Pastel1_3
EDA: Exploratory Data Analysis
[9]:
df = pd.read_csv('Speed Dating Data.csv', encoding='gbk')
df.head()
[9]:
| | iid | id | gender | idg | condtn | wave | round | position | positin1 | order | ... | attr3_3 | sinc3_3 | intel3_3 | fun3_3 | amb3_3 | attr5_3 | sinc5_3 | intel5_3 | fun5_3 | amb5_3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1.0 | 0 | 1 | 1 | 1 | 10 | 7 | NaN | 4 | ... | 5.0 | 7.0 | 7.0 | 7.0 | 7.0 | NaN | NaN | NaN | NaN | NaN |
| 1 | 1 | 1.0 | 0 | 1 | 1 | 1 | 10 | 7 | NaN | 3 | ... | 5.0 | 7.0 | 7.0 | 7.0 | 7.0 | NaN | NaN | NaN | NaN | NaN |
| 2 | 1 | 1.0 | 0 | 1 | 1 | 1 | 10 | 7 | NaN | 10 | ... | 5.0 | 7.0 | 7.0 | 7.0 | 7.0 | NaN | NaN | NaN | NaN | NaN |
| 3 | 1 | 1.0 | 0 | 1 | 1 | 1 | 10 | 7 | NaN | 5 | ... | 5.0 | 7.0 | 7.0 | 7.0 | 7.0 | NaN | NaN | NaN | NaN | NaN |
| 4 | 1 | 1.0 | 0 | 1 | 1 | 1 | 10 | 7 | NaN | 7 | ... | 5.0 | 7.0 | 7.0 | 7.0 | 7.0 | NaN | NaN | NaN | NaN | NaN |
5 rows × 195 columns
[10]:
print(df.shape)
(8378, 195)
[11]:
# compute the percentage of missing values for each feature
percent_missing = df.isnull().sum() * 100 / len(df)
missing_value_df = pd.DataFrame({
'column_name': df.columns,
'percent_missing': percent_missing
})
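The same per-column computation, checked on a tiny toy DataFrame (the names `toy`, `a`, `b` are illustrative, not from the dataset):

```python
import numpy as np
import pandas as pd

# toy frame with deliberate NaNs: column `a` is half missing, `b` complete
toy = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, np.nan],
    'b': [1.0, 2.0, 3.0, 4.0],
})
toy_percent_missing = toy.isnull().sum() * 100 / len(toy)
print(toy_percent_missing)  # a -> 50.0, b -> 0.0
```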
[12]:
missing_value_df.sort_values(by='percent_missing', ascending=False).head(10)
[12]:
| | column_name | percent_missing |
|---|---|---|
| num_in_3 | num_in_3 | 92.026737 |
| numdat_3 | numdat_3 | 82.143710 |
| expnum | expnum | 78.515159 |
| sinc7_2 | sinc7_2 | 76.665075 |
| amb7_2 | amb7_2 | 76.665075 |
| shar7_2 | shar7_2 | 76.438291 |
| attr7_2 | attr7_2 | 76.318931 |
| intel7_2 | intel7_2 | 76.318931 |
| fun7_2 | fun7_2 | 76.318931 |
| amb5_3 | amb5_3 | 75.936978 |
How Many People Found a Match Through Speed Dating
[13]:
# how many people found a match through Speed Dating
plt.subplots(figsize=(3,3), dpi=110,)
# 构造数据
size_of_groups=df.match.value_counts().values
single_percentage = round(size_of_groups[0]/sum(size_of_groups) * 100,2)
matched_percentage = round(size_of_groups[1]/sum(size_of_groups)* 100,2)
names = [
    'Single: ' + str(single_percentage) + '%',
    'Matched: ' + str(matched_percentage) + '%']
# 创建饼图
plt.pie(
size_of_groups,
labels=names,
labeldistance=1.2,
colors=Pastel1_3.hex_colors
)
plt.show()
[14]:
# how many women found a match through Speed Dating
plt.subplots(figsize=(3,3), dpi=110,)
# 构造数据
size_of_groups=df[df.gender == 0].match.value_counts().values
single_percentage = round(size_of_groups[0]/sum(size_of_groups) * 100,2)
matched_percentage = round(size_of_groups[1]/sum(size_of_groups)* 100,2)
names = [
    'Single: ' + str(single_percentage) + '%',
    'Matched: ' + str(matched_percentage) + '%']
# 创建饼图
plt.pie(
size_of_groups,
labels=names,
labeldistance=1.2,
colors=Pastel1_3.hex_colors
)
plt.show()
[15]:
# how many men found a match through Speed Dating
plt.subplots(figsize=(3,3), dpi=110,)
# 构造数据
size_of_groups=df[df.gender == 1].match.value_counts().values
single_percentage = round(size_of_groups[0]/sum(size_of_groups) * 100,2)
matched_percentage = round(size_of_groups[1]/sum(size_of_groups)* 100,2)
names = [
    'Single: ' + str(single_percentage) + '%',
    'Matched: ' + str(matched_percentage) + '%']
# 创建饼图
plt.pie(
size_of_groups,
labels=names,
labeldistance=1.2,
colors=Pastel1_3.hex_colors
)
plt.show()
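The three pie-chart cells above differ only in the gender filter; a sketch of a reusable helper (the function names `match_labels` and `plot_match_pie` are ours, and the original `Pastel1_3.hex_colors` palette can still be passed to `plt.pie` via `colors=`):

```python
import matplotlib.pyplot as plt
import pandas as pd

def match_labels(match_series):
    """Build 'Single: x%' / 'Matched: y%' labels from a 0/1 match column."""
    counts = match_series.value_counts()
    total = counts.sum()
    return [
        'Single: ' + str(round(counts[0] / total * 100, 2)) + '%',
        'Matched: ' + str(round(counts[1] / total * 100, 2)) + '%',
    ]

def plot_match_pie(match_series):
    """Pie chart of matched vs. single for any filtered subset."""
    plt.subplots(figsize=(3, 3), dpi=110)
    plt.pie(match_series.value_counts().values,
            labels=match_labels(match_series), labeldistance=1.2)
    plt.show()

# e.g. plot_match_pie(df[df.gender == 0].match)
```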
Age Distribution
[16]:
# age distribution
age = df[np.isfinite(df['age'])]['age']
plt.hist(age,bins=35)
plt.xlabel('Age')
plt.ylabel('Frequency')
[16]:
Text(0, 0.5, 'Frequency')
[17]:
date_df = df[[
'iid', 'gender', 'pid', 'match', 'int_corr', 'samerace', 'age_o',
'race_o', 'pf_o_att', 'pf_o_sin', 'pf_o_int', 'pf_o_fun', 'pf_o_amb',
'pf_o_sha', 'dec_o', 'attr_o', 'sinc_o', 'intel_o', 'fun_o', 'like_o',
'prob_o', 'met_o', 'age', 'race', 'imprace', 'imprelig', 'goal', 'date',
'go_out', 'career_c', 'sports', 'tvsports', 'exercise', 'dining',
'museums', 'art', 'hiking', 'gaming', 'clubbing', 'reading', 'tv',
'theater', 'movies', 'concerts', 'music', 'shopping', 'yoga', 'attr1_1',
'sinc1_1', 'intel1_1', 'fun1_1', 'amb1_1', 'attr3_1', 'sinc3_1',
'fun3_1', 'intel3_1', 'dec', 'attr', 'sinc', 'intel', 'fun', 'like',
'prob', 'met'
]]
[18]:
# heatmap
plt.subplots(figsize=(20,15))
ax = plt.axes()
ax.set_title("Correlation Heatmap")
corr = date_df.corr()
sns.heatmap(corr,
xticklabels=corr.columns.values,
yticklabels=corr.columns.values)
[18]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f9662c23f98>
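With 60+ columns the heatmap is dense; ranking absolute correlations against `match` surfaces the strongest signals directly. A self-contained sketch on synthetic data (in the notebook, `toy` would be replaced by `date_df`):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
like = rng.normal(5, 2, n)
# toy stand-in for date_df: `like` drives match, `noise` is unrelated
toy = pd.DataFrame({
    'like': like,
    'noise': rng.normal(0, 1, n),
    'match': (like + rng.normal(0, 1, n) > 6).astype(int),
})
# drop match's self-correlation, rank the rest by absolute value
ranked = toy.corr()['match'].drop('match').abs().sort_values(ascending=False)
print(ranked)  # `like` ranks first
```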
Model Building
Data Preparation
[19]:
# preparing the data
clean_df = df[['attr_o','sinc_o','intel_o','fun_o','amb_o','shar_o','match']].copy()  # .copy() avoids SettingWithCopyWarning
clean_df.dropna(inplace=True)
X = clean_df[['attr_o','sinc_o','intel_o','fun_o','amb_o','shar_o']]
y = clean_df['match']
[20]:
oversample = imblearn.over_sampling.SVMSMOTE()
X, y = oversample.fit_resample(X, y)
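`SVMSMOTE` synthesizes new minority-class samples near an SVM-estimated decision boundary, so both classes end up equally represented. As a rough numpy-only illustration of the effect on class counts, here is plain random oversampling (SMOTE variants interpolate between neighbors rather than duplicating rows):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.array([0] * 90 + [1] * 10)      # imbalanced: 90 vs 10

# duplicate random minority rows until the classes are balanced
minority = np.where(y == 1)[0]
extra = rng.choice(minority, size=80, replace=True)
X_res = np.vstack([X, X[extra]])
y_res = np.concatenate([y, y[extra]])

print(np.bincount(y))      # [90 10]
print(np.bincount(y_res))  # [90 90]
```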
[21]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)
Model Training
[22]:
# logistic regression classification model
model = LogisticRegression(C=1, random_state=0)
lrc = model.fit(X_train, y_train)
predict_train_lrc = lrc.predict(X_train)
predict_test_lrc = lrc.predict(X_test)
print('Training Accuracy:', metrics.accuracy_score(y_train, predict_train_lrc))
print('Validation Accuracy:', metrics.accuracy_score(y_test, predict_test_lrc))
Training Accuracy: 0.765040825096691 Validation Accuracy: 0.7555841924398625
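Accuracy is a fair summary here because SVMSMOTE balanced the classes, but a confusion matrix and F1 score show where the errors fall. A self-contained sketch on synthetic data (in the notebook, `y_test` and `predict_test_lrc` would be used instead):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.model_selection import train_test_split

# synthetic stand-in for the resampled dating features
X, y = make_classification(n_samples=1000, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)
clf = LogisticRegression(C=1, random_state=0).fit(X_tr, y_tr)
pred = clf.predict(X_te)
print(confusion_matrix(y_te, pred))   # rows: true class, cols: predicted
print('F1:', f1_score(y_te, pred))
```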
Test Results
[24]:
lrc.predict_proba([[8.0,6.0,7.0,7.0,6.0,8.0,]])
[24]:
array([[0.29710471, 0.70289529]])
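The columns of `predict_proba` follow `lrc.classes_`, so the second entry (about 0.70) is the estimated match probability for those ratings. A minimal sketch of reading the output, on a tiny fitted model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# tiny one-feature fit just to show how predict_proba is laid out
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
clf = LogisticRegression().fit(X, y)
proba = clf.predict_proba([[2.5]])
print(clf.classes_)    # column order of predict_proba, here [0 1]
print(proba[0])        # the two probabilities sum to 1
```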