1. Use a decision tree to train on train_feature.csv and predict on test_feature.csv (practice hyperparameter tuning), then compute the prediction accuracy. (Because the class distribution in the training data is imbalanced, balance it first: keep all positive samples, and sample n times the number of positives from the negative samples.) Note: accuracy = (predicted as downloaded in the test set) & (actually downloaded in the test set) / number of records actually downloaded in the test set
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
train_df=pd.read_csv("C:\\Users\\zzh\\Desktop\\dataMiningExperment\\数据挖掘实训课件\\数据挖掘第3次实训\\数据\\训练和预测用数据--做题用\\train_feature.csv")
train_df.head()
|   | ip | app | device | os | channel | is_attributed | day | hour | minute | ip_count | app_count | device_count | os_count | channel_count | hour_count | minute_count |
|---|----|-----|--------|----|---------|---------------|-----|------|--------|----------|-----------|--------------|----------|---------------|------------|--------------|
| 0 | 83230 | 3 | 1 | 13 | 379 | 0 | 2017-11-06 | 14 | 32 | 938 | 774123 | 6527713 | 1541988 | 101195 | 48 | 110457 |
| 1 | 17357 | 3 | 1 | 19 | 379 | 0 | 2017-11-06 | 14 | 33 | 677 | 774123 | 6527713 | 1644220 | 101195 | 48 | 112948 |
| 2 | 35810 | 3 | 1 | 13 | 379 | 0 | 2017-11-06 | 14 | 34 | 351 | 774123 | 6527713 | 1541988 | 101195 | 48 | 112532 |
| 3 | 45745 | 14 | 1 | 13 | 478 | 0 | 2017-11-06 | 14 | 34 | 7786 | 316214 | 6527713 | 1541988 | 11355 | 48 | 112532 |
| 4 | 161007 | 3 | 1 | 13 | 379 | 0 | 2017-11-06 | 14 | 35 | 132 | 774123 | 6527713 | 1541988 | 101195 | 48 | 115570 |
test_df=pd.read_csv("C:\\Users\\zzh\\Desktop\\dataMiningExperment\\数据挖掘实训课件\\数据挖掘第3次实训\\数据\\训练和预测用数据--做题用\\test_feature.csv")
test_df.head()
|   | click_id | ip | app | device | os | channel | is_attributed | day | hour | minute | ip_count | app_count | device_count | os_count | channel_count | hour_count | minute_count |
|---|----------|----|-----|--------|----|---------|---------------|-----|------|--------|----------|-----------|--------------|----------|---------------|------------|--------------|
| 0 | 0 | 19870 | 2 | 1 | 13 | 435 | 0 | 2017-11-06 | 23 | 1 | 99 | 308059 | 2853433 | 657790 | 42678 | 2308568 | 68675 |
| 1 | 1 | 50314 | 15 | 1 | 17 | 265 | 0 | 2017-11-06 | 23 | 1 | 233 | 307505 | 2853433 | 153419 | 68057 | 2308568 | 68675 |
| 2 | 2 | 183513 | 15 | 1 | 13 | 153 | 0 | 2017-11-06 | 23 | 1 | 105 | 307505 | 2853433 | 657790 | 104935 | 2308568 | 68675 |
| 3 | 3 | 35731 | 12 | 1 | 19 | 178 | 0 | 2017-11-06 | 23 | 1 | 550 | 348786 | 2853433 | 765928 | 89744 | 2308568 | 68675 |
| 4 | 4 | 186444 | 12 | 1 | 3 | 265 | 0 | 2017-11-06 | 23 | 1 | 16 | 348786 | 2853433 | 45955 | 68057 | 2308568 | 68675 |
(Because the training data is imbalanced, balance it: keep all positive samples and draw n times the number of positives from the negatives.)
train_df["is_attributed"].value_counts()
0 6986725
1 13275
Name: is_attributed, dtype: int64
# Keep every positive sample; down-sample the negatives to 5x the positives (n = 5)
tmp_is1 = train_df[train_df['is_attributed'] == 1]
tmp_is0 = train_df[train_df['is_attributed'] == 0]
tmp_is0 = tmp_is0.sample(n=tmp_is1.shape[0] * 5)
train_df = pd.concat([tmp_is1, tmp_is0])  # DataFrame.append was removed in pandas 2.x
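As an aside, the imbalance can also be handled without discarding data. A minimal sketch (not part of the original notebook, shown on synthetic labels) using `class_weight="balanced"`, which reweights samples inversely to class frequency instead of down-sampling:

```python
# Alternative to down-sampling: let the tree reweight the minority class.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic 95/5 imbalanced data standing in for the real features.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

clf = DecisionTreeClassifier(class_weight="balanced", max_depth=5, random_state=0)
clf.fit(X, y)
print(clf.score(X, y))
```

Down-sampling keeps training fast on 7M rows; `class_weight` keeps all the negatives' information. Either is defensible here.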
Drop the 'day' column (the training set covers only one day while the test set covers two, so the feature cannot generalize):
print(train_df["day"].value_counts())
print(test_df["day"].value_counts())
2017-11-06 79650
Name: day, dtype: int64
2017-11-06 2308568
2017-11-07 691432
Name: day, dtype: int64
train_df1=train_df.drop(['day'],axis=1)
test_df1=test_df.drop(['day'],axis=1)
test_df1=test_df1.drop(['click_id'],axis=1)
y_train=train_df1[['is_attributed']].values
y_test=test_df1[['is_attributed']].values
x_train=train_df1.drop(['is_attributed'],axis=1)
x_test=test_df1.drop(['is_attributed'],axis=1)
clf = DecisionTreeClassifier()
clf.fit(x_train, y_train)
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False,
random_state=None, splitter='best')
print("Training score:", clf.score(x_train, y_train))
print("Test score:", clf.score(x_test, y_test))
Training score: 0.9996359070935342
Test score: 0.8894056666666666
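The notebook imports GridSearchCV but never exercises it, although the task asks to practice tuning. A minimal sketch of how it could be wired in, on synthetic data standing in for the real features (the grid values and `scoring="recall"` choice are illustrative assumptions, not the original author's settings):

```python
# Hypothetical tuning sketch: cross-validated grid search over tree depth and leaf size.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data in place of x_train / y_train.
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.8, 0.2], random_state=0)

param_grid = {
    "max_depth": [5, 10, 20, None],
    "min_samples_leaf": [1, 10, 50],
}
# Score on recall, matching the evaluation metric used at the end of the exercise.
grid = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=3, scoring="recall")
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```

Limiting `max_depth` and raising `min_samples_leaf` is the usual cure for the near-perfect training score (0.9996) seen above, which suggests the unconstrained tree is overfitting.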
predict=clf.predict(x_test)
submission = pd.DataFrame({
    'click_id': test_df['click_id'],
    'is_attributed': predict
})
Note: accuracy = (predicted as downloaded in the test set) & (actually downloaded in the test set) / number of records actually downloaded in the test set. (This quantity is the recall of the positive class, not classification accuracy.)
print("Accuracy:", sum((predict == 1) & (test_df.is_attributed == 1)) / sum(test_df.is_attributed == 1))
Accuracy: 0.7458654906284454
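Since the formula above is exactly positive-class recall, it can be cross-checked against `sklearn.metrics.recall_score`. A small sketch on toy labels (not the real test set):

```python
# The hand-rolled ratio and recall_score agree: both count true positives
# over all actual positives.
import numpy as np
from sklearn.metrics import recall_score

y_true = np.array([1, 1, 0, 1, 0, 0])  # 3 actual positives
y_pred = np.array([1, 0, 0, 1, 1, 0])  # 2 of them predicted correctly

manual = ((y_pred == 1) & (y_true == 1)).sum() / (y_true == 1).sum()
print(manual, recall_score(y_true, y_pred))  # both 2/3
```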