Problem:
Approach:
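1. Build the dataset with:
datasets.make_classification(n_samples, n_features, n_informative, n_redundant, n_repeated, n_classes)
2. Split the dataset for 10-fold cross-validation with:
cross_validation.KFold(length, n_folds, shuffle)
(scikit-learn 0.20 removed the cross_validation module; the equivalent there is model_selection.KFold(n_splits, shuffle).)
3. Train the algorithms: for each split, fit every classifier on the training folds through a user-defined wrapper function, which also computes accuracy, F1-score, and AUC ROC on the held-out fold.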
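Steps 1 and 2 can be checked in isolation before any classifier is trained. The following is a minimal sketch, assuming scikit-learn 0.20 or later (so `model_selection.KFold` rather than `cross_validation.KFold`); the `random_state=0` arguments are an added assumption to make the run reproducible and are not part of the original experiment.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold

# Same dataset parameters as the experiment; random_state is an assumption
# added here so that repeated runs give the same data.
X, y = make_classification(n_samples=1000, n_features=10, n_informative=2,
                           n_redundant=2, n_repeated=0, n_classes=2,
                           random_state=0)

# 10 folds over 1000 samples: each split trains on 900 and tests on 100.
kf = KFold(n_splits=10, shuffle=True, random_state=0)
fold_sizes = [(len(train_idx), len(test_idx)) for train_idx, test_idx in kf.split(X)]

# Collect all test indices to confirm each sample is held out exactly once.
all_test_idx = np.concatenate([test_idx for _, test_idx in kf.split(X)])

print(X.shape, y.shape)   # (1000, 10) (1000,)
print(fold_sizes[0])      # (900, 100)
```

Each sample appears in exactly one test fold, so every metric in the experiment is averaged over predictions that together cover the whole dataset once.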
Code:
from sklearn import datasets
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
# sklearn.cross_validation was removed in scikit-learn 0.20;
# model_selection.KFold is its replacement.
from sklearn.model_selection import KFold
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
import numpy as np

# performance[fold, classifier, metric]; metrics are (accuracy, F1, AUC ROC)
performance = np.zeros(shape=(10, 3, 3))


def metric(y_test, pred):
    acc = metrics.accuracy_score(y_test, pred)
    f1 = metrics.f1_score(y_test, pred)
    # AUC is computed from hard class predictions, as in the original experiment
    auc = metrics.roc_auc_score(y_test, pred)
    return acc, f1, auc


def Gaussian_naive_Bayes(X_train, y_train, X_test, y_test):
    clf = GaussianNB()
    clf.fit(X_train, y_train)
    return metric(y_test, clf.predict(X_test))


def SVM(X_train, y_train, X_test, y_test):
    clf = SVC(C=1e-01, kernel='rbf', gamma=0.1)
    clf.fit(X_train, y_train)
    return metric(y_test, clf.predict(X_test))


def Random_Forest(X_train, y_train, X_test, y_test):
    clf = RandomForestClassifier(n_estimators=100)
    clf.fit(X_train, y_train)
    return metric(y_test, clf.predict(X_test))


X, y = datasets.make_classification(n_samples=1000, n_features=10, n_informative=2,
                                    n_redundant=2, n_repeated=0, n_classes=2)
kf = KFold(n_splits=10, shuffle=True)

# Train and evaluate all three classifiers on each of the 10 splits
for i, (train_index, test_index) in enumerate(kf.split(X)):
    X_train, y_train = X[train_index], y[train_index]
    X_test, y_test = X[test_index], y[test_index]
    performance[i, 0, :] = Gaussian_naive_Bayes(X_train, y_train, X_test, y_test)
    performance[i, 1, :] = SVM(X_train, y_train, X_test, y_test)
    performance[i, 2, :] = Random_Forest(X_train, y_train, X_test, y_test)

name = ['GaussianNB', 'SVC', 'RandomForestClassifier']
mean = np.mean(performance, axis=0)  # average over the 10 folds
for i in range(3):
    print(name[i])
    print('  Accuracy: ', performance[:, i, 0], ' Averaged: ', mean[i, 0])
    print('  F1-score: ', performance[:, i, 1], ' Averaged: ', mean[i, 1])
    print('  AUC ROC: ', performance[:, i, 2], ' Averaged: ', mean[i, 2], '\n')
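The metric function in this code passes hard 0/1 predictions to roc_auc_score. That runs, but it scores only a single operating point, while AUC ROC is usually computed from continuous scores. The sketch below illustrates the difference on four hand-picked samples; the labels, scores, and the 0.5 threshold are hypothetical values chosen for illustration and are not taken from the experiment.

```python
import numpy as np
from sklearn import metrics

# Hypothetical ground truth and classifier scores (illustration only).
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.6, 0.7, 0.9])

# Thresholding at 0.5 gives the hard predictions [0, 1, 1, 1].
pred = (scores >= 0.5).astype(int)

acc = metrics.accuracy_score(y_true, pred)               # 3 of 4 correct
f1 = metrics.f1_score(y_true, pred)                      # precision 2/3, recall 1
auc_from_labels = metrics.roc_auc_score(y_true, pred)    # uses hard predictions
auc_from_scores = metrics.roc_auc_score(y_true, scores)  # uses the ranking of scores

print(acc, f1, auc_from_labels, auc_from_scores)
```

Here the label-based AUC is 0.75, while the score-based AUC is 1.0 because every positive sample is ranked above every negative one. If score-based AUC is wanted in the experiment, one option is to pass clf.predict_proba(X_test)[:, 1] (GaussianNB, RandomForestClassifier) or clf.decision_function(X_test) (SVC) to roc_auc_score instead of clf.predict(X_test).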
Results:
In terms of performance, the results rank the classifiers as: Random Forest > SVC > GaussianNB