Decision Tree Experiment
1. Prepare the data (E:\MachineLearning-data\AllElectronics.csv)
| RID | age | Income | student | credit_rating | Class_buys_computer |
|---|---|---|---|---|---|
| 1 | youth | high | no | fair | no |
| 2 | youth | high | no | excellent | no |
| 3 | middle_aged | high | no | fair | yes |
| 4 | senior | medium | no | fair | yes |
| 5 | senior | low | yes | fair | yes |
| 6 | senior | low | yes | excellent | no |
| 7 | middle_aged | low | yes | excellent | yes |
| 8 | youth | medium | no | fair | no |
| 9 | youth | low | yes | fair | yes |
| 10 | senior | medium | yes | fair | yes |
| 11 | youth | medium | yes | excellent | yes |
| 12 | middle_aged | medium | no | excellent | yes |
| 13 | middle_aged | high | yes | fair | yes |
| 14 | senior | medium | no | excellent | no |
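To get a feel for what an entropy-based tree looks for in this table, the information gain of each attribute can be computed directly from the 14 rows. The following is a minimal sketch of the textbook multiway (ID3-style) formulation; the names `ROWS`, `ATTRIBUTES`, `entropy` and `information_gain` are illustrative and not part of the experiment script. scikit-learn splits on binary one-hot columns instead, so its internal numbers will differ.

```python
from collections import Counter
from math import log2

# The 14 training rows from the table: (age, Income, student, credit_rating, label)
ROWS = [
    ("youth", "high", "no", "fair", "no"),
    ("youth", "high", "no", "excellent", "no"),
    ("middle_aged", "high", "no", "fair", "yes"),
    ("senior", "medium", "no", "fair", "yes"),
    ("senior", "low", "yes", "fair", "yes"),
    ("senior", "low", "yes", "excellent", "no"),
    ("middle_aged", "low", "yes", "excellent", "yes"),
    ("youth", "medium", "no", "fair", "no"),
    ("youth", "low", "yes", "fair", "yes"),
    ("senior", "medium", "yes", "fair", "yes"),
    ("youth", "medium", "yes", "excellent", "yes"),
    ("middle_aged", "medium", "no", "excellent", "yes"),
    ("middle_aged", "high", "yes", "fair", "yes"),
    ("senior", "medium", "no", "excellent", "no"),
]
ATTRIBUTES = ["age", "Income", "student", "credit_rating"]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, column):
    """Entropy of all labels minus the weighted entropy after grouping by one column."""
    labels = [row[-1] for row in rows]
    groups = {}
    for row in rows:
        groups.setdefault(row[column], []).append(row[-1])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


GAINS = {name: information_gain(ROWS, i) for i, name in enumerate(ATTRIBUTES)}
for name, gain in GAINS.items():
    print(f"{name}: {gain:.3f}")
```

Under this formulation age has the largest gain, which is why textbook treatments of this dataset put age at the root of the tree.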
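2. Experiment code
# -*- coding: utf-8 -*-
# Build a decision tree and use it for prediction
import csv

from sklearn import preprocessing
from sklearn import tree
from sklearn.feature_extraction import DictVectorizer

# 1. Read the data. In 'rt' mode Python converts \r\n to \n while reading text.
#    The encoding must match the encoding of the file. With encoding="utf-8" a
#    byte order mark stays attached to the first header (the '\ufeffRID' in the
#    results); encoding="utf-8-sig" would strip it.
allElectronicsData = open(r'E:\MachineLearning-data\AllElectronics.csv', 'rt', encoding="utf-8")
reader = csv.reader(allElectronicsData)
headers = next(reader)  # attribute names
print(headers)

# 2. Store the data
# featuresList: one dict per row holding the values of age, Income, student, credit_rating
# labelList: the class label of each row
featuresList = []
labelList = []
for row in reader:
    labelList.append(row[len(row) - 1])
    rowDict = {}
    for i in range(1, len(row) - 1):
        rowDict[headers[i]] = row[i]
    featuresList.append(rowDict)
allElectronicsData.close()

# 3. Vectorize the data (one 0/1 column per attribute=value pair)
vec = DictVectorizer()
dummyX = vec.fit_transform(featuresList).toarray()
print("dummyX:" + str(dummyX))
# Print the feature names.
# scikit-learn 1.0 and later provide get_feature_names_out() instead.
print(vec.get_feature_names())
# Print the class labels of the training set
print("labelList:" + str(labelList))

# 4. Convert the class labels to numbers
lb = preprocessing.LabelBinarizer()
dummyY = lb.fit_transform(labelList)
print("dummyY:" + str(dummyY))

# 5. Features are ready; configure and train the decision tree
clf = tree.DecisionTreeClassifier(criterion='entropy')
clf = clf.fit(dummyX, dummyY)
print("clf:" + str(clf))

# 6. Write the tree to a dot file
with open(r"E:\MachineLearning-data\AllElectronicInformationGainOri.dot", 'w') as f:
    tree.export_graphviz(clf, feature_names=vec.get_feature_names(), out_file=f)

# 7. Take the first sample (one row) as the basis for a prediction
oneRowX = dummyX[0, :]
print("oneRowX: " + str(oneRowX))
# Modify some values on a copy, so the training data itself is not changed
newRowX = oneRowX.copy()
newRowX[0] = 1
newRowX[2] = 1
print("newRowX: " + str(newRowX))

# 8. Predict. predict() expects a 2D array of shape (n_samples, n_features);
#    passing the 1D row directly produces the DeprecationWarning shown in the
#    results and raises ValueError from scikit-learn 0.19 on.
predictedY = clf.predict(newRowX.reshape(1, -1))
print("predictedY: " + str(predictedY))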
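Step 3 of the script turns each row dict into a vector of 0/1 values. The sketch below imitates that idea in plain Python so the mapping is easier to see; `one_hot` is a hypothetical helper, not scikit-learn's implementation. It assumes columns are ordered by sorted `attribute=value` name, which matches the column order `DictVectorizer` reports in the results.

```python
def one_hot(records):
    """Encode {attribute: value} dicts as 0/1 rows, one column per 'attribute=value'."""
    # Collect every distinct attribute=value pair and sort them to fix the column order.
    names = sorted({f"{key}={value}" for rec in records for key, value in rec.items()})
    index = {name: i for i, name in enumerate(names)}
    rows = []
    for rec in records:
        row = [0.0] * len(names)
        for key, value in rec.items():
            row[index[f"{key}={value}"]] = 1.0
        rows.append(row)
    return names, rows


# Two illustrative records using the attribute names from the CSV header
records = [
    {"age": "youth", "Income": "high", "student": "no", "credit_rating": "fair"},
    {"age": "senior", "Income": "low", "student": "yes", "credit_rating": "excellent"},
]
names, rows = one_hot(records)
print(names)
for row in rows:
    print(row)
```

Because uppercase letters sort before lowercase ones, the `Income=...` columns come before the `age=...` columns.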
3. Experiment results
"D:\Program Files\Python\Anaconda\python.exe" E:/Python/machinelearning/01.py
['\ufeffRID', 'age', 'Income', 'student', 'credit_rating', 'Class_buys_computer']
dummyX:[[ 1. 0. 0. 0. 0. 1. 0. 1. 1. 0.]
 [ 1. 0. 0. 0. 0. 1. 1. 0. 1. 0.]
 [ 1. 0. 0. 1. 0. 0. 0. 1. 1. 0.]
 [ 0. 0. 1. 0. 1. 0. 0. 1. 1. 0.]
 [ 0. 1. 0. 0. 1. 0. 0. 1. 0. 1.]
 [ 0. 1. 0. 0. 1. 0. 1. 0. 0. 1.]
 [ 0. 1. 0. 1. 0. 0. 1. 0. 0. 1.]
 [ 0. 0. 1. 0. 0. 1. 0. 1. 1. 0.]
 [ 0. 1. 0. 0. 0. 1. 0. 1. 0. 1.]
 [ 0. 0. 1. 0. 1. 0. 0. 1. 0. 1.]
 [ 0. 0. 1. 0. 0. 1. 1. 0. 0. 1.]
 [ 0. 0. 1. 1. 0. 0. 1. 0. 1. 0.]
 [ 1. 0. 0. 1. 0. 0. 0. 1. 0. 1.]
 [ 0. 0. 1. 0. 1. 0. 1. 0. 1. 0.]]
['Income=high', 'Income=low', 'Income=medium', 'age=middle_aged', 'age=senior', 'age=youth', 'credit_rating=excellent', 'credit_rating=fair', 'student=no', 'student=yes']
labelList:['no', 'no', 'yes', 'yes', 'yes', 'no', 'yes', 'no', 'yes', 'yes', 'yes', 'yes', 'yes', 'no']
dummyY:[[0]
 [0]
 [1]
 [1]
 [1]
 [0]
 [1]
 [0]
 [1]
 [1]
 [1]
 [1]
 [1]
 [0]]
clf:DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None, max_features=None, max_leaf_nodes=None, min_impurity_split=1e-07, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, presort=False, random_state=None, splitter='best')
oneRowX: [ 1. 0. 0. 0. 0. 1. 0. 1. 1. 0.]
newRowX: [ 1. 0. 1. 0. 0. 1. 0. 1. 1. 0.]
predictedY: [0]
D:\Program Files\Python\Anaconda\lib\site-packages\sklearn\utils\validation.py:395: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  DeprecationWarning)
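The 0/1 vectors in the output are easier to read when mapped back to the feature names printed with them. The following sketch does that for `oneRowX` and `newRowX`; `FEATURE_NAMES` is copied from the output, and `active_features` is a hypothetical helper introduced only for this illustration.

```python
# Feature names exactly as printed in the results
FEATURE_NAMES = [
    'Income=high', 'Income=low', 'Income=medium',
    'age=middle_aged', 'age=senior', 'age=youth',
    'credit_rating=excellent', 'credit_rating=fair',
    'student=no', 'student=yes',
]


def active_features(row, names=FEATURE_NAMES):
    """Return the names of the columns that are set to 1 in a one-hot row."""
    return [name for name, value in zip(names, row) if value == 1]


one_row = [1., 0., 0., 0., 0., 1., 0., 1., 1., 0.]  # oneRowX from the output
new_row = [1., 0., 1., 0., 0., 1., 0., 1., 1., 0.]  # newRowX from the output

print(active_features(one_row))
print(active_features(new_row))
```

Decoded this way, `newRowX` has both `Income=high` and `Income=medium` set, which no row of the table can have. To describe a youth with medium income instead, index 0 would have to be set to 0 and index 2 to 1.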
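4. Convert the dot file to PDF (command: dot -Tpdf E:\MachineLearning-data\AllElectronics.dot -o E:\MachineLearning-data\AllElectronics.pdf)
Note that the script in section 2 writes AllElectronicInformationGainOri.dot, so pass that file name to dot unless the file has been renamed.
Graphviz, the software that converts dot files to PDF, is described in section 6.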
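The same conversion can also be started from Python once `dot` is on the PATH. This is a minimal sketch that only wraps the `dot -Tpdf ... -o ...` command shown above; `build_dot_command` and `dot_to_pdf` are hypothetical helper names, and the file names in the usage line are placeholders.

```python
import shutil
import subprocess


def build_dot_command(dot_file, pdf_file):
    """Return the argument list for: dot -Tpdf <dot_file> -o <pdf_file>."""
    return ["dot", "-Tpdf", dot_file, "-o", pdf_file]


def dot_to_pdf(dot_file, pdf_file):
    """Run the conversion; return False if the dot executable is not on PATH."""
    if shutil.which("dot") is None:
        return False
    # check=True raises CalledProcessError if dot exits with an error
    subprocess.run(build_dot_command(dot_file, pdf_file), check=True)
    return True


# Placeholder file names; substitute the dot file written by the script
print(build_dot_command("tree.dot", "tree.pdf"))
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.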
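5. Error summary
1) Error 1
Reading the file in Python fails with "UnicodeDecodeError: 'gbk' codec can't decode byte 0x80 in position 205: illegal multibyte sequence". Without an explicit encoding, open() uses the platform default (gbk here), which does not match the file.
Solution 1: open the file with an explicit encoding.
FILE_OBJECT = open('order.log', 'r', encoding='UTF-8')
Solution 2: open the file in binary mode, which skips decoding and returns bytes.
FILE_OBJECT = open('order.log', 'rb')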
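2) Error 2
_csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)
Cause: csv.reader must iterate over strings, but a file opened in binary mode yields bytes:
open(r'E:\MachineLearning-data\AllElectronics.csv', 'rb')
Solution: open the file in text mode:
open(r'E:\MachineLearning-data\AllElectronics.csv', 'rt', encoding="utf-8")
Notes: rb: opens a file read-only in binary format.
rt: opens a file for reading as text; Python converts \r\n to \n while reading.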
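The difference between the two modes can be reproduced without touching the real CSV file. This sketch uses an in-memory byte buffer with made-up content as a stand-in for a file opened with 'rb', then wraps it as text, which is the in-memory counterpart of opening with 'rt'.

```python
import csv
import io

# Stand-in for a file opened in binary mode ('rb'); the content is illustrative
raw = io.BytesIO("RID,age\r\n1,youth\r\n".encode("utf-8"))

# Iterating a binary stream yields bytes, which csv.reader rejects
try:
    next(csv.reader(raw))
    bytes_error = None
except csv.Error as exc:
    bytes_error = exc
print("binary mode:", bytes_error)

# Decoding the same bytes as text gives csv.reader the strings it needs
raw.seek(0)
text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
rows = list(csv.reader(text))
print("text mode:", rows)
```

The csv documentation recommends `newline=""` for files handed to `csv.reader`, so that the reader handles line endings itself.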
3) Error 3
The encoding of the .csv file must match the encoding used when reading and writing it.
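When the encoding of a CSV file is not known in advance, one option is to try a short list of candidates in order. The sketch below is an assumption-laden illustration: `read_csv_rows` is a hypothetical helper, and the candidate list ("utf-8-sig", then "gbk") is only a guess suited to this setup. "utf-8-sig" also strips the byte order mark that shows up as '\ufeffRID' in the experiment output.

```python
import csv
import os
import tempfile


def read_csv_rows(path, encodings=("utf-8-sig", "gbk")):
    """Try each encoding in turn; return (encoding_used, rows) for the first that decodes."""
    if not encodings:
        raise ValueError("at least one encoding is required")
    last_error = None
    for enc in encodings:
        try:
            with open(path, "rt", encoding=enc, newline="") as f:
                return enc, list(csv.reader(f))
        except UnicodeDecodeError as exc:
            last_error = exc
    raise last_error


# Demo on a temporary file that starts with a UTF-8 byte order mark
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    with open(demo_path, "wb") as f:
        f.write(b"\xef\xbb\xbfRID,age\r\n1,youth\r\n")
    used_encoding, rows = read_csv_rows(demo_path)

print(used_encoding, rows)
```

The order of candidates matters: a permissive codec can decode bytes from another encoding without raising an error and return garbled text, so the stricter candidate goes first.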
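6. Install Graphviz
1) Download: the first package is the installer version, the second is the portable (no-install) version
2) Install and configure the environment variable
a. Configure the environment variable (add it to the system variable PATH)
b. Check that the installation is correct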
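One way to run the check in step b from Python is to look for the `dot` executable on the PATH and ask it for its version with `dot -V`. This is a minimal sketch; `graphviz_version` is a hypothetical helper name, and it needs Python 3.7 or later for `capture_output`.

```python
import shutil
import subprocess


def graphviz_version():
    """Return the version line printed by `dot -V`, or None if dot is not on PATH."""
    dot = shutil.which("dot")
    if dot is None:
        return None
    result = subprocess.run([dot, "-V"], capture_output=True, text=True)
    # dot writes its version banner to stderr; fall back to stdout just in case
    return (result.stderr or result.stdout).strip()


version = graphviz_version()
print(version if version else "dot not found on PATH")
```

If the function returns None right after PATH was edited, opening a new terminal (so the updated PATH is picked up) is usually the first thing to try.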