Decision Trees (Practice)

Decision tree experiment

1. Prepare the data (E:\MachineLearning-data\AllElectronics.csv)

RID	age	Income	student	credit_rating	Class_buys_computer
1	youth	high	no	fair	no
2	youth	high	no	excellent	no
3	middle_aged	high	no	fair	yes
4	senior	medium	no	fair	yes
5	senior	low	yes	fair	yes
6	senior	low	yes	excellent	no
7	middle_aged	low	yes	excellent	yes
8	youth	medium	no	fair	no
9	youth	low	yes	fair	yes
10	senior	medium	yes	fair	yes
11	youth	medium	yes	excellent	yes
12	middle_aged	medium	no	excellent	yes
13	middle_aged	high	yes	fair	yes
14	senior	medium	no	excellent	no
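
Because the tree in the experiment code is trained with criterion='entropy', it can help to see what that criterion measures on this table. The following is a minimal sketch, not scikit-learn's implementation: it computes by hand the entropy of the class column and the information gain of each attribute. The helper names entropy and information_gain are made up for this illustration.

```python
from collections import Counter
from math import log2

# The 14 rows of the table above: (age, Income, student, credit_rating, class)
ROWS = [
    ("youth", "high", "no", "fair", "no"),
    ("youth", "high", "no", "excellent", "no"),
    ("middle_aged", "high", "no", "fair", "yes"),
    ("senior", "medium", "no", "fair", "yes"),
    ("senior", "low", "yes", "fair", "yes"),
    ("senior", "low", "yes", "excellent", "no"),
    ("middle_aged", "low", "yes", "excellent", "yes"),
    ("youth", "medium", "no", "fair", "no"),
    ("youth", "low", "yes", "fair", "yes"),
    ("senior", "medium", "yes", "fair", "yes"),
    ("youth", "medium", "yes", "excellent", "yes"),
    ("middle_aged", "medium", "no", "excellent", "yes"),
    ("middle_aged", "high", "yes", "fair", "yes"),
    ("senior", "medium", "no", "excellent", "no"),
]
ATTRIBUTES = ["age", "Income", "student", "credit_rating"]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in Counter(labels).values())


def information_gain(rows, column):
    """Entropy of the class minus the weighted entropy after splitting on one column."""
    groups = {}
    for row in rows:
        groups.setdefault(row[column], []).append(row[-1])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[-1] for row in rows]) - remainder


class_entropy = entropy([row[-1] for row in ROWS])
gains = {name: information_gain(ROWS, i) for i, name in enumerate(ATTRIBUTES)}

print("class entropy:", round(class_entropy, 3))  # 9 yes / 5 no, about 0.94
for name, gain in gains.items():
    print(name, round(gain, 3))
```

On this table age has the largest gain (about 0.247). scikit-learn splits on the one-hot columns with binary tests, so its tree does not have to mirror these multiway gains exactly.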

2. Experiment code

# -*- coding: utf-8 -*-

# Build a decision tree and use it to make a prediction
import csv

from sklearn import preprocessing
from sklearn import tree
from sklearn.feature_extraction import DictVectorizer

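# 1. Read the data. In 'rt' (text) mode Python converts \r\n to \n while reading;
#    the encoding must match the encoding the file was saved with.
#    The raw string (r'...') keeps the backslashes of the Windows path literal.
#    If the file starts with a UTF-8 BOM, the first header prints as '\ufeffRID';
#    encoding="utf-8-sig" would strip it.
allElectronicsData = open(r'E:\MachineLearning-data\AllElectronics.csv', 'rt', encoding="utf-8")
reader = csv.reader(allElectronicsData)
headers = next(reader)
# Print the attribute names
print(headers)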

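# 2. Store the data
# featuresList: one dict per row with the values of age, Income, student and credit_rating
# labelList: the class label (Class_buys_computer) of each row
featuresList = []
labelList = []

for row in reader:
    labelList.append(row[-1])
    # Skip column 0 (RID) and the last column (the class label)
    rowDict = {}
    for i in range(1, len(row) - 1):
        rowDict[headers[i]] = row[i]
    featuresList.append(rowDict)
allElectronicsData.close()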

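# 3. Vectorize the data (one-hot encode the categorical attributes)
vec = DictVectorizer()
dummyX = vec.fit_transform(featuresList).toarray()
print("dummyX:" + str(dummyX))
# Print the one-hot feature names.
# get_feature_names_out() needs scikit-learn >= 1.0; older releases use get_feature_names()
print(vec.get_feature_names_out())
# Print the class labels of the training set
print("labelList:" + str(labelList))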
# 4. Convert the class labels to numbers (no -> 0, yes -> 1)
lb = preprocessing.LabelBinarizer()
dummyY = lb.fit_transform(labelList)
print("dummyY:" + str(dummyY))
# 5. Create the decision tree with entropy as the split criterion, then train it
clf = tree.DecisionTreeClassifier(criterion='entropy')
clf = clf.fit(dummyX, dummyY)
print("clf:" + str(clf))
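# 6. Write the trained tree to a .dot file
# get_feature_names_out() needs scikit-learn >= 1.0; older releases use get_feature_names()
with open(r"E:\MachineLearning-data\AllElectronicInformationGainOri.dot", 'w') as f:
    tree.export_graphviz(clf, feature_names=list(vec.get_feature_names_out()), out_file=f)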
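# 7. Build a sample to predict: take the first row of the training data
oneRowX = dummyX[0, :]
print("oneRowX: " + str(oneRowX))
# Modify some values on a copy, so that dummyX itself stays unchanged
newRowX = oneRowX.copy()
newRowX[0] = 1
newRowX[2] = 1
print("newRowX: " + str(newRowX))
# 8. Predict; predict() expects a 2-D array with one row per sample
predictedY = clf.predict(newRowX.reshape(1, -1))
print("predictedY: " + str(predictedY))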
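
The column order of dummyX can look surprising: the Income columns come before the age columns. A rough pure-Python sketch of the idea behind the vectorization step is given here. It only mimics the behaviour (sorted "name=value" feature names, one 0/1 column per name) and is not how DictVectorizer is implemented; the function name one_hot is made up for this illustration.

```python
def one_hot(rows):
    """One-hot encode a list of dicts with string values."""
    # Every distinct "name=value" pair becomes one column; sorting is by
    # character code, so the upper-case "Income" sorts before "age".
    names = sorted({f"{key}={value}" for row in rows for key, value in row.items()})
    index = {name: i for i, name in enumerate(names)}
    matrix = []
    for row in rows:
        encoded = [0.0] * len(names)
        for key, value in row.items():
            encoded[index[f"{key}={value}"]] = 1.0
        matrix.append(encoded)
    return names, matrix


# Rows 1 and 5 of the data table
sample_rows = [
    {"age": "youth", "Income": "high", "student": "no", "credit_rating": "fair"},
    {"age": "senior", "Income": "low", "student": "yes", "credit_rating": "fair"},
]
names, matrix = one_hot(sample_rows)
print(names)
for encoded in matrix:
    print(encoded)
```

With only two rows the sketch produces seven columns; on the full table the same idea yields the ten columns listed in the experiment results.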

3. Experiment results

"D:\Program Files\Python\Anaconda\python.exe" E:/Python/machinelearning/01.py
['\ufeffRID', 'age', 'Income', 'student', 'credit_rating', 'Class_buys_computer']
dummyX:[[ 1.  0.  0.  0.  0.  1.  0.  1.  1.  0.]
 [ 1.  0.  0.  0.  0.  1.  1.  0.  1.  0.]
 [ 1.  0.  0.  1.  0.  0.  0.  1.  1.  0.]
 [ 0.  0.  1.  0.  1.  0.  0.  1.  1.  0.]
 [ 0.  1.  0.  0.  1.  0.  0.  1.  0.  1.]
 [ 0.  1.  0.  0.  1.  0.  1.  0.  0.  1.]
 [ 0.  1.  0.  1.  0.  0.  1.  0.  0.  1.]
 [ 0.  0.  1.  0.  0.  1.  0.  1.  1.  0.]
 [ 0.  1.  0.  0.  0.  1.  0.  1.  0.  1.]
 [ 0.  0.  1.  0.  1.  0.  0.  1.  0.  1.]
 [ 0.  0.  1.  0.  0.  1.  1.  0.  0.  1.]
 [ 0.  0.  1.  1.  0.  0.  1.  0.  1.  0.]
 [ 1.  0.  0.  1.  0.  0.  0.  1.  0.  1.]
 [ 0.  0.  1.  0.  1.  0.  1.  0.  1.  0.]]
['Income=high', 'Income=low', 'Income=medium', 'age=middle_aged', 'age=senior', 'age=youth', 'credit_rating=excellent', 'credit_rating=fair', 'student=no', 'student=yes']
labelList:['no', 'no', 'yes', 'yes', 'yes', 'no', 'yes', 'no', 'yes', 'yes', 'yes', 'yes', 'yes', 'no']
dummyY:[[0]
 [0]
 [1]
 [1]
 [1]
 [0]
 [1]
 [0]
 [1]
 [1]
 [1]
 [1]
 [1]
 [0]]
clf:DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')
oneRowX: [ 1.  0.  0.  0.  0.  1.  0.  1.  1.  0.]
newRowX: [ 1.  0.  1.  0.  0.  1.  0.  1.  1.  0.]
predictedY: [0]
D:\Program Files\Python\Anaconda\lib\site-packages\sklearn\utils\validation.py:395: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  DeprecationWarning)
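
The DeprecationWarning at the end of this output is raised when a single sample is passed to predict as a 1-D array; as the message says, later scikit-learn releases raise ValueError instead. A small sketch of the reshape the warning asks for, using a toy NumPy array rather than the real dummyX (the variable names are illustrative only). It also shows that slicing a row gives a view, so edits should be made on a copy.

```python
import numpy as np

data = np.array([[1.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0]])

view = data[0, :]                 # a view: writing to it would also change data
independent = data[0, :].copy()   # a copy: safe to modify
independent[2] = 1.0

# One sample as a 2-D array with shape (1, n_features)
sample = independent.reshape(1, -1)

print(np.shares_memory(view, data))         # True
print(np.shares_memory(independent, data))  # False
print(data[0])                              # unchanged first row
print(sample.shape)                         # (1, 3)
```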

4. Convert the dot file to a pdf (command: dot -Tpdf E:\MachineLearning-data\AllElectronicInformationGainOri.dot -o E:\MachineLearning-data\AllElectronics.pdf; the input is the .dot file written in step 6 of the code)

Graphviz, the software that converts dot to pdf, is covered in section 6.
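
The same conversion can also be started from Python. This is only a sketch under the assumption that the Graphviz dot executable is on PATH; the helper names dot_to_pdf_command and render_pdf and the file names tree.dot / tree.pdf are placeholders, not part of the experiment.

```python
import shutil
import subprocess


def dot_to_pdf_command(dot_path, pdf_path):
    """Build the argument list for: dot -Tpdf <dot_path> -o <pdf_path>"""
    return ["dot", "-Tpdf", dot_path, "-o", pdf_path]


def render_pdf(dot_path, pdf_path):
    """Run dot if it is installed; return False when it cannot be found."""
    if shutil.which("dot") is None:
        return False
    subprocess.run(dot_to_pdf_command(dot_path, pdf_path), check=True)
    return True


command = dot_to_pdf_command("tree.dot", "tree.pdf")
print(" ".join(command))  # dot -Tpdf tree.dot -o tree.pdf
```

Passing the arguments as a list avoids quoting problems with Windows paths that contain spaces.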

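5. Summary of errors

1) Error 1
Reading a file in Python fails with "UnicodeDecodeError: 'gbk' codec can't decode byte 0x80 in position 205: illegal multibyte sequence"
This happens when open() falls back to the platform default encoding (gbk here) for a file that was saved in a different encoding.

Solution 1: pass the encoding explicitly.
FILE_OBJECT = open('order.log', 'r', encoding='UTF-8')
Solution 2: open the file in binary mode, so that nothing is decoded (the data is then read as bytes).
FILE_OBJECT = open('order.log', 'rb')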
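2) Error 2
_csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)
Cause: csv.reader must iterate over strings, so the file must not be opened in binary mode:
open(r'E:\MachineLearning-data\AllElectronics.csv', 'rb')
Solution:
open(r'E:\MachineLearning-data\AllElectronics.csv', 'rt', encoding="utf-8")
Note: rb opens a file read-only in binary format (binary mode does not accept an encoding argument).
rt opens a file for reading in text mode; Python converts \r\n to \n while reading.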
3) Error 3
The encoding of the .csv file must match the encoding used when reading and writing it.
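
The file-mode and encoding errors can be reproduced on a throwaway file. The sketch below is an illustration only: it writes a small CSV with a UTF-8 BOM into a temporary directory (the file name demo.csv and its contents are made up), then reads it back in three ways.

```python
import csv
import os
import tempfile

# Write a tiny CSV that starts with a UTF-8 BOM, as some Windows editors do.
path = os.path.join(tempfile.mkdtemp(), "demo.csv")
with open(path, "w", encoding="utf-8-sig", newline="") as f:
    f.write("RID,age\n1,youth\n")

# Plain utf-8 keeps the BOM as part of the first header name.
with open(path, "rt", encoding="utf-8", newline="") as f:
    headers_with_bom = next(csv.reader(f))

# utf-8-sig strips the BOM while decoding.
with open(path, "rt", encoding="utf-8-sig", newline="") as f:
    headers_clean = next(csv.reader(f))

# Binary mode hands bytes to csv.reader, which reproduces error 2.
binary_error = None
with open(path, "rb") as f:
    try:
        next(csv.reader(f))
    except csv.Error as exc:
        binary_error = str(exc)

print(headers_with_bom)  # ['\ufeffRID', 'age']
print(headers_clean)     # ['RID', 'age']
print(binary_error)
```

The first print matches the '\ufeffRID' header seen in the experiment results, which suggests the original CSV was saved with a BOM.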

6. Install graphviz

1) Download: the first package is the installer version, the second is the portable (no-install) version

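2) Install and configure the environment variable

a. Configure the environment variable (add the Graphviz bin directory to the system variable PATH)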

b. Check that the installation is correct
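
One way to run this check from Python is sketched below, under the assumption that a correct installation means the dot executable can be found through PATH. The function name graphviz_version is made up for this illustration.

```python
import shutil
import subprocess


def graphviz_version():
    """Return the version banner of dot, or None if dot is not on PATH."""
    dot = shutil.which("dot")
    if dot is None:
        return None
    result = subprocess.run([dot, "-V"], capture_output=True, text=True)
    # The banner is normally written to stderr; fall back to stdout just in case.
    return (result.stderr or result.stdout).strip()


version = graphviz_version()
print(version if version is not None else "dot was not found on PATH")
```

If None comes back right after installing, opening a new terminal usually helps, because PATH changes are only picked up by newly started processes.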
