Decision Tree Experiment

1. Prepare the data (E:\MachineLearning-data\AllElectronics.csv)

RID  age          Income  student  credit_rating  Class_buys_computer
1    youth        high    no       fair           no
2    youth        high    no       excellent      no
3    middle_aged  high    no       fair           yes
4    senior       medium  no       fair           yes
5    senior       low     yes      fair           yes
6    senior       low     yes      excellent      no
7    middle_aged  low     yes      excellent      yes
8    youth        medium  no       fair           no
9    youth        low     yes      fair           yes
10   senior       medium  yes      fair           yes
11   youth        medium  yes      excellent      yes
12   middle_aged  medium  no       excellent      yes
13   middle_aged  high    yes      fair           yes
14   senior       medium  no       excellent      no
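
With only 14 rows, the quantity behind the 'entropy' criterion used in the experiment code can be checked directly. The following is a minimal sketch rather than the author's code: it computes the ID3-style information gain of each attribute over the table above. The names ROWS, entropy and info_gain are made up for this illustration. Note that scikit-learn splits on one-hot columns with binary tests, so its tree does not have to match a multiway ID3 tree.

```python
from collections import Counter
from math import log2

COLUMNS = ["age", "Income", "student", "credit_rating"]
# The 14 rows of the table above without the RID column; the last value is the class
ROWS = [
    ("youth", "high", "no", "fair", "no"),
    ("youth", "high", "no", "excellent", "no"),
    ("middle_aged", "high", "no", "fair", "yes"),
    ("senior", "medium", "no", "fair", "yes"),
    ("senior", "low", "yes", "fair", "yes"),
    ("senior", "low", "yes", "excellent", "no"),
    ("middle_aged", "low", "yes", "excellent", "yes"),
    ("youth", "medium", "no", "fair", "no"),
    ("youth", "low", "yes", "fair", "yes"),
    ("senior", "medium", "yes", "fair", "yes"),
    ("youth", "medium", "yes", "excellent", "yes"),
    ("middle_aged", "medium", "no", "excellent", "yes"),
    ("middle_aged", "high", "yes", "fair", "yes"),
    ("senior", "medium", "no", "excellent", "no"),
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in Counter(labels).values())


def info_gain(rows, col):
    """Class entropy minus the weighted class entropy after splitting on column col."""
    groups = {}
    for r in rows:
        groups.setdefault(r[col], []).append(r[-1])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([r[-1] for r in rows]) - remainder


gains = {name: info_gain(ROWS, i) for i, name in enumerate(COLUMNS)}
for name, gain in sorted(gains.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {gain:.3f}")
```

In this sketch age has the largest gain (about 0.25 bits), so an ID3-style tree would test age first.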


2. Experiment code

# -*- coding: utf-8 -*-

# Build a decision tree and use it for prediction
import csv

from sklearn import preprocessing
from sklearn import tree
from sklearn.feature_extraction import DictVectorizer

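# 1. Read the data. In 'rt' mode Python converts \r\n to \n while reading text.
#    The encoding must match the file's encoding. The raw string (r'...') keeps
#    the backslashes of the Windows path literal. If the file starts with a
#    UTF-8 BOM, the first header is read as '\ufeffRID'; encoding="utf-8-sig"
#    would strip it.
allElectronicsData = open(r'E:\MachineLearning-data\AllElectronics.csv', 'rt', encoding="utf-8")
reader = csv.reader(allElectronicsData)
headers = next(reader)
# Print the attribute names read from the header row
print(headers)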

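# 2. Store the data
# featuresList: one dict per row with the values of age, Income, student and credit_rating
# labelList: the class label (last column) of each row
featuresList = []
labelList = []

for row in reader:
    labelList.append(row[-1])
    rowDict = {}
    # Skip column 0 (RID) and the last column (the class label)
    for i in range(1, len(row) - 1):
        rowDict[headers[i]] = row[i]
    featuresList.append(rowDict)

# All rows have been read, so the file can be closed
allElectronicsData.close()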

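# 3. Vectorize the data (one-hot encode the categorical attributes)
vec = DictVectorizer()
dummyX = vec.fit_transform(featuresList).toarray()
print("dummyX:" + str(dummyX))
# Print the one-hot feature names. get_feature_names() was removed in
# scikit-learn 1.2; on releases older than 1.0 call vec.get_feature_names() instead.
print(vec.get_feature_names_out())
# Print the class labels of the training set
print("labelList:" + str(labelList))
# 4. Convert the class labels to numbers (no -> 0, yes -> 1)
lb = preprocessing.LabelBinarizer()
dummyY = lb.fit_transform(labelList)
print("dummyY:" + str(dummyY))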
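# 5. The data is ready; create the decision tree with entropy as the split criterion and train it
clf = tree.DecisionTreeClassifier(criterion='entropy')
clf = clf.fit(dummyX, dummyY)
print("clf:" + str(clf))
# 6. Write the tree to a file in Graphviz dot format
#    (export_graphviz returns None when out_file is given, so its result is not assigned)
with open(r"E:\MachineLearning-data\AllElectronicInformationGainOri.dot", 'w') as f:
    tree.export_graphviz(clf, feature_names=list(vec.get_feature_names_out()), out_file=f)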
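# 7. Build a sample to predict: start from the first training row
oneRowX = dummyX[0, :]
print("oneRowX: " + str(oneRowX))
# Change some values. Copy first so that the training data in dummyX is not modified.
# Index 0 is Income=high and index 2 is Income=medium (see the feature names printed in step 3).
newRowX = oneRowX.copy()
newRowX[0] = 1
newRowX[2] = 1
print("newRowX: " + str(newRowX))
# 8. Predict. predict() expects a 2-D array, so reshape the single sample to (1, n_features).
predictedY = clf.predict(newRowX.reshape(1, -1))
print("predictedY: " + str(predictedY))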
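
Editing positions of a one-hot vector by hand is easy to get wrong, because each original attribute should have exactly one of its columns set. As an alternative, here is a hedged sketch (not part of the original experiment) that encodes a new record with the fitted DictVectorizer and trains on the string labels directly. The inlined `raw` text stands in for the CSV file, and `new_customer` is an illustrative name.

```python
from sklearn import tree
from sklearn.feature_extraction import DictVectorizer

FIELDS = ["age", "Income", "student", "credit_rating"]
# Same 14 rows as AllElectronics.csv, inlined so the sketch runs without the file
raw = """youth high no fair no
youth high no excellent no
middle_aged high no fair yes
senior medium no fair yes
senior low yes fair yes
senior low yes excellent no
middle_aged low yes excellent yes
youth medium no fair no
youth low yes fair yes
senior medium yes fair yes
youth medium yes excellent yes
middle_aged medium no excellent yes
middle_aged high yes fair yes
senior medium no excellent no"""
records = [line.split() for line in raw.splitlines()]
features = [dict(zip(FIELDS, r[:4])) for r in records]
labels = [r[4] for r in records]

# sparse=False returns a dense NumPy array directly
vec = DictVectorizer(sparse=False)
X = vec.fit_transform(features)

# scikit-learn classifiers accept string class labels as they are
clf = tree.DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(X, labels)

# transform() reuses the column layout learned by fit_transform()
new_customer = {"age": "youth", "Income": "medium", "student": "no", "credit_rating": "fair"}
prediction = clf.predict(vec.transform([new_customer]))[0]
print(prediction)
```

The record used here is identical to row 8 of the training data, so the fully grown tree returns that row's label.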
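3. Experiment results (output of the original run on an older scikit-learn release, hence the DeprecationWarning at the end)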
"D:\Program Files\Python\Anaconda\python.exe" E:/Python/machinelearning/01.py
['\ufeffRID', 'age', 'Income', 'student', 'credit_rating', 'Class_buys_computer']
dummyX:[[ 1.  0.  0.  0.  0.  1.  0.  1.  1.  0.]
 [ 1.  0.  0.  0.  0.  1.  1.  0.  1.  0.]
 [ 1.  0.  0.  1.  0.  0.  0.  1.  1.  0.]
 [ 0.  0.  1.  0.  1.  0.  0.  1.  1.  0.]
 [ 0.  1.  0.  0.  1.  0.  0.  1.  0.  1.]
 [ 0.  1.  0.  0.  1.  0.  1.  0.  0.  1.]
 [ 0.  1.  0.  1.  0.  0.  1.  0.  0.  1.]
 [ 0.  0.  1.  0.  0.  1.  0.  1.  1.  0.]
 [ 0.  1.  0.  0.  0.  1.  0.  1.  0.  1.]
 [ 0.  0.  1.  0.  1.  0.  0.  1.  0.  1.]
 [ 0.  0.  1.  0.  0.  1.  1.  0.  0.  1.]
 [ 0.  0.  1.  1.  0.  0.  1.  0.  1.  0.]
 [ 1.  0.  0.  1.  0.  0.  0.  1.  0.  1.]
 [ 0.  0.  1.  0.  1.  0.  1.  0.  1.  0.]]
['Income=high', 'Income=low', 'Income=medium', 'age=middle_aged', 'age=senior', 'age=youth', 'credit_rating=excellent', 'credit_rating=fair', 'student=no', 'student=yes']
labelList:['no', 'no', 'yes', 'yes', 'yes', 'no', 'yes', 'no', 'yes', 'yes', 'yes', 'yes', 'yes', 'no']
dummyY:[[0]
 [0]
 [1]
 [1]
 [1]
 [0]
 [1]
 [0]
 [1]
 [1]
 [1]
 [1]
 [1]
 [0]]
clf:DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')
oneRowX: [ 1.  0.  0.  0.  0.  1.  0.  1.  1.  0.]
newRowX: [ 1.  0.  1.  0.  0.  1.  0.  1.  1.  0.]
predictedY: [0]
D:\Program Files\Python\Anaconda\lib\site-packages\sklearn\utils\validation.py:395: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  DeprecationWarning)
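
The feature-name list in the output explains how to read each row of dummyX: every attribute=value pair becomes one 0/1 column, and the columns are sorted alphabetically (uppercase 'I' sorts before the lowercase letters, which is why the Income columns come first). Below is a simplified pure-Python sketch of that encoding. It is not DictVectorizer's actual implementation, and the helpers `fit_columns` and `encode` are made up for this illustration.

```python
def fit_columns(records):
    """Collect every 'attribute=value' pair that occurs and sort the pairs."""
    return sorted({f"{k}={v}" for rec in records for k, v in rec.items()})


def encode(record, columns):
    """Return a 0/1 row with a 1.0 in each column that the record matches."""
    present = {f"{k}={v}" for k, v in record.items()}
    return [1.0 if col in present else 0.0 for col in columns]


# Three records that together contain every attribute value of the data set
records = [
    {"age": "youth", "Income": "high", "student": "no", "credit_rating": "fair"},
    {"age": "middle_aged", "Income": "low", "student": "yes", "credit_rating": "excellent"},
    {"age": "senior", "Income": "medium", "student": "no", "credit_rating": "fair"},
]
columns = fit_columns(records)
first_row = encode(records[0], columns)
print(columns)
print(first_row)
```

The first record is row 1 of the data set, and its encoding equals the first row of dummyX in the output above.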
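4. Convert the dot file to PDF (command: dot -Tpdf E:\MachineLearning-data\AllElectronicInformationGainOri.dot -o E:\MachineLearning-data\AllElectronics.pdf)
The input file must be the .dot file written in step 6 of the code. Graphviz, the software that performs the dot-to-PDF conversion, is covered in section 6.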
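
The conversion can also be run from Python. This is a minimal sketch under the assumption that the Graphviz `dot` executable is on PATH. `dot_to_pdf_command` and `convert` are hypothetical helpers, the file names are placeholders, and only the `-Tpdf` and `-o` options from the command above are used.

```python
import subprocess


def dot_to_pdf_command(dot_path, pdf_path):
    """Build the argument list for: dot -Tpdf <dot_path> -o <pdf_path>."""
    return ["dot", "-Tpdf", dot_path, "-o", pdf_path]


def convert(dot_path, pdf_path):
    """Run Graphviz; check=True raises CalledProcessError if dot reports a failure."""
    subprocess.run(dot_to_pdf_command(dot_path, pdf_path), check=True)


# Only build and show the command here; call convert(...) to actually run it
cmd = dot_to_pdf_command("tree.dot", "tree.pdf")
print(" ".join(cmd))
```

Passing the arguments as a list avoids quoting problems with paths that contain spaces.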


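5. Summary of errors

1) Error 1
When reading a file, Python reports "UnicodeDecodeError: 'gbk' codec can't decode byte 0x80 in position 205: illegal multibyte sequence"
Cause: without an encoding argument, open() uses the system default encoding (GBK on Chinese Windows), which does not match the file.

Solution 1: state the file's encoding explicitly.
FILE_OBJECT = open('order.log', 'r', encoding='UTF-8')
Solution 2: open the file in binary mode, which returns bytes and skips decoding.
FILE_OBJECT = open('order.log', 'rb')
2) Error 2
_csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)
Cause: csv.reader must iterate over text, not binary data, so the file must not be opened in 'rb' mode:
open(r'E:\MachineLearning-data\AllElectronics.csv', 'rb')
Solution:
open(r'E:\MachineLearning-data\AllElectronics.csv', 'rt', encoding="utf-8")
Notes: rb opens a file read-only in binary mode;
rt opens a file for reading as text, and Python converts \r\n to \n while reading.
3) Error 3
The encoding of the .csv file must match the encoding used to read or write it.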
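
The encoding and mode problems above can be reproduced on a throwaway file. The sketch below is illustrative only: it writes a two-line CSV with a UTF-8 byte order mark into a temporary directory (the file name demo.csv is arbitrary) and then reads it three ways.

```python
import csv
import os
import tempfile

# Write a small CSV that starts with a UTF-8 BOM, as some Windows editors do
path = os.path.join(tempfile.mkdtemp(), "demo.csv")
with open(path, "wb") as f:
    f.write(b"\xef\xbb\xbfRID,age\r\n1,youth\r\n")

# Plain utf-8 keeps the BOM as '\ufeff' at the start of the first header
with open(path, "rt", encoding="utf-8", newline="") as f:
    header_utf8 = next(csv.reader(f))

# utf-8-sig strips the BOM while decoding
with open(path, "rt", encoding="utf-8-sig", newline="") as f:
    header_sig = next(csv.reader(f))

# Binary mode hands bytes to csv.reader, which raises csv.Error (error 2)
try:
    with open(path, "rb") as f:
        next(csv.reader(f))
    binary_error = None
except csv.Error as exc:
    binary_error = str(exc)

print(header_utf8)
print(header_sig)
print(binary_error)
```

The first header read with plain utf-8 carries the same '\ufeff' prefix that appears in the experiment output.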

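6. Install Graphviz

1) Download: the first is the installer version, the second is the portable (no-install) version

2) Install and configure the environment variable

a. Configure the environment variable (add the Graphviz bin directory to the system variable PATH)

b. Check that the installation is correct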
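
Besides running `dot -V` in a terminal, the PATH entry can be checked from Python. This is a small sketch using only the standard library; `find_dot` is a hypothetical helper name.

```python
import shutil


def find_dot():
    """Return the full path of the 'dot' executable, or None if it is not on PATH."""
    return shutil.which("dot")


location = find_dot()
print(location if location else "dot not found: check the PATH entry")
```

A new terminal or IDE session may be needed before a changed PATH becomes visible.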
