Naive Bayes
Consider an example:
----------------------------------------------------------------------------------------------------------------------------------------------------
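Let p1(x, y) = p(c1|x, y) denote the probability that (x, y) belongs to class 1, and p2(x, y) = p(c2|x, y) the probability that (x, y) belongs to class 2.
If p(c1|x, y) > p(c2|x, y), the class is 1.
If p(c1|x, y) < p(c2|x, y), the class is 2.
By Bayes' rule: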
p(c|x, y) = (p(x, y|c) * p(c)) / p(x, y)
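Here (x, y) is the feature vector to classify and c is the class.
Because p(x, y) has the same value for every class, only p(x, y|c) and p(c) need to be computed.
p(c) is easy to compute from the class labels of the training samples.
p(x, y|c) requires first computing, for each class, the probability of each feature occurring in that class's training samples.
For a test sample, build its feature vector and take its dot product with the trained feature probabilities (in practice their logarithms) to score each class.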
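To illustrate why p(x, y) can be ignored when comparing classes, here is a minimal sketch. The likelihood and prior values are hypothetical numbers chosen only for illustration, and the function name `posteriors` is not from the original code.

```python
from fractions import Fraction


def posteriors(likelihoods, priors):
    """Apply Bayes' rule for every class; return (posteriors, numerators)."""
    # Numerator p(x, y | c) * p(c), one per class.
    numerators = {c: likelihoods[c] * priors[c] for c in priors}
    # The evidence p(x, y) is the sum of the numerators; it is shared by all classes.
    evidence = sum(numerators.values())
    post = {c: n / evidence for c, n in numerators.items()}
    return post, numerators


# Hypothetical values, for illustration only.
likelihoods = {1: Fraction(1, 10), 2: Fraction(3, 10)}  # p(x, y | c)
priors = {1: Fraction(2, 3), 2: Fraction(1, 3)}         # p(c)

post, num = posteriors(likelihoods, priors)
best_by_posterior = max(post, key=post.get)
best_by_numerator = max(num, key=num.get)
print(post, best_by_posterior, best_by_numerator)
```

Dividing every numerator by the same p(x, y) rescales the scores but does not change which class is largest, so both rankings pick the same class.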
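From the sample data, compute for each class the probability of each vocabulary word occurring in that class's items. The samples are listed as number, text, class:
1 'hello word' 0
2 'this is your problem' 0
3 'dont do is that' 1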
There are 3 items and 2 classes.
The vocabulary is ['hello', 'word', 'this', 'is', 'your', 'problem', 'dont', 'do', 'that']; note that duplicate words are removed.
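A minimal sketch of how this vocabulary could be built from the three sample texts. The helper `build_vocab` is a hypothetical name; unlike the set-based `createVocabList` in the full script, it keeps words in order of first appearance so that the result matches the list above.

```python
docs = ["hello word", "this is your problem", "dont do is that"]


def build_vocab(documents):
    """Collect each distinct word once, in order of first appearance."""
    vocab = []
    seen = set()
    for doc in documents:
        for word in doc.split():
            if word not in seen:  # skip duplicates such as the second 'is'
                seen.add(word)
                vocab.append(word)
    return vocab


vocab = build_vocab(docs)
print(vocab)
```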
The feature vector of item 1 is [1, 1, 0, 0, 0, 0, 0, 0, 0]; this is the feature vector (x, y).
The feature vector of item 2 is [0, 0, 1, 1, 1, 1, 0, 0, 0].
The feature vector of item 3 is [0, 0, 0, 1, 0, 0, 1, 1, 1].
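The three vectors can be reproduced with a short sketch. It assumes whitespace tokenization and the vocabulary order given above; `to_vector` is a hypothetical helper that plays the same role as `setOfWords2Vec` in the full script.

```python
vocab = ['hello', 'word', 'this', 'is', 'your', 'problem', 'dont', 'do', 'that']
docs = ["hello word", "this is your problem", "dont do is that"]


def to_vector(vocab, doc):
    """Set-of-words model: 1 if the vocabulary word occurs in doc, else 0."""
    words = set(doc.split())
    return [1 if w in words else 0 for w in vocab]


vectors = [to_vector(vocab, d) for d in docs]
for v in vectors:
    print(v)
```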
Feature vector of class 0
item 1 + item 2 = [1, 1, 1, 1, 1, 1, 0, 0, 0]
sum(item 1) + sum(item 2) = 6
p(x, y|c0) = [1/6, 1/6, 1/6, 1/6, 1/6, 1/6, 0, 0, 0]
p(c0) = 2/3
Feature vector of class 1
p(x, y|c1) = [0, 0, 0, 1/4, 0, 0, 1/4, 1/4, 1/4]
p(c1) = 1/3
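The hand calculation above can be checked with a small sketch using exact fractions. It deliberately applies no smoothing, so it reproduces the zeros shown here; `train_unsmoothed` is a hypothetical helper, not part of the original script.

```python
from fractions import Fraction

vectors = [
    [1, 1, 0, 0, 0, 0, 0, 0, 0],  # item 1, class 0
    [0, 0, 1, 1, 1, 1, 0, 0, 0],  # item 2, class 0
    [0, 0, 0, 1, 0, 0, 1, 1, 1],  # item 3, class 1
]
labels = [0, 0, 1]


def train_unsmoothed(vectors, labels):
    """Return per-class word probabilities and class priors as exact fractions."""
    cond, prior = {}, {}
    for c in sorted(set(labels)):
        rows = [v for v, label in zip(vectors, labels) if label == c]
        counts = [sum(col) for col in zip(*rows)]  # per-word count in class c
        total = sum(counts)                        # total word count in class c
        cond[c] = [Fraction(n, total) for n in counts]
        prior[c] = Fraction(len(rows), len(labels))
    return cond, prior


cond, prior = train_unsmoothed(vectors, labels)
print(cond[0], prior[0])
print(cond[1], prior[1])
```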
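Note that in the actual computation the probabilities are replaced by their logarithms, so the product becomes a sum and the score compared across classes is
log p(c|x, y) = sum((x, y) * log(p(x, y|c))) + log(p(c)) + constant
where (x, y) is the feature vector of the sample, log(p(x, y|c)) is the vector of log word probabilities for class c, and the constant -log(p(x, y)) is the same for every class and can be dropped. To avoid log(0), the code below starts each word count at 1 and each denominator at 2.0.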
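A minimal pure-Python sketch of this log-space scoring on the three-item sample. It assumes the same smoothing as `trainNB0` in the script (word counts start at 1, denominators at 2.0); the names `train_log` and `classify` are hypothetical.

```python
import math

vectors = [
    [1, 1, 0, 0, 0, 0, 0, 0, 0],  # 'hello word', class 0
    [0, 0, 1, 1, 1, 1, 0, 0, 0],  # 'this is your problem', class 0
    [0, 0, 0, 1, 0, 0, 1, 1, 1],  # 'dont do is that', class 1
]
labels = [0, 0, 1]


def train_log(vectors, labels):
    """Return smoothed log word probabilities and log priors per class."""
    n_words = len(vectors[0])
    log_cond, log_prior = {}, {}
    for c in sorted(set(labels)):
        rows = [v for v, label in zip(vectors, labels) if label == c]
        counts = [1.0] * n_words  # start at 1 so no word has probability 0
        total = 2.0               # matching initial denominator
        for row in rows:
            counts = [a + b for a, b in zip(counts, row)]
            total += sum(row)
        log_cond[c] = [math.log(n / total) for n in counts]
        log_prior[c] = math.log(len(rows) / len(labels))
    return log_cond, log_prior


def classify(vec, log_cond, log_prior):
    """Score each class as dot(vec, log p(w|c)) + log p(c); return the best."""
    scores = {
        c: sum(x * lp for x, lp in zip(vec, log_cond[c])) + log_prior[c]
        for c in log_prior
    }
    return max(scores, key=scores.get), scores


log_cond, log_prior = train_log(vectors, labels)
label_a, _ = classify([0, 0, 0, 0, 0, 0, 1, 1, 1], log_cond, log_prior)  # 'dont do that'
label_b, _ = classify([1, 1, 0, 0, 0, 0, 0, 0, 0], log_cond, log_prior)  # 'hello word'
print(label_a, label_b)
```

Summing logarithms avoids the numerical underflow that multiplying many small probabilities would cause, and the comparison between classes is unchanged because the logarithm is monotonic.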
Example: spam filtering
Consider the code:
# -*- coding: utf-8 -*-
"""
Created on Sat May 02 21:52:08 2015

@author: silingxiao
"""
import numpy as np


def loadDataSet():
    postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0, 1, 0, 1, 0, 1]  # 1 is abusive, 0 not
    return postingList, classVec


def createVocabList(dataSet):
    vocabSet = set()  # create empty set
    for document in dataSet:
        vocabSet = vocabSet | set(document)  # union of the two sets
    return list(vocabSet)


def setOfWords2Vec(vocabList, inputSet):
    # 1 if the vocabulary word occurs in inputSet, else 0
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else:
            print("the word: %s is not in my Vocabulary!" % word)
    return returnVec


def trainNB0(trainMatrix, trainCategory):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory) / float(numTrainDocs)
    # start counts at 1 and denominators at 2.0 so no word gets probability 0
    p0Num = np.ones(numWords)
    p1Num = np.ones(numWords)
    p0Denom = 2.0
    p1Denom = 2.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    # log probabilities avoid underflow when many small values are combined
    p1Vect = np.log(p1Num / p1Denom)
    p0Vect = np.log(p0Num / p0Denom)
    return p0Vect, p1Vect, pAbusive


def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    p1 = np.sum(vec2Classify * p1Vec) + np.log(pClass1)  # element-wise mult
    p0 = np.sum(vec2Classify * p0Vec) + np.log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else:
        return 0


def testingNB():
    listOPosts, listClasses = loadDataSet()
    myVocabList = createVocabList(listOPosts)
    trainMat = []
    for postinDoc in listOPosts:
        trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
    p0V, p1V, pAb = trainNB0(np.array(trainMat), np.array(listClasses))
    testEntry = ['love', 'my', 'dalmation']
    thisDoc = np.array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as: ', classifyNB(thisDoc, p0V, p1V, pAb))
    testEntry = ['stupid', 'garbage']
    thisDoc = np.array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as: ', classifyNB(thisDoc, p0V, p1V, pAb))


postingList, classVec = loadDataSet()
vocabSet = createVocabList(postingList)
trainMat = []
for postinDoc in postingList:
    trainMat.append(setOfWords2Vec(vocabSet, postinDoc))

p0v, p1v, pAb = trainNB0(trainMat, classVec)

testingNB()