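【Posted】: 2019-03-07 11:45:01
【Question】:
I am experimenting with machine learning in Python, and I want to add precision metrics and a confusion matrix to my experiment. My full code is as follows: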
print('Random Forest Testing')
from sklearn import svm
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
import csv
from sklearn import preprocessing
# sklearn.cross_validation was removed in scikit-learn 0.20; train_test_split lives in model_selection
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestClassifier
Open the CSV:
f = open('Telcel_facebook_comments_train.csv')
csv_f = csv.reader(f)
Create the TF-IDF vectorizer:
vectorizer = TfidfVectorizer(analyzer='char',ngram_range=(1, 3))
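To see what this vectorizer setting produces, here is a minimal sketch on two hypothetical strings (not taken from the CSV). With analyzer='char' and ngram_range=(1, 3), each feature is a run of one to three characters, spaces included:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical sample comments, used only for illustration.
demo_comments = ['hola', 'hola mundo']

demo_vectorizer = TfidfVectorizer(analyzer='char', ngram_range=(1, 3))
demo_matrix = demo_vectorizer.fit_transform(demo_comments)

# vocabulary_ maps each character n-gram to its column index.
print(sorted(demo_vectorizer.vocabulary_))
# One row per comment, one column per distinct n-gram.
print(demo_matrix.shape)
```

The number of columns grows quickly with the size of the corpus, which is why the real matrix is stored in sparse form.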
Lists to hold the comments and the labels:
list_comments=[]
list_tags=[]
for row in csv_f:
    list_comments.append(row[0])
    list_tags.append(row[1])
X = vectorizer.fit_transform(list_comments)
print(X)
vectorizadorEtiquetas= preprocessing.LabelEncoder()
Y=vectorizadorEtiquetas.fit_transform(list_tags)
print(Y)
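It can help to check how many distinct labels the encoder actually found before training. A minimal sketch, assuming the labels arrive as strings from csv.reader; the list below is hypothetical and deliberately includes a header value that was read as data:

```python
from collections import Counter

from sklearn.preprocessing import LabelEncoder

# Hypothetical labels as csv.reader would return them (strings).
# 'tag' stands in for a header row that was read as a data row.
sample_tags = ['tag', '1', '2', '2', '3', '1', '2']

encoder = LabelEncoder()
encoded = encoder.fit_transform(sample_tags)

# classes_ lists every distinct label the encoder saw, in sorted order.
print(encoder.classes_)
# Counter shows how often each raw label occurs.
print(Counter(sample_tags))
```

If classes_ has more entries than expected, the extra value shows which raw label needs to be cleaned up in the input file.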
Get the feature names:
tfidf_words=vectorizer.get_feature_names()
clf = svm.SVR()
#Second Machine learning algorithm
clf2 = RandomForestClassifier(n_estimators=10)
clf2 = clf2.fit(X, Y)
#building X train and Y train matrix
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.33, random_state=47)
print('Starting training')
#clf.fit(X_train, y_train)
clf2.fit(X_train, y_train)
print('Training Completed')
print(clf2.score(X_test, y_test))
Import the confusion matrix and the precision/recall function:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_recall_fscore_support
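This is where I need to add the precision and the confusion matrix. The code below is wrong, because I do not know how to obtain the array called "y_true". I only have three classes: 1, 2, 3.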
print(precision_recall_fscore_support(y_true, y_pred, average='macro'))
print(confusion_matrix(y_true, y_pred))
To make this clearer, here is part of the output:
Random Forest Testing
(0, 2128) 0.225797583675
(0, 6205) 0.243191128615
(0, 6366) 0.21798642306
(0, 3292) 0.204253719304
(0, 4763) 0.161726027808
(0, 1950) 0.264734992986
(0, 6457) 0.264734992986
(0, 5153) 0.264734992986
(0, 3216) 0.105568550619
(0, 4760) 0.128342578419
[3 1 1 ..., 2 2 2]
Starting training
Training Completed
0.881481481481
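However, I would appreciate help displaying the confusion matrix and the recall metrics so that I can learn more about my model. Thank you for your support.
This is my second attempt at getting the result. In place of the lines above, I now tried: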
y_pred = clf2.predict(X_test)
print('Training Completed')
'''
Returns the mean accuracy on the given test data and labels.
In multi-label classification, this is the subset accuracy which is a harsh metric since you
require for each sample that each label set be correctly predicted.
'''
print(clf2.score(X_test, y_test))
#importing Confusion Matrix and recall
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_recall_fscore_support
#Here is when I need to add the precision and confusion matrix
print(precision_recall_fscore_support(y_test, y_pred, average='macro'))
print(confusion_matrix(y_test, y_pred))
This is the output:
(0.68431620945676808, 0.61034292763991205, 0.63832235955391514, None)
[[159 83 7 0]
[ 3 811 6 0]
[ 5 22 118 0]
[ 0 1 0 0]]
C:\Program Files\Anaconda3\lib\site-packages\sklearn\metrics\classification.py:1074: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
'precision', 'predicted', average, warn_for)
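The UndefinedMetricWarning above is raised when some class never appears among the predictions, so its precision has a zero denominator. Per-class values make it easier to see which class that is. A sketch with hypothetical arrays (not the real data); average=None returns one value per class, and zero_division, which is available in scikit-learn 0.22 and later, sets the fallback value explicitly:

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical encoded labels: class 3 occurs once and is never predicted.
y_true_demo = np.array([0, 0, 1, 1, 1, 2, 2, 3])
y_pred_demo = np.array([0, 1, 1, 1, 1, 2, 1, 1])

# average=None keeps the metrics separate for each class.
# zero_division=0 assumes scikit-learn >= 0.22.
precision, recall, fscore, support = precision_recall_fscore_support(
    y_true_demo, y_pred_demo, average=None, zero_division=0)

for label, p, r, s in zip(np.unique(y_true_demo), precision, recall, support):
    print(label, p, r, s)
```

The support column counts how many true samples each class has, so a class with a support of 1 stands out immediately.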
The problem now is that I get a 4x4 confusion matrix even though I only have three labels, so I would appreciate help with this.
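confusion_matrix builds its rows and columns from every distinct value that appears in either y_test or y_pred, so a fourth row and column means a fourth label value is present somewhere. The last row of the matrix above holds a single sample, which would be consistent with one stray label in the CSV (a header row, for instance), though the output alone does not confirm that. A minimal sketch with hypothetical arrays, showing the effect and how the labels parameter restricts the matrix:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical encoded labels: classes 0, 1, 2 plus one stray class 3
# (for example a header row or a mistyped tag in the CSV).
y_test_demo = np.array([0, 0, 1, 1, 1, 2, 2, 3])
y_pred_demo = np.array([0, 1, 1, 1, 1, 2, 1, 1])

# Without `labels`, the matrix covers every value seen in either array.
full = confusion_matrix(y_test_demo, y_pred_demo)
print(full.shape)

# Passing `labels` restricts rows and columns to the classes of interest;
# samples whose true label is not listed are left out of the counts.
three = confusion_matrix(y_test_demo, y_pred_demo, labels=[0, 1, 2])
print(three)
```

Restricting the labels only hides the extra class in the report. If the fourth value comes from a bad row in the input, removing that row before fitting the LabelEncoder addresses the cause.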
【Discussion】:
Tags: python scikit-learn jupyter random-forest