【Posted】: 2019-06-16 12:50:00
【Question】:
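Currently, I have a dataset with two columns: the procedure name and its CPT code. For example, total knee arthroplasty - 27447, total hip arthroplasty - 27130, open carpal tunnel release - 64721. The dataset has 3000 rows and 5 CPT codes in total (5 classes). I am writing a classification model. When I pass an invalid input, for example "open knee arthroplasty carpal tunnel release", it outputs 64721, which is wrong. Below is the code I am using. What changes can I make to my code, and is a neural network the right choice for this problem?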
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.neural_network import MLPClassifier
xl = pd.ExcelFile("dataset.xlsx") # reading the data
df = xl.parse('Query 2.2')
# shuffling the data
df=df.sample(frac=1)
X_train, X_test, y_train, y_test = train_test_split(df['procedure'], df['code'], random_state = 0,test_size=0.10)
count_vect = CountVectorizer().fit(X_train)
X_train_counts = count_vect.transform(X_train)
tfidf_transformer = TfidfTransformer().fit(X_train_counts)
X_train_tfidf = tfidf_transformer.transform(X_train_counts)
model= MLPClassifier(hidden_layer_sizes=(25),max_iter=500)
classificationModel=model.fit(X_train_tfidf, y_train)
data_to_be_predicted="open knee arthroplasty carpal tunnel release"
result = classificationModel.predict(count_vect.transform([data_to_be_predicted]))
predictionProbablityMatrix = classificationModel.predict_proba(count_vect.transform([data_to_be_predicted]))
maximumPredictedValue = np.amax(predictionProbablityMatrix)
if maximumPredictedValue * 100 > 99:
    print(result[0])
else:
    print("00000")
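One thing worth checking in the code above: the model is trained on TF-IDF features, but at prediction time the input only goes through `count_vect.transform`, so the classifier sees raw counts instead. A minimal sketch of one way to keep both steps consistent is to wrap the vectorizer and the classifier in a scikit-learn pipeline. The toy rows, the `predict_code` helper, and its `threshold` and `fallback` parameters below are assumptions for illustration, not part of the original code or data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

# Toy stand-in for the real spreadsheet (hypothetical rows, not the asker's data).
procedures = [
    "total knee arthroplasty",
    "total knee replacement",
    "total hip arthroplasty",
    "total hip replacement",
    "open carpal tunnel release",
    "carpal tunnel release open",
]
codes = ["27447", "27447", "27130", "27130", "64721", "64721"]

# TfidfVectorizer combines CountVectorizer and TfidfTransformer in one step.
# Inside a pipeline, predict() and predict_proba() reuse the fitted TF-IDF step.
model = make_pipeline(
    TfidfVectorizer(),
    MLPClassifier(hidden_layer_sizes=(25,), max_iter=500, random_state=0),
)
model.fit(procedures, codes)


def predict_code(text, threshold=0.99, fallback="00000"):
    """Return the predicted code, or the fallback when confidence is too low."""
    probabilities = model.predict_proba([text])[0]
    best = probabilities.argmax()
    if probabilities[best] > threshold:
        return str(model.classes_[best])
    return fallback


print(predict_code("open knee arthroplasty carpal tunnel release"))
```

Note that the class probabilities always sum to 1 over the known codes, so an input that mixes words from several procedures can still receive a high score for one of them. The threshold is a heuristic, not a guaranteed way to detect invalid inputs.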
【Comments】:
-
Yes, you are right. For example, when I pass "open knee arthroplasty carpal tunnel release", it should give "00000", but it gives "64721", which is wrong.
-
I didn't follow you. predictionProbablityMatrix = classificationModel.predict_proba(count_vect.transform([data_to_be_predicted])) gives me an array of 5 values because I have 5 classes, and then I use np.amax(predictionProbablityMatrix) to pick the class with the highest probability.
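To make the distinction in the comment above concrete, here is a small sketch of what a `predict_proba` result looks like for one input and five classes. It shows that `np.amax` returns the highest probability itself, while `np.argmax` gives the position of the winning class. The probability values and the labels `code_4` and `code_5` are made-up placeholders, since only three of the five codes appear in the question.

```python
import numpy as np

# Hypothetical class labels; "code_4" and "code_5" are placeholders.
classes = np.array(["27130", "27447", "64721", "code_4", "code_5"])

# Hypothetical predict_proba output: one row per input, one column per class.
proba = np.array([[0.002, 0.003, 0.993, 0.001, 0.001]])

top_probability = np.amax(proba)          # a confidence value, not a class
top_index = np.argmax(proba, axis=1)[0]   # column of the most likely class
top_class = classes[top_index]            # label at that column

print(proba.shape, top_probability, top_class)
```

In scikit-learn, the columns of `predict_proba` follow the order of the fitted estimator's `classes_` attribute, so the same indexing applies to a trained model.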
Tags: python machine-learning neural-network deep-learning data-extraction