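【Posted】: 2020-03-26 14:25:27
【Question】:
I'm working on a project where I'm trying to classify comments into several categories: "toxic", "severe toxic", "obscene", "insult", "identity hate". The dataset I'm using is from this Kaggle challenge: https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge. The problem I'm currently facing is that no matter how small a training set I fit on, when I predict labels for the test data my accuracy is always around 90% or higher. In this case I'm training on 15 rows and testing on 159,556 rows. I would normally be happy with high test accuracy, but here I feel like I'm doing something wrong.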
I'm reading the data into a pandas DataFrame:
import pandas as pd
trainData = pd.read_csv('train.csv')
Printed out, the data looks like this:
        id                comment_text  \
0       0000997932d777bf  Explanation\nWhy the edits made under my usern...
1       000103f0d9cfb60f  D'aww! He matches this background colour I'm s...
2       000113f07ec002fd  Hey man, I'm really not trying to edit war. It...
3       0001b41b1c6bb37e  "\nMore\nI can't make any real suggestions on ...
4       0001d958c54c6e35  You, sir, are my hero. Any chance you remember...
...     ...               ...
159566  ffe987279560d7ff  ":::::And for the second time of asking, when ...
159567  ffea4adeee384e90  You should be ashamed of yourself \n\nThat is ...
159568  ffee36eab5c267c9  Spitzer \n\nUmm, theres no actual article for ...
159569  fff125370e4aaaf3  And it looks like it was actually you who put ...
159570  fff46fc426af1f9a  "\nAnd ... I really don't think you understand...

        toxic  severe_toxic  obscene  threat  insult  identity_hate
0           0             0        0       0       0              0
1           0             0        0       0       0              0
2           0             0        0       0       0              0
3           0             0        0       0       0              0
4           0             0        0       0       0              0
...       ...           ...      ...     ...     ...            ...
159566      0             0        0       0       0              0
159567      0             0        0       0       0              0
159568      0             0        0       0       0              0
159569      0             0        0       0       0              0
159570      0             0        0       0       0              0

[159571 rows x 8 columns]
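Then I use train_test_split to split the data into training and test sets:
from sklearn.model_selection import train_test_split
# X keeps only the comment text; Y keeps only the six label columns.
X = trainData.drop(labels=['id', 'toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate'], axis=1)
Y = trainData.drop(labels=['id', 'comment_text'], axis=1)
trainX, testX, trainY, testY = train_test_split(X, Y, test_size=0.9999, random_state=99)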
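As a quick check on the split sizes quoted in the question, the sketch below runs the same `train_test_split` call on a stand-in frame. The frame contents are hypothetical placeholders; only the row count (159571) is taken from the printout above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for train.csv: same number of rows, dummy contents.
n_rows = 159571
frame = pd.DataFrame({
    "comment_text": ["placeholder"] * n_rows,
    "toxic": np.zeros(n_rows, dtype=int),
})

# Same split parameters as in the question.
train_part, test_part = train_test_split(frame, test_size=0.9999, random_state=99)

# The test size is rounded up, and the remainder goes to training.
print(len(train_part), len(test_part))
```

With these parameters the training part has 15 rows and the test part has 159556, which matches the numbers stated in the question.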
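I'm using sklearn's HashingVectorizer to convert the comments into numeric vectors for classification:
from sklearn.feature_extraction.text import HashingVectorizer

def hashVec():
    # Collect the raw comment strings from the train and test frames.
    trainComments = []
    testComments = []
    for index, row in trainX.iterrows():
        trainComments.append(row['comment_text'])
    for index, row in testX.iterrows():
        testComments.append(row['comment_text'])
    # HashingVectorizer is stateless, so transform() needs no prior fit().
    vectorizer = HashingVectorizer()
    trainSamples = vectorizer.transform(trainComments)
    testSamples = vectorizer.transform(testComments)
    return trainSamples, testSamples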
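To make the vectorization step concrete, here is a minimal sketch of what `HashingVectorizer` produces with its default settings. The two sample strings are made up for illustration and are not rows from the dataset.

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Hypothetical example comments, not taken from train.csv.
sample_comments = ["first example comment", "second example comment"]

# Default settings hash tokens into 2**20 columns; no vocabulary is learned,
# so the same vectorizer can transform train and test text independently.
vectorizer = HashingVectorizer()
hashed = vectorizer.transform(sample_comments)

print(hashed.shape)  # (number of comments, 2**20)
```

The result is a sparse matrix with one row per comment, so 15 training rows still produce vectors with more than a million columns.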
I'm using OneVsRestClassifier and LogisticRegression from sklearn to fit and predict each of the 6 classes:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.multiclass import OneVsRestClassifier

def logRegOVR(trainSamples, testSamples):
    commentTypes = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
    clf = OneVsRestClassifier(LogisticRegression(solver='sag'))
    # Fit and score one binary classifier per label column.
    for cType in commentTypes:
        print(cType, ":")
        clf.fit(trainSamples, trainY[cType])
        pred1 = clf.predict(trainSamples)
        print("\tTrain Accuracy:", accuracy_score(trainY[cType], pred1))
        prediction = clf.predict(testSamples)
        print("\tTest Accuracy:", accuracy_score(testY[cType], prediction))
Finally, here is where I call the functions, and the output I get:
sol = hashVec()
logRegOVR(sol[0],sol[1])
toxic :
    Train Accuracy: 0.8666666666666667
    Test Accuracy: 0.9041590413397177
severe_toxic :
    Train Accuracy: 1.0
    Test Accuracy: 0.9900035097395272
obscene :
    Train Accuracy: 1.0
    Test Accuracy: 0.9470468048835519
threat :
    Train Accuracy: 1.0
    Test Accuracy: 0.9970041866178646
insult :
    Train Accuracy: 1.0
    Test Accuracy: 0.9506317531148938
identity_hate :
    Train Accuracy: 1.0
    Test Accuracy: 0.9911943142219659
When I use a more reasonable train_test_split of 80% training and 20% test, the test accuracy is very similar.
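One way to sanity-check accuracy figures like these is to compare them with a trivial baseline that always predicts 0. The sketch below is only an illustration under assumptions: `all_zero_baseline` is a hypothetical helper, and the toy labels are invented rather than taken from the dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

def all_zero_baseline(y_true):
    """Return (accuracy, recall) of a predictor that always outputs 0."""
    y_true = np.asarray(y_true)
    y_pred = np.zeros_like(y_true)
    acc = accuracy_score(y_true, y_pred)
    # Recall on the positive class shows how many flagged comments are found.
    rec = recall_score(y_true, y_pred, zero_division=0)
    return acc, rec

# Hypothetical labels: 1 positive out of 20 rows (not real data).
toy_labels = [0] * 19 + [1]
acc, rec = all_zero_baseline(toy_labels)
print(acc, rec)
```

On the toy labels the baseline reaches 95% accuracy while finding none of the positives. Calling the same helper on `testY[cType]` for each label would show how much of the reported test accuracy a constant prediction would already achieve, if the labels turn out to be as sparse as the printed rows suggest.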
Thanks for your help.
【Comments】:
Tags: python machine-learning logistic-regression