Handling unseen classes with Keras

【Question title】: Handling unseen classes with Keras
【Posted】: 2021-07-03 04:16:52
【Question】:

I built a Keras model in Python that classifies a string input as a company, a person, or an address. The model was trained on 12,000 strings, each containing 1 to 5 words. This is the model:

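from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

# Assumption: `features` is a pandas DataFrame with a 'text' column.
# Character n-grams (3 to 6) and single words are counted separately and concatenated.
transformerVectoriser = ColumnTransformer(transformers=[('vector char', CountVectorizer(analyzer='char', ngram_range=(3, 6), max_features = 2000), 'text'),
                                                        ('vector word', CountVectorizer(analyzer='word', ngram_range=(1, 1), max_features = 4000), 'text')],
                                          remainder='passthrough') # Default is to drop untransformed columns

features = transformerVectoriser.fit_transform(features)

model = Sequential()
model.add(Dense(100, input_dim = features.shape[1], activation = 'relu')) # input layer requires input_dim param
model.add(Dense(200, activation = 'relu'))
model.add(Dense(100, activation = 'relu'))
model.add(Dense(50, activation = 'relu'))
model.add(Dropout(0.5))
model.add(Dense(3, activation='softmax')) # one output per class: company, person, address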

These are the results:

              precision    recall  f1-score   support

     company       0.97      0.92      0.95       636
      person       0.93      0.97      0.95       697
     address       1.00      1.00      1.00       667

    accuracy                           0.97      2000
   macro avg       0.97      0.96      0.97      2000
weighted avg       0.97      0.97      0.97      2000

For example, if I predict on these string inputs:

input_strs = ['Amazon Inc', 'Jeff Bezos', 'Elon Musk', '24 Avenue Paris']

they are classified as:

 ['company', 'person', 'person', 'address']

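The model works well, but I noticed that when I enter a string that represents, say, a phone number, some random digits, or a random phrase, it makes big mistakes. For example, if I enter: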

['+435 542 425 54 24', '426266245', 'as long as the']

I get:

 ['address', 'company', 'address']

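My question is: how can I handle such unseen classes? How should I deal with string inputs that do not match any basic "form" the model can classify correctly?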

【Comments】:

Tags: python machine-learning keras

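【Solution 1】:

I suggest creating an additional category named something like "hmmm..." and filling it with plenty of strings that do not belong to any of the classes you are interested in.

It is easy to write a small script that reads text from the internet or from books for a while and, every time it finds a string that is not a company, an address, or a person, saves it to the "hmmm..." category.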

You then have a DNN that assigns every "strange" input to the "hmmm..." class.
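
A minimal sketch of how such a catch-all class could be filled, assuming synthetic junk strings are an acceptable starting point. The label name `OTHER_LABEL`, the generator functions, and the tiny `train` list are all hypothetical stand-ins, not part of the original model or data:

```python
import random
import string

# Hypothetical label for everything that is not a company, person, or address.
OTHER_LABEL = "other"


def random_phone(rng):
    """Return a phone-like string such as '+435 542 425 54 24'."""
    groups = [str(rng.randint(10, 999)) for _ in range(rng.randint(3, 5))]
    return "+" + " ".join(groups)


def random_digits(rng):
    """Return a run of 5 to 12 random digits."""
    return "".join(rng.choices(string.digits, k=rng.randint(5, 12)))


def random_letters(rng):
    """Return 1 to 5 random lowercase 'words'."""
    words = ["".join(rng.choices(string.ascii_lowercase, k=rng.randint(2, 8)))
             for _ in range(rng.randint(1, 5))]
    return " ".join(words)


def make_other_samples(n, seed=0):
    """Generate n (text, label) pairs for the catch-all class."""
    rng = random.Random(seed)
    generators = [random_phone, random_digits, random_letters]
    return [(rng.choice(generators)(rng), OTHER_LABEL) for _ in range(n)]


# Tiny stand-in for the real labelled training strings.
train = [("Amazon Inc", "company"), ("Jeff Bezos", "person"),
         ("24 Avenue Paris", "address")]
train += make_other_samples(5)

for text, label in train:
    print(f"{label:8s} {text}")
```

With a fourth class in the training data, the final layer would presumably become `Dense(4, activation='softmax')`. Random letters do not cover ordinary phrases such as "as long as the", so fragments of real text (from books or web pages, as suggested above) would still be needed.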

Other solutions exist as well, but this is one way to solve the problem.

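【Comments】:

I was considering this, but I don't think it's a good solution. I would need to collect dates, phone numbers, random numbers, and many other kinds of strings and put them all into one class. There must be a better solution.
I will give you an alternative solution as a separate answer. Hope this helps!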

【Solution 2】:

Another solution, more direct but in my opinion less accurate in the long run, is to add some simple logic after the softmax, like this:

import numpy as np

# Class names, in the same order as the softmax outputs
classes = ['company', 'person', 'address']

# Example softmax output to play with; the three values must sum to 1
softmax_output = np.array([0.3, 0.3, 0.0])
softmax_output[2] = 1 - (softmax_output[0] + softmax_output[1])

# Pick the most probable class...
result = np.argmax(softmax_output)
predicted_class = classes[result]

uncertainty_threshold = 0.5

# ...but fall back to 'hmmm...' when even the best probability is too low
if np.max(softmax_output) <= uncertainty_threshold:
    predicted_class = 'hmmm...'

# Finally, show the result
print(predicted_class)

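You can easily control the "effect" of this additional logic through the uncertainty threshold parameter. If you set it to 0, you get exactly the same results as with your current solution, because the largest softmax output is always greater than 0. Raising it marks more low-confidence predictions as "hmmm...", which should relieve your headache with illogical classifications somewhat; at 1, every input ends up in "hmmm...". Finding a good value by manual testing should be straightforward.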
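
The same rule can be applied to a whole batch of predictions and tried at several thresholds. This is only a sketch: the function name `predict_with_rejection` is hypothetical, and the probabilities below are made up for illustration rather than taken from the model (in practice they would presumably come from `model.predict`):

```python
import numpy as np

CLASSES = ["company", "person", "address"]
REJECT_LABEL = "hmmm..."  # same placeholder label as in the answer


def predict_with_rejection(probs, threshold):
    """Map each row of softmax outputs to a class name, or to REJECT_LABEL
    when the highest probability does not exceed the threshold."""
    probs = np.asarray(probs, dtype=float)
    best = probs.argmax(axis=1)
    confident = probs.max(axis=1) > threshold
    return [CLASSES[i] if ok else REJECT_LABEL for i, ok in zip(best, confident)]


# Made-up softmax outputs, for illustration only (not real model output).
probs = np.array([
    [0.95, 0.03, 0.02],   # clearly a company
    [0.10, 0.85, 0.05],   # clearly a person
    [0.30, 0.30, 0.40],   # no class stands out
])

# Compare how many rows are rejected as the threshold rises.
for threshold in (0.0, 0.5, 0.9):
    print(threshold, predict_with_rejection(probs, threshold))
```

One caveat: a softmax network can be confidently wrong on inputs unlike anything it was trained on, so a threshold only catches the cases where the probabilities are actually spread out.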

Other solutions exist as well, but this is a second way to solve the problem.

【Comments】: