Why does texts_to_sequences() output an empty array?

【Question title】: Why does texts_to_sequences() output an empty array?
【Posted】: 2020-03-06 02:05:17
【Question】:

I am trying to make predictions with a pre-trained model.

However, texts_to_sequences(twt) returns an empty array, so the prediction is always negative, for every input.

import numpy as np
from keras.preprocessing.sequence import pad_sequences

# `tokenizer` and `model` are the pre-fitted tokenizer and the loaded model
twt = ['happy']
# vectorize the tweet with the pre-fitted tokenizer instance
twt = tokenizer.texts_to_sequences(twt)
print(twt)
# pad the tweet so it has exactly the same shape as the `embedding_2` input
twt = pad_sequences(twt, maxlen=50, dtype='int32', value=0)
print(twt)
sentiment = model.predict(twt, batch_size=1, verbose=2)[0]
print(sentiment)
if np.argmax(sentiment) == 0:
    print("negative")
elif np.argmax(sentiment) == 1:
    print("positive")

Output:

[[]]
[0.89889544 0.10110457]
negative

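How can I fix this?

【Comments】:

Did the texts the pre-fitted tokenizer instance was fitted on contain the word happy? In other words, what do you get if you run t.word_index.get("happy", "word not in the vocabulary")?
Yes, it is included. I also checked the dataset.
Good (: What values do t.num_words and t.oov_token have? If t.oov_token is None and t.num_words < t.word_index.get("happy", "word not in the vocabulary"), that would explain the empty array. In that case, two possible fixes are to increase t.num_words and/or set oov_token = "<OOV>" before fitting the tokenizer on the texts.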
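
The last comment can be illustrated with a simplified pure-Python sketch of the lookup rule. This is not Keras's actual code: `to_sequence`, its parameters, and the sample vocabulary are hypothetical names chosen for illustration. The assumed rule is that a word is kept only when its index is below `num_words`, and that unknown or cut-off words are either dropped or mapped to an OOV index.

```python
def to_sequence(words, word_index, num_words=None, oov_index=None):
    """Map words to integer ids, mimicking in simplified form how a
    fitted tokenizer filters words when converting text to sequences."""
    seq = []
    for w in words:
        i = word_index.get(w)
        if i is not None and (num_words is None or i < num_words):
            seq.append(i)          # known word inside the num_words cutoff
        elif oov_index is not None:
            seq.append(oov_index)  # unknown or cut-off word, OOV token set
        # otherwise the word is silently dropped
    return seq


# Hypothetical vocabulary: a lower index means a more frequent word.
# Index 1 is left free here so it can stand in for the OOV token.
word_index = {"the": 2, "movie": 3, "happy": 8}

print(to_sequence(["happy"], word_index))                            # [8]
print(to_sequence(["happy"], word_index, num_words=5))               # []
print(to_sequence(["happy"], word_index, num_words=5, oov_index=1))  # [1]
```

The second call reproduces the symptom from the question: the word is in the vocabulary, but its index lies beyond the cutoff and no OOV token is set, so the sequence comes back empty.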

Tags: python tensorflow keras lstm recurrent-neural-network

【Solution 1】:

I ran into the same problem. You just need to pass a list, not a bare string, to both tokenizer.fit_on_texts and tokenizer.texts_to_sequences.

Example: tokenizer.fit_on_texts([test_word])

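import keras as ks  # assumption: `ks` is the answer's alias for keras
from keras.preprocessing.text import Tokenizer

model = ks.models.load_model('trained')
tokenizer = Tokenizer(num_words=5000)
test_word = "This is soo cool"
# pass lists, not bare strings, to both calls
tokenizer.fit_on_texts([test_word])
tw = tokenizer.texts_to_sequences([test_word])
print('tw: ', tw)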
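
Why the list matters can be sketched in plain Python. The helper below is a hypothetical, simplified stand-in for vocabulary building (lowercase, split on whitespace, no frequency ranking). It only assumes that the tokenizer iterates over its argument and treats each element as one text, so a bare string gets iterated character by character.

```python
def build_vocab(texts):
    """Collect the tokens seen when iterating over `texts`, treating each
    element as one text (simplified: lowercase, split on whitespace)."""
    vocab = []
    for text in texts:
        for token in text.lower().split():
            if token not in vocab:
                vocab.append(token)
    return vocab


sentence = "This is soo cool"

# A list holding one string: each element is a whole sentence.
print(build_vocab([sentence]))   # ['this', 'is', 'soo', 'cool']

# A bare string: iteration yields single characters, so the vocabulary
# ends up holding letters, and whole words can never be looked up in it.
print(build_vocab(sentence))     # ['t', 'h', 'i', 's', 'o', 'c', 'l']
```

With a vocabulary of single letters, looking up a whole word such as "cool" finds nothing, which is consistent with the empty sequence the answer describes.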

【Comments】: