How do you pass text to a TensorFlow model to return a prediction?

[Question title]: How do you pass text to a TensorFlow model to return a prediction?
[Posted]: 2021-08-01 19:42:20
[Question description]:

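I'm new to TF/Python and have created a model that classifies text by toxicity level (obscene, toxic, threat, etc.). This is what I have so far. It does print the summary, so I know the model is loading correctly. How do I pass text to the model to get a prediction back? Any help would be much appreciated.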

import os
import numpy as np
import tensorflow as tf
from tensorflow import keras

checkpoint_path = "tf_model/the_model/saved_model.pb"
checkpoint_dir = os.path.dirname(checkpoint_path)

new_model = tf.keras.models.load_model(checkpoint_dir)

# Check its architecture
new_model.summary()

inputs = [
    "tenserflow seems like it fits the bill but there are zero tutorials that outline how to reuse a model in a production environment "
]

predictions = new_model.predict(inputs)
print(predictions)

I get a lot of error messages. Some of the more verbose ones are below:

WARNING:tensorflow:Model was constructed with shape (None, 150) for input KerasTensor(type_spec=TensorSpec(shape=(None, 150), dtype=tf.float32, name='input_1'), name='input_1', description="created by layer 'input_1'"), but it was called on an input with incompatible shape (None, 1).

ValueError: Negative dimension size caused by subtracting 3 from 1 for '{{node model/conv1d/conv1d}} = Conv2D[T=DT_FLOAT, data_format="NHWC", dilations=[1, 1, 1, 1], explicit_paddings=[], padding="VALID", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true](model/conv1d/conv1d/ExpandDims, model/conv1d/conv1d/ExpandDims_1)' with input shapes: [?,1,1,256], [1,3,256,64].
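
For readers puzzling over the ValueError, the shape arithmetic behind it can be sketched roughly as follows. This is an illustration only, not TensorFlow code: the helper name `conv1d_valid_output_length` is made up here, and it simply applies the usual output-length formula for a convolution with padding="valid". The model was built for 150 token ids per sample, while the call above was seen as shape (None, 1), so the size-3 kernel has only one timestep to slide over.

```python
def conv1d_valid_output_length(input_length, kernel_size, stride=1):
    """Output length of a 1-D convolution with padding='valid'.

    Hypothetical helper for illustration; not part of TensorFlow.
    """
    return (input_length - kernel_size) // stride + 1


# Shape used at training time: 150 token ids per comment, kernel_size=3.
print(conv1d_valid_output_length(150, 3))  # 148

# Shape reported in the warning: one timestep per sample, kernel_size=3.
print(conv1d_valid_output_length(1, 3))    # -1, hence "negative dimension size"
```

The likely takeaway, consistent with the warning, is that the model needs a (batch, 150) array of token ids rather than a list of raw strings.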

Here is the Python code used to create the model and test predictions with it, which works fine:

import tensorflow as tf
import numpy as np
import pandas as pd

import os

TRAIN_DATA = "datasets/train.csv"
GLOVE_EMBEDDING = "embedding/glove.6B.100d.txt"

train = pd.read_csv(TRAIN_DATA)

# fillna returns a new Series, so assign the result back
train["comment_text"] = train["comment_text"].fillna("fillna")

x_train = train["comment_text"].str.lower()
y_train = train[["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]].values

max_words = 100000
max_len = 150

embed_size = 100

tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=max_words, lower=True)

tokenizer.fit_on_texts(x_train)

x_train = tokenizer.texts_to_sequences(x_train)

x_train = tf.keras.preprocessing.sequence.pad_sequences(x_train, maxlen=max_len)

embeddings_index = {}

with open(GLOVE_EMBEDDING, encoding='utf8') as f:
    for line in f:
        values = line.rstrip().rsplit(' ')
        word = values[0]
        embed = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = embed

word_index = tokenizer.word_index

num_words = min(max_words, len(word_index) + 1)

embedding_matrix = np.zeros((num_words, embed_size), dtype='float32')

for word, i in word_index.items():

    if i >= max_words:
        continue

    embedding_vector = embeddings_index.get(word)

    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

input = tf.keras.layers.Input(shape=(max_len,))

# input_dim must equal the number of rows in embedding_matrix (num_words)
x = tf.keras.layers.Embedding(num_words, embed_size, weights=[embedding_matrix], trainable=False)(input)

x = tf.keras.layers.Bidirectional(tf.keras.layers.GRU(128, return_sequences=True, dropout=0.1,
                                                      recurrent_dropout=0.1))(x)

x = tf.keras.layers.Conv1D(64, kernel_size=3, padding="valid", kernel_initializer="glorot_uniform")(x)

avg_pool = tf.keras.layers.GlobalAveragePooling1D()(x)
max_pool = tf.keras.layers.GlobalMaxPooling1D()(x)

x = tf.keras.layers.concatenate([avg_pool, max_pool])

preds = tf.keras.layers.Dense(6, activation="sigmoid")(x)

model = tf.keras.Model(input, preds)

model.summary()

model.compile(loss='binary_crossentropy', optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3), metrics=['accuracy'])

batch_size = 128

checkpoint_path = "tf_model/cp.ckpt"
checkpoint_dir = os.path.dirname(checkpoint_path)

cp_callback = tf.keras.callbacks.ModelCheckpoint(checkpoint_path,
                                                 save_weights_only=True,
                                                 verbose=1)

callbacks = [
    tf.keras.callbacks.EarlyStopping(patience=5, monitor='val_loss'),
    tf.keras.callbacks.TensorBoard(log_dir='logs'),
    cp_callback
]

model.fit(x_train, y_train, validation_split=0.2, batch_size=batch_size,
          epochs=1, callbacks=callbacks, verbose=1)

latest = tf.train.latest_checkpoint(checkpoint_dir)

model.load_weights(latest)

# Save the entire model as a SavedModel.
model.save('tf_model/the_model')

predictions = model.predict(np.expand_dims(x_train[42], 0))
print(tokenizer.sequences_to_texts([x_train[42]]))
print(y_train[42])
print(predictions)

Final solution:

import os
import numpy as np
import tensorflow as tf
from tensorflow import keras

checkpoint_path = "tf_model/the_model/saved_model.pb"
checkpoint_dir = os.path.dirname(checkpoint_path)
new_model = tf.keras.models.load_model(checkpoint_dir)

max_words = 100000
max_len = 150

# Check its architecture
# new_model.summary()

inputs = ["tenserflow seems like it fits the bill but there are zero tutorials that outline how to reuse a model in a production environment."]

# use same tokenizer used to build model
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=max_words, lower=True)
tokenizer.fit_on_texts(inputs)

# pass string to tokenizer and that 'array' is passed to predict
sequence = tokenizer.texts_to_sequences(inputs) # same tokenizer which is used on train data.
sequence = tf.keras.preprocessing.sequence.pad_sequences(sequence, maxlen = max_len)
predictions = new_model.predict(sequence)
print(predictions)
# [[0.0365479  0.01275077 0.02102855 0.00647011 0.02302513 0.00406089]]

[Question comments]:

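You just need to preprocess your text the same way as the training set: tokenization, padding, and so on.
Not sure I follow. I tested the build by passing in text and it made a prediction. Why doesn't it work after loading with load_model? Is there an end-to-end tutorial that shows how to do this? I haven't found one yet.
I'm not sure which layers your model includes, but if you apply the same preprocessing to the input as you did to the training data, it should work. As far as I can tell, the problem is in the preprocessing of the test data. I mean, if you used a Tokenizer, then you should use that same tokenizer to tokenize the test data.
I've added the code that creates the model to the question. I'm not sure how to tokenize the string, or whether I really need to reuse the model?
I added tokenizer.fit_on_texts(inputs) to the final solution, and now different strings return different results.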
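
A note on why calling fit_on_texts on the new input is risky, sketched with a simplified stand-in rather than the real Keras Tokenizer: the functions `fit_word_index` and `encode` and the sample sentences below are invented for illustration. The stand-in ranks words by frequency starting at 1, which is roughly how a fitted tokenizer builds its word index, so an index fitted on the input alone assigns ids that mean something else to the trained embedding layer.

```python
from collections import Counter


def fit_word_index(texts):
    """Simplified stand-in for fitting a tokenizer: rank words by frequency, from 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def encode(texts, word_index):
    """Map each known word to its id; words missing from the index are dropped."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]


# Made-up training data, for illustration only.
train_texts = ["you are great", "you are awful", "awful awful comment"]
train_index = fit_word_index(train_texts)

new_text = ["great comment"]
refit_index = fit_word_index(new_text)  # index fitted on the input alone

print(encode(new_text, train_index))  # [[4, 5]] -> "great", "comment" as seen in training
print(encode(new_text, refit_index))  # [[1, 2]] -> ids that meant "awful", "you" in training
```

In this toy case the model would receive the ids for "awful you" instead of "great comment", so predictions would change with the input but not in a meaningful way.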

Tags: python python-3.x tensorflow tensorflow2.0 tf.keras

[Solution 1]:

The input needs to be processed the same way as the training data. This can be done as follows:

inputs = [
    "tenserflow seems like it fits the bill but there are zero tutorials that outline how to reuse a model in a production environment"
]

# tokenizer must be the same instance that was fit on the training data
sequence = tokenizer.texts_to_sequences(inputs)
# pad to the length the model was built with (max_len = 150)
sequence = tf.keras.preprocessing.sequence.pad_sequences(sequence, maxlen=max_len)

predictions = new_model.predict(sequence)
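
To make the padding step concrete, here is a rough pure-Python sketch of what pad_sequences does with its default settings (padding='pre', truncating='pre', value 0). The function `pad_pre` and the sample ids are made up for illustration, and the real Keras function returns a NumPy array rather than nested lists.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad or left-truncate each id list to exactly maxlen entries.

    Illustrative stand-in for the default behaviour of Keras pad_sequences;
    assumes maxlen is a positive integer.
    """
    padded = []
    for seq in sequences:
        seq = seq[-maxlen:]  # keep the last maxlen ids if the text is too long
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded


max_len = 150
batch = pad_pre([[4, 5]], max_len)  # one short, already tokenized comment

print(len(batch), len(batch[0]))  # 1 150 -> matches the (None, 150) model input
print(batch[0][-3:])              # [0, 4, 5]
```

The result has one row of 150 ids, which is the shape the model's input layer was constructed with.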

[Comments]:

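Thanks! It worked with a small modification (see the final solution in the original question).
To be a bit clearer, what I mean by the tokenizer is that it should be the same tokenizer instance that was fit on your training data, because it holds the unique word index for your data. In your edit, you created a new instance.
Do you know why the results are still the same after I change the input?
I suspect there is a problem with the tokenizer. Are you saving it? Check this: stackoverflow.com/a/45737582/13726668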
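
One common way to keep the fitted tokenizer around is to serialize it with Python's pickle module in the training script and load it again in the prediction script. The sketch below is an assumption about how that could look, not code from this thread: `save_tokenizer`, `load_tokenizer`, and the file name are hypothetical, and a plain dict stands in for the fitted tokenizer so the example runs without TensorFlow.

```python
import os
import pickle
import tempfile


def save_tokenizer(tokenizer, path):
    """Write a fitted tokenizer (any picklable object) to disk."""
    with open(path, "wb") as handle:
        pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)


def load_tokenizer(path):
    """Read back the object written by save_tokenizer."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Stand-in for the tokenizer fitted in the training script (illustration only).
fitted = {"awful": 1, "you": 2, "are": 3}

path = os.path.join(tempfile.mkdtemp(), "tokenizer.pickle")  # hypothetical location
save_tokenizer(fitted, path)
restored = load_tokenizer(path)

print(restored == fitted)  # True
```

In the prediction script, the restored object would then take the place of the freshly created tokenizer, so texts_to_sequences uses the word index learned from the training data. Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.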