Answers: Tensorflow hub-NNLM word embedding using sentiment140 data gives input shape error

【Question title】: Tensorflow hub-NNLM word embedding using sentiment140 data gives input shape error
【Posted】: 2021-09-22 13:51:42
【Question description】:

I am using the TensorFlow Hub word embedding "https://tfhub.dev/google/nnlm-en-dim128/2" to run sentiment analysis on the Kaggle "sentiment140" dataset.

Dataset: Kaggle ("sentiment140") https://www.kaggle.com/kazanova/sentiment140 TensorFlow Hub: https://tfhub.dev/google/nnlm-en-dim128/2

I am using a Keras Sequential model here, and when I fit the model it raises a ValueError:

ValueError: Python inputs incompatible with input_signature:
      inputs: (
        Tensor("IteratorGetNext:0", shape=(None, 128), dtype=float32))
      input_signature: (
        TensorSpec(shape=(None,), dtype=tf.string, name=None))

My code:

import pandas as pd
import tensorflow as tf
from sklearn.model_selection import  train_test_split
import seaborn as sns
import tensorflow_hub as hub
from tensorflow.keras import Sequential
import keras

tweet_df = pd.read_csv("training.1600000.processed.noemoticon.csv", names=['polarity', 'id', 'date', 'query', 'user', 'text'],encoding='latin-1')

tweet_df.info()

tweet_df.head()

"""#### 2.) Data Visualization"""

tweet_df['polarity'] = tweet_df['polarity'].replace(to_replace=4,value=1)

### Print two movies reviews from each class

print("Movie Review Polarity Negative class 0 :\n", tweet_df[tweet_df['polarity']==0]['text'].head(2) )

print("\n\nMovie Review Polarity Positive class 1 :\n", tweet_df['text'][tweet_df['polarity']==1].head(2) )

class_dist = tweet_df['polarity'].value_counts().rename_axis('Class Label').reset_index(name='Tweets')
#class_dist = class_dist['Class Label'].replace({0:'Negative',1:'Positve'})
class_dist

## Bar graph of Distribution of Classes
class_dist['class'] = ['Positive','Negative']
sns.set_theme(style='whitegrid')
sns.barplot(x='Class Label', y='Tweets', hue='class', data= class_dist)

### Train and test split 
X = tweet_df.iloc[:,5]
y = tweet_df.iloc[:,0]
X_train, X_test,y_train, y_test = train_test_split(X,y,random_state=5, test_size=0.2)

print("Training shape of X and y : ", X_train.shape ,y_train.shape)
print("Testing shape of X and y : ", X_test.shape ,y_test.shape)

"""#### 3.) Data Pre-processing"""

embed = hub.load("https://tfhub.dev/google/nnlm-en-dim128/2")
X_train_embed = embed(X_train)

y_train = tf.keras.utils.to_categorical(y_train,2)

X_train_embed.shape


X_sample = X_train_embed[:1000]
y_sample = y_train[:1000]
y_sample = tf.keras.utils.to_categorical(y_sample,2)


"""#### 4.) Model Building"""

hub_layer = hub.KerasLayer('https://tfhub.dev/google/nnlm-en-dim128/2',input_shape=[],dtype=tf.string,trainable=False)

model = Sequential()
model.add(hub_layer)
model.add(keras.layers.Dense(128, 'relu', name ='layer_1'))
model.add(keras.layers.Dense(64, 'relu', name = 'layer_2'))
model.add(keras.layers.Dense(2, activation='sigmoid', name='output'))

model.compile(optimizer='adam',loss= 'BinaryCrossentropy',  #'categorical_crossentropy' ,
              metrics=['accuracy'] )

NN_model = model.fit(X_sample, y_sample, epochs=20, validation_split=0.1, verbose=1)

Input shapes:

X_sample.shape

TensorShape([1000, 128])

y_sample.shape

(1000, 2, 2)

X_sample

<tf.Tensor: shape=(1000, 128), dtype=float32, numpy=
array([[ 0.10381411,  0.07044576, -0.0282673 , ...,  0.08205549,
0.15822364, -0.10019408],
[-0.03332436, -0.00529242,  0.20348714, ..., -0.14174528,
0.05178985, -0.12599435],
[ 0.2461916 , -0.03084931,  0.05861813, ...,  0.07956063,
-0.03579932,  0.07493019],
[ 0.4102695 ,  0.15445013,  0.19045362, ...,  0.12681636,
0.12362286, -0.03969387],
[-0.0144283 , -0.05236297,  0.04851832, ...,  0.05562773,
0.01529189,  0.12605236],
[ 0.29280087,  0.05795274, -0.11779188, ..., -0.01890504,
0.02824693, -0.13629636]], dtype=float32)>
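
The label shape (1000, 2, 2) printed above is a separate issue from the ValueError. The code calls tf.keras.utils.to_categorical on y_train and then again on y_sample, and each call appends a class axis, whereas the model's two-unit output expects labels of shape (1000, 2). The sketch below reproduces the effect with a pure-Python stand-in; one_hot and shape are illustrative helpers that mimic the behaviour, not the Keras implementation.

```python
def one_hot(values, num_classes):
    """Stand-in for to_categorical: append a one-hot axis to every entry."""
    if isinstance(values, list):
        return [one_hot(v, num_classes) for v in values]
    return [1.0 if i == int(values) else 0.0 for i in range(num_classes)]


def shape(nested):
    """Return the dimensions of a rectangular nested list."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


labels = [0, 1, 1, 0]        # polarity values after mapping 4 -> 1
once = one_hot(labels, 2)    # encoded once: one row of two values per label
twice = one_hot(once, 2)     # encoded again: an extra axis appears

print(shape(once), shape(twice))  # (4, 2) (4, 2, 2)
```

Encoding the labels only once keeps them at (n, 2).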

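【Comments】:

How was X_sample created? It looks as though you are trying to feed a float matrix into a model that expects a vector of strings. Please provide a minimal reproducible snippet to make debugging easier for us.
Yes, I have updated X_sample. Thanks.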

Tags: keras sentiment-analysis word-embedding tensorflow-hub language-model

【Solution 1】:

As stated at https://tfhub.dev/google/nnlm-en-dim128/2, the model expects a vector of strings as input. You are essentially calling the model twice, because you first execute

embed = hub.load("https://tfhub.dev/google/nnlm-en-dim128/2")
X_train_embed = embed(X_train)  # (n, 128) float matrix

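and then pass that embedding to model, which itself takes strings as input because it starts with the NNLM KerasLayer.

I suggest removing embed and X_train_embed, and calling model.fit with X_train, i.e. with raw strings, for example: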

import numpy as np
model.fit(np.array(["Lyx is cool", "Lyx is not cool"]), np.array([1, 0]), epochs=20, validation_split=0.1, verbose=1)
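
To make the suggestion concrete, here is a minimal sketch, assuming TensorFlow 2.x with the Keras 2 API, tensorflow_hub installed, and network access to tfhub.dev. The helpers as_text_batch and build_and_fit are hypothetical names introduced for illustration. The single sigmoid output with 0/1 labels is a simplification that matches the labels used in this answer, not the two-unit output from the question. It has not been run against the asker's environment.

```python
def as_text_batch(values):
    """Return the inputs as a flat list of str, rejecting anything else."""
    batch = list(values)
    for item in batch:
        if not isinstance(item, str):
            raise TypeError(f"expected raw text, got {type(item).__name__}")
    return batch


def build_and_fit(texts, labels, epochs=2):
    """Hypothetical helper: train on raw strings; NNLM embeds them in the model."""
    # Imported lazily so as_text_batch stays usable without TensorFlow installed.
    import tensorflow as tf
    import tensorflow_hub as hub

    model = tf.keras.Sequential([
        # The hub layer takes a 1-D batch of strings and returns (batch, 128) floats.
        hub.KerasLayer("https://tfhub.dev/google/nnlm-en-dim128/2",
                       input_shape=[], dtype=tf.string, trainable=False),
        tf.keras.layers.Dense(64, activation="relu"),
        # Assumption: one sigmoid unit with 0/1 labels instead of two units.
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    # No separate embed(...) call: the strings go straight into fit.
    model.fit(tf.constant(as_text_batch(texts)), tf.constant(labels),
              epochs=epochs)
    return model
```

With the dependencies available, build_and_fit(["Lyx is cool", "Lyx is not cool"], [1, 0]) would train on the two example sentences. Passing a precomputed float embedding instead is rejected by as_text_batch before it reaches the hub layer.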

【Comments】:

Hi. Two things are happening now. 1.) If I pass X_train directly to model.fit as strings, each epoch trains on 36000 data points instead of the original 1280000, and the validation split here is only 10%. How can it train on only 36000 per epoch? 2.) If I convert the input strings X_train to an np.array, model.fit throws ValueError: Python inputs incompatible with input_signature: inputs: (Tensor("ExpandDims:0", shape=(32, 1), dtype=string)) input_signature: (TensorSpec(shape=(None,), dtype=tf.string, name=None))
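
On the first point, a likely explanation (inferred from the numbers, not confirmed in the thread) is that the count Keras prints per epoch is the number of batches rather than the number of samples. The question's model.fit call does not set batch_size, so the Keras default of 32 applies, which also agrees with the shape=(32, 1) in the error message. The arithmetic can be checked as follows; the variable names are illustrative.

```python
import math

train_rows = 1_280_000     # rows in X_train after the 80/20 split
validation_percent = 10    # validation_split=0.1 in the model.fit call
batch_size = 32            # Keras default when batch_size is not passed

# Rows left for training once the validation share is held out.
rows_used = train_rows * (100 - validation_percent) // 100
# Number of batches, which is what the progress bar counts.
steps_per_epoch = math.ceil(rows_used / batch_size)

print(rows_used, steps_per_epoch)  # 1152000 36000
```

So 36000 steps of 32 rows each cover all 1,152,000 training rows in every epoch.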