[Posted]: 2021-12-24 15:02:53
[Question]:
I am trying to build an LSTM model for text generation, but I get an error when I try to fit the model.
Traceback:
> InvalidArgumentError: 2 root error(s) found. (0) Invalid argument: Cannot batch tensors with different shapes in component 0. First element had shape [21] and element 1 had shape [17]. [[node IteratorGetNext (defined at tmp/ipykernel_7804/4234150290.py:1) ]] (1) Invalid argument: Cannot batch tensors with different shapes in component 0. First element had shape [21] and element 1 had shape [17]. [[node IteratorGetNext (defined at tmp/ipykernel_7804/4234150290.py:1) ]] [[IteratorGetNext/_4]] 0 successful operations. 0 derived errors ignored. [Op:__inference_train_function_35783]
Code:
import tensorflow as tf

batch_size = 64
AUTOTUNE = tf.data.experimental.AUTOTUNE
buffer_size = train_ds.cardinality().numpy()
train_ds = train_ds.shuffle(buffer_size=buffer_size)\
    .batch(batch_size=batch_size, drop_remainder=True)\
    .cache()\
    .prefetch(AUTOTUNE)
test_ds = test_ds.shuffle(buffer_size=buffer_size)\
    .batch(batch_size=batch_size, drop_remainder=True)\
    .cache()\
    .prefetch(AUTOTUNE)
from tensorflow.keras import Model
from tensorflow.keras.layers import Input, Embedding, LSTM, Dropout, Dense

def create_model():
    n_units = 256
    max_len = 64
    vocab_size = 10000
    # Fixed-length input: every sequence must contain exactly max_len token ids.
    inputs_tokens = Input(shape=(max_len,), dtype=tf.int32)
    # inputs_tokens = Input(shape=(None,), dtype=tf.int32)
    embedding_layer = Embedding(vocab_size, 256)
    x = embedding_layer(inputs_tokens)
    x = LSTM(n_units)(x)
    x = Dropout(0.2)(x)
    outputs = Dense(vocab_size, activation='softmax')(x)
    model = Model(inputs=inputs_tokens, outputs=outputs)
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)
    metric_fn = tf.keras.metrics.SparseCategoricalAccuracy()
    model.compile(optimizer="adam", loss=loss_fn, metrics=[metric_fn])
    return model
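The commented-out `Input(shape=(None,))` line hints at one option: leaving the sequence dimension unspecified so the model accepts batches padded to different lengths. The following is only a minimal sketch of that idea. It assumes padding uses token id 0 and that `mask_zero=True` is wanted so the LSTM ignores padded steps; the function name `create_variable_length_model` and the small layer sizes are hypothetical, not the values from the question.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import Model
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense

def create_variable_length_model(vocab_size=50, n_units=8):
    # Sequence length is None, so batches of any padded length are accepted.
    inputs_tokens = Input(shape=(None,), dtype="int32")
    # mask_zero=True treats token id 0 as padding (an assumption about the data).
    x = Embedding(vocab_size, 16, mask_zero=True)(inputs_tokens)
    x = LSTM(n_units)(x)
    outputs = Dense(vocab_size, activation="softmax")(x)
    return Model(inputs=inputs_tokens, outputs=outputs)

sketch_model = create_variable_length_model()

# Two batches with different sequence lengths, right-padded with zeros.
short_batch = np.array([[3, 7, 0], [4, 5, 6]], dtype="int32")
long_batch = np.array([[3, 7, 9, 2, 0]], dtype="int32")

short_probs = sketch_model(short_batch).numpy()
long_probs = sketch_model(long_batch).numpy()
```

Both calls return one probability vector over the vocabulary per input row, regardless of the padded length.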
When I inspect the type spec with train_ds.element_spec, I get:
(TensorSpec(shape=<unknown>, dtype=tf.int64, name=None),
TensorSpec(shape=(64,), dtype=tf.int64, name=None))
Any idea what I am doing wrong here? Should I use padded_batch? Should I reshape my dataset?
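The traceback says that `batch` cannot stack elements of shape [21] and [17], which is what happens when each line is vectorized to a different number of tokens. A minimal sketch of what `padded_batch` does in that situation is below. The toy sequences and labels are made up for illustration. `padded_shapes` is passed explicitly because the TensorFlow documentation states it must be set when a component has an unknown rank, which the `shape=<unknown>` in the element spec above suggests is the case here.

```python
import tensorflow as tf

# Toy stand-in for the vectorized lyrics: token sequences of different lengths.
sequences = [[5, 2, 9], [7, 1], [4, 8, 3, 6]]
labels = [10, 11, 12]

def gen():
    for seq, label in zip(sequences, labels):
        yield seq, label

toy_ds = tf.data.Dataset.from_generator(
    gen,
    output_signature=(
        tf.TensorSpec(shape=(None,), dtype=tf.int64),
        tf.TensorSpec(shape=(), dtype=tf.int64),
    ),
)

# Plain .batch(3) would raise "Cannot batch tensors with different shapes".
# padded_batch pads each sequence with zeros up to the longest one in the batch.
padded = toy_ds.padded_batch(3, padded_shapes=([None], []))
tokens, targets = next(iter(padded))
```

If the model keeps `Input(shape=(max_len,))`, padding to a fixed length such as `padded_shapes=([64], [])` may be a better fit, though that would presumably fail for any line longer than 64 tokens.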
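Edit:
How I created train_ds:
I have ~100k lines of lyrics as strings in a list, like this:
`['Mic check, I can get smooth to any groove', 'Relax the tongue, let my mic take a cruise', "Around the planet, pack them in like Janet",]`
I use train_test_split to create test and training sets for the features and labels, where the label is the second-to-last word of each line.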
train_text_ds_raw = tf.data.Dataset.from_tensor_slices(
    tf.cast(train_data.values, tf.string)
)
train_cat_ds_raw = tf.data.Dataset.from_tensor_slices(
    tf.cast(train_targets.values, tf.int64),
)
Then I created this function:
vectorize_layer = tf.keras.layers.TextVectorization(
    max_tokens=max_features,
    # standardize=lyrics_corpus,
    split="whitespace",
    ngrams=2,
    output_mode="int",
    # output_sequence_length=max_len,
    # vocabulary=words,
)
def convert_text_input(sample):
    text = sample
    text = tf.expand_dims(text, -1)
    return tf.squeeze(vectorize_layer(text))
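One possible reason the element spec reports `shape=<unknown>` is the bare `tf.squeeze` call: when no axis is given and some dimension is not statically known, TensorFlow cannot tell which dimensions will be removed, so the result has unknown rank. The sketch below illustrates this with `tf.strings.split` standing in for the vectorization layer (an assumption made to keep the example small); the two helper functions are hypothetical.

```python
import tensorflow as tf

lines_ds = tf.data.Dataset.from_tensor_slices(
    ["mic check one two", "relax the tongue"]
)

def split_then_squeeze(sample):
    # Shape (1, n_tokens), where n_tokens is not known when the graph is traced.
    tokens = tf.strings.split(tf.expand_dims(sample, -1)).to_tensor()
    # No axis given: the static rank of the result is unknown.
    return tf.squeeze(tokens)

def split_then_squeeze_axis(sample):
    tokens = tf.strings.split(tf.expand_dims(sample, -1)).to_tensor()
    # Only the leading axis is removed, so the result is known to be rank 1.
    return tf.squeeze(tokens, axis=0)

unknown_spec = lines_ds.map(split_then_squeeze).element_spec
known_spec = lines_ds.map(split_then_squeeze_axis).element_spec
```

With a known rank, `padded_batch` can infer how to pad without being told the shapes explicitly.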
Applying the function:
train_text_ds = train_text_ds_raw.map(convert_text_input,
    num_parallel_calls=tf.data.experimental.AUTOTUNE)
Putting the labels and text back together:
train_ds = tf.data.Dataset.zip(
    (
        train_text_ds,
        train_cat_ds_raw
    )
)
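An alternative to padding at batch time is to have the vectorization layer emit fixed-length sequences, which is what the commented-out `output_sequence_length` argument does. The following is a sketch under assumptions: the three lines of text, the label ids, `max_len = 8`, and the names `fixed_vectorizer` and `to_fixed_ids` are all illustrative and not taken from the question's data.

```python
import tensorflow as tf

max_len = 8  # hypothetical; the model in the question uses 64
lines = tf.constant(["mic check one two", "relax the tongue", "around the planet"])
label_ids = tf.constant([1, 2, 3], dtype=tf.int64)

fixed_vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=100,
    split="whitespace",
    output_mode="int",
    output_sequence_length=max_len,  # pad or truncate every line to max_len ids
)
fixed_vectorizer.adapt(lines)

def to_fixed_ids(sample):
    # Add a batch axis for the layer, then take the single row back out.
    return fixed_vectorizer(tf.expand_dims(sample, -1))[0]

text_ds = tf.data.Dataset.from_tensor_slices(lines).map(to_fixed_ids)
label_ds = tf.data.Dataset.from_tensor_slices(label_ids)

# Every text element now has the same length, so plain batch() works.
fixed_ds = tf.data.Dataset.zip((text_ds, label_ds)).batch(3, drop_remainder=True)
batch_tokens, batch_labels = next(iter(fixed_ds))
```

Setting the length to the same value as the model's `max_len` would keep the dataset consistent with `Input(shape=(max_len,))`.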
Example table:
| | Predictor | Label | Label ID |
|---|---|---|---|
| 0 | Mic check, I can get smooth to any groov... | groove | 8167 |
| 1 | Relax the tongue, let my mic take a cru... | cruise | 4692 |
| 2 | Around the planet, pack them in like Jan... | Janet | 9683 |
| 3 | Jackson, she's asking if I can slam it,... | I-- | 9191 |
| 4 | Yo, yo, Redman, man, what the fuck, man?... | man? | 11174 |
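[Comments]:
- Can you show how you created the dataset train_ds?
- Added as an edit.
- Thanks. What exactly is train_targets.values? Integers?
- Integer encodings of the labels (the label is the second-to-last word of each line).
- I added an example table; the markdown renders correctly in the edit preview but does not look right here. I have the lyrics, the label, and the label_id in a DataFrame.
Tags: python tensorflow keras tensorflow-datasets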