InvalidArgumentError in Keras Embedding layer due to a negative index lookup: answers

【Question title】: InvalidArgumentError in Keras Embedding layer due to a negative index lookup
【Posted】: 2018-04-30 12:38:21
【Question】:

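I get an InvalidArgumentError when trying to train a deep learning model implemented in Keras. I searched for similar issues in Keras and TensorFlow, but my error message seems unusual because of the index it reports as not found. The error message is below.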

tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[427,9] = -2147483648 is not in [0, 38545) [[Node: time_distributed_1/Gather = Gather[Tindices=DT_INT32, Tparams=DT_FLOAT, validate_indices=true, _device="/job:localhost/replica:0/task:0/device:CPU:0"](embeddings/read, time_distributed_1/Cast)]]

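I am using Python 3.5.2, TensorFlow 1.4.1, and Keras 2.1.5.

Note that the index being looked up is not merely negative: it equals -2^31, the lowest 32-bit signed integer value.

Below is the code I use to build the model.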

import numpy
from keras.models import Sequential
from keras.layers import (Embedding, Bidirectional, LSTM, TimeDistributed,
                          GlobalMaxPooling1D)
from keras_contrib.layers import CRF

# Form embedding layer's weight matrix
V = len(word_to_index) + 1  # V = 38545
embedding_weights = numpy.zeros((V, N))
for word, index in word_to_index.items():
    embedding_weights[index, :] = word_vec_dict[word]

embedding_layer = Embedding(V, N,
                            weights=[embedding_weights], mask_zero=True)

model = Sequential()
model.add(TimeDistributed(embedding_layer,
                          input_shape=(C, U)))

model.add(TimeDistributed(Bidirectional(LSTM(M // 2, return_sequences=True))))
model.add(TimeDistributed(GlobalMaxPooling1D()))
model.add(Bidirectional(LSTM(H // 2, return_sequences = True), merge_mode='concat'))
crf = CRF(num_tags, sparse_target=True)
model.add(crf)
model.compile('adam', loss = crf.loss_function, metrics=[crf.accuracy])

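The data fed to this model has dimensions (C, U, N) and integer type (not counting the batch dimension B). Put simply, each sample in a batch is a conversation of length C. Each conversation consists of utterances of fixed length U. Finally, each utterance consists of N positive indices (i.e., the indices of the corresponding words in the vocabulary).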

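I even checked the entire dataset (after converting it to indices) with a simple for loop and could not find any index value outside the range [0, 38545). Why would a lookup for an index like -2^31 occur during training?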
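
The range check described above can also be run on the batches that are actually fed to the model, which is where values introduced during batching would show up. A minimal sketch, assuming NumPy integer arrays; the helper name `find_out_of_range` is hypothetical and not part of Keras:

```python
import numpy


def find_out_of_range(batch, vocab_size):
    """Return the positions in `batch` whose value is not in [0, vocab_size)."""
    batch = numpy.asarray(batch)
    bad = (batch < 0) | (batch >= vocab_size)
    # argwhere yields one row of coordinates per offending element
    return numpy.argwhere(bad)


# Toy batch of shape (B, C, U) = (1, 2, 3) with one invalid entry
toy_batch = numpy.array([[[1, 2, 3], [4, 38545, 0]]])
bad_positions = find_out_of_range(toy_batch, vocab_size=38545)
print(bad_positions)
```

Calling this on every batch yielded by a generator, before it reaches `fit_generator`, reports the exact (sample, utterance, word) coordinates of any invalid index.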

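【Comments】:

@DanielMöller As my question mentions, I checked the entire dataset for out-of-range values by iterating over the data with a for loop. There were none. On top of that, I printed the data to a text file and searched for the minus character - with a text editor. Nothing was found.
38545 is greater than the maximum int16 value. Are you sure all of your model layers, data arrays, etc. are int32? When something overflows, it is common to end up with the lower-bound value in Keras...
@DanielMöller I build my training, validation, and test batches by filling an initially empty numpy array of size (B, C, U), created like this: numpy.empty(shape = (B, C, U), dtype=int). Its dtype is int64. As for the rest of the model, I am not sure how to enforce int64, since there is no argument for it. Besides, the error statement seems to say that the error occurs specifically in the first layer of the model.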

Tags: python-3.x tensorflow keras

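【Solution 1】:

I finally solved the problem. I was using batch generation while training the model, and my batch generator function left part of the input array uninitialized.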

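It is not clear to me why exactly the index being looked up was -2147483648. However, I believe the uninitialized part of the array contained values larger than the vocabulary size, some even beyond the bounds of a 32-bit integer, and that this led to undefined behavior.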
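
One plausible explanation, which the source does not confirm, is the cast to int32 that the error message shows in front of the Gather node (`time_distributed_1/Cast`, `Tindices=DT_INT32`): `numpy.empty` returns whatever bytes happen to be in memory, and a value outside the int32 range cannot survive that cast. The NumPy sketch below only illustrates the narrowing. It does not reproduce TensorFlow's cast, and the result of converting an out-of-range float to an integer is platform-dependent, so no output is claimed for the last line.

```python
import numpy

INT32_MIN = numpy.iinfo(numpy.int32).min  # -2147483648, the value in the error

# An int64 value just past the int32 range wraps around when narrowed.
wrapped = numpy.array([2**31], dtype=numpy.int64).astype(numpy.int32)
print(wrapped[0] == INT32_MIN)  # True

# Assumption: the batch reaches the graph as float32 before being cast to
# int32. Converting a float that is far out of range is undefined in C; many
# x86 builds return INT32_MIN, but this is not guaranteed.
with numpy.errstate(invalid="ignore"):
    from_float = numpy.array([1e30], dtype=numpy.float32).astype(numpy.int32)
print(from_float[0])
```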

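Once I properly initialized the whole batch input, the problem went away. Below is a simplified version of the batch generator function I used. The added initialization is marked with a comment to highlight what is described above.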

import copy
import random

import numpy
from keras.utils import to_categorical


def batch_generator(dataset_x, dataset_y, tag_indices, mini_batch_list, C, U,
                    num_tags, word_index_to_append, tag_index_to_append):
    num_mini_batches = len(mini_batch_list)

    index_list = list(range(num_mini_batches))
    random.shuffle(index_list)

    k = -1
    while True:
        k = (k + 1) % len(index_list)
        index = index_list[k]
        conversation_indices = mini_batch_list[index]

        num_conversations = len(conversation_indices)
        batch_features = numpy.empty(shape = (num_conversations, C, U),
                                     dtype = int)
        label_list = []

        for i in range(num_conversations):
            utterances = dataset_x[conversation_indices[i]]
            labels = copy.deepcopy(dataset_y[conversation_indices[i]])
            num_utterances = len(utterances)
            num_labels_to_append = max(0, C - len(labels))
            labels += [tag_index_to_append] * num_labels_to_append
            tags = to_categorical(labels, num_tags)
            del labels

            for j in range(num_utterances):
                utterance = copy.deepcopy(utterances[j])
                num_to_append = max(0, U - len(utterance))
                if num_to_append > 0:
                    appendage = [word_index_to_append] * num_to_append
                    utterance += appendage

                batch_features[i][j] = utterance

            # ADDING THE TWO LINES BELOW SOLVED THE ISSUE
            remaining_space = (C - num_utterances, U)
            batch_features[i][num_utterances:] = numpy.ones(remaining_space) *\
                                                 word_index_to_append
            label_list.append(tags)

        batch_labels = numpy.array(label_list)
        yield batch_features, batch_labels
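
An alternative to patching the unfilled rows afterwards is to allocate the batch with `numpy.full`, so that every cell starts out as the padding index and only real utterances are written over it. A minimal sketch of that idea, not the author's code; `pad_batch` and its arguments are hypothetical names, and the input is assumed to be a list of conversations, each a list of utterances given as lists of word indices:

```python
import numpy


def pad_batch(conversations, C, U, pad_index=0):
    """Build a (B, C, U) int32 array, padded with `pad_index`."""
    batch = numpy.full((len(conversations), C, U), pad_index, dtype=numpy.int32)
    for i, utterances in enumerate(conversations):
        for j, utterance in enumerate(utterances[:C]):
            words = utterance[:U]  # truncate utterances longer than U
            batch[i, j, :len(words)] = words
    return batch


# Two conversations: one with two utterances, one with a single utterance
toy = [[[5, 6], [7]], [[8, 9, 10]]]
padded = pad_batch(toy, C=2, U=3)
print(padded.tolist())
```

Truncating over-long conversations and utterances is a choice made for this sketch; the generator above does not truncate. Using an explicit int32 dtype also keeps the array in the type the lookup expects.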

【Comments】: