IndexError: List Index out of range Keras Tokenizer

[Question title]: IndexError: List Index out of range Keras Tokenizer
[Posted]: 2019-02-23 17:47:51
[Question]:

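I am using the Sentiment140 dataset to learn sentiment analysis with an RNN. I found this tutorial online, which uses the keras.imdb data source, but I wanted to try my own data source, so I adapted the code to my own data. Tutorial: https://towardsdatascience.com/a-beginners-guide-on-sentiment-analysis-with-rnn-9e100627c02e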

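Data preprocessing consists of extracting the series data, tokenizing and padding it, and then sending it to the model for training. I do this in the code below, but whenever I try to run training I get if isinstance(data[0], list): IndexError: list index out of range. I never defined data, which leads me to believe I did something that Keras or TensorFlow doesn't like. Any ideas about what is causing this error?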

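My data is currently in CSV format with the headers SENTIMENT and TEXT. SENTIMENT is 0 for negative and 1 for positive. TEXT is the processed tweet that was collected. Here is a sample.

Dataset CSV (only a few rows shown to save space)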

SENTIMENT,TEXT
0,about to file tax
0,ahh i hate dogs
1,My paycheck came in today
1,lot to do before chi this weekend
1,lol love food

Code

import pandas as pd
import keras
import keras.preprocessing.text as kpt
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import json
import numpy as np


# Load in DS
df = pd.read_csv('./train.csv')
print(df.head())

#Create sequence
vocabulary_size = 1000
tokenizer = Tokenizer(num_words= vocabulary_size, split=' ')
tokenizer.fit_on_texts(df['TEXT'].values)
X_train = tokenizer.texts_to_sequences(df['TEXT'].values)

#Pad Sequence
X_train = pad_sequences(X_train)
print(X_train)

#Get Sentiment
y_train = df['SENTIMENT'].tolist()


#create model
max_words = 24
from keras import Sequential
from keras.layers import Embedding, LSTM, Dense, Dropout
embedding_size=32
model=Sequential()
model.add(Embedding(vocabulary_size, embedding_size, input_length=max_words))
model.add(LSTM(100))
model.add(Dense(1, activation='sigmoid'))
print(model.summary())

model.compile(loss='binary_crossentropy', 
             optimizer='adam', 
             metrics=['accuracy'])

batch_size = 64
num_epochs = 3
X_valid, y_valid = X_train[:batch_size], y_train[:batch_size]
X_train2, y_train2 = X_train[batch_size:], y_train[batch_size:]
model.fit(X_train2, y_train2,
    validation_data=(X_valid, y_valid),
    batch_size=batch_size,
    epochs=num_epochs)

Output

Using TensorFlow backend.
   SENTIMENT                                               TEXT
0          0  aww that be bummer You shoulda get david carr ...
1          0  be upset that he can not update his facebook b...
2          0  I dive many time for the ball manage to save t...
3          0      my whole body feel itchy and like its on fire
4          0  no it be not behave at all be mad why be here ...
[[  0   0   0 ...   3  10   5]
 [  0   0   0 ...  46  47  89]
 [  0   0   0 ...  29   9  96]
 ...
 [  0   0   0 ...  30 309 310]
 [  0   0   0 ...   0   0  72]
 [  0   0   0 ...  33 312 313]]
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_1 (Embedding)      (None, 24, 32)            32000
_________________________________________________________________
lstm_1 (LSTM)                (None, 100)               53200
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 101
=================================================================
Total params: 85,301
Trainable params: 85,301
Non-trainable params: 0
_________________________________________________________________
None
Traceback (most recent call last):
  File "mcve.py", line 50, in <module>
    epochs=num_epochs)
  File "/home/dv/tensorflow/venv/lib/python3.6/site-packages/keras/engine/training.py", line 950, in fit
    batch_size=batch_size)
  File "/home/dv/tensorflow/venv/lib/python3.6/site-packages/keras/engine/training.py", line 787, in _standardize_user_data
    exception_prefix='target')
  File "/home/dv/tensorflow/venv/lib/python3.6/site-packages/keras/engine/training_utils.py", line 79, in standardize_input_data
    if isinstance(data[0], list):
IndexError: list index out of range

Jupyter notebook error

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-25-184505b70981> in <module>()
     20 model.fit(X_train2, y_train2,
     21     batch_size=batch_size,
---> 22     epochs=num_epochs)
     23 

~/tensorflow/venv/lib/python3.6/site-packages/keras/engine/training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, **kwargs)
    948             sample_weight=sample_weight,
    949             class_weight=class_weight,
--> 950             batch_size=batch_size)
    951         # Prepare validation data.
    952         do_validation = False

~/tensorflow/venv/lib/python3.6/site-packages/keras/engine/training.py in _standardize_user_data(self, x, y, sample_weight, class_weight, check_array_lengths, batch_size)
    785                 feed_output_shapes,
    786                 check_batch_axis=False,  # Don't enforce the batch size.
--> 787                 exception_prefix='target')
    788 
    789             # Generate sample-wise weight values given the `sample_weight` and

~/tensorflow/venv/lib/python3.6/site-packages/keras/engine/training_utils.py in standardize_input_data(data, names, shapes, check_batch_axis, exception_prefix)
     77                              'for each key in: ' + str(names))
     78     elif isinstance(data, list):
---> 79         if isinstance(data[0], list):
     80             data = [np.asarray(d) for d in data]
     81         elif len(names) == 1 and isinstance(data[0], (float, int)):

IndexError: list index out of range

[Question comments]:

Tags: python-3.x pandas numpy tensorflow keras

[Solution 1]:

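Edit
My previous suggestion was wrong. I checked your code and ran it, and it works without errors for me. I then looked at the source code of the standardize_input_data function. There is a line that checks the data argument: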

def standardize_input_data(data,
                           names,
                           shapes=None,
                           check_batch_axis=True,
                           exception_prefix=''):
    """Normalizes inputs and targets provided by users.
    Users may pass data as a list of arrays, dictionary of arrays,
    or as a single array. We normalize this to an ordered list of
    arrays (same order as `names`), while checking that the provided
    arrays have shapes that match the network's expectations.
    # Arguments
        data: User-provided input data (polymorphic).
        ...

At line 79:

 elif isinstance(data, list):
        if isinstance(data[0], list):
            ...

So, if this error occurs, the input data appears to be a list, but a list of length zero.
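To make that concrete, here is a minimal sketch that mirrors only the two quoted lines. The function names check_like_keras and try_check are hypothetical stand-ins written for illustration, not Keras code, and the labels are the five values from the sample CSV in the question.

```python
def check_like_keras(data):
    """Hypothetical stand-in that mirrors only the two quoted lines."""
    if isinstance(data, list):
        # Indexing an empty list raises IndexError before isinstance runs.
        if isinstance(data[0], list):
            return "list of lists"
        return "flat list"
    return "not a list"


def try_check(data):
    """Run the check and report an IndexError as text instead of crashing."""
    try:
        return check_like_keras(data)
    except IndexError as exc:
        return "IndexError: " + str(exc)


labels = [0, 0, 1, 1, 1]       # SENTIMENT column of the sample CSV
print(try_check(labels))       # flat list
print(try_check(labels[64:]))  # IndexError: list index out of range
```

Slicing a Python list past its end does not fail; it silently returns an empty list, so the error only shows up later at the data[0] lookup.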

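The standardize_input_data function is called from Model.fit(...) by way of Model._standardize_user_data(...). Through this call chain, the data parameter receives the value of the x argument of Model.fit(x, y, ...). So my guess is that the problem lies in the type or contents of X_train2 or X_valid. Could you post the contents of X_train2 and X_valid in addition to X_train?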
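One way to inspect the split before calling fit is sketched below. The helper split_train_valid is hypothetical (it is not part of Keras), and the toy rows are made up for illustration. The sketch converts the labels to a NumPy array as well, because the traceback in the question shows exception_prefix='target', which may mean the targets are worth checking along with the inputs.

```python
import numpy as np


def split_train_valid(X, y, n_valid):
    """Hypothetical helper: hold out the first n_valid rows for validation.

    Raises ValueError instead of silently returning an empty split.
    """
    X = np.asarray(X)
    y = np.asarray(y)  # arrays skip the list branch quoted above
    if len(X) != len(y):
        raise ValueError("X has %d rows but y has %d" % (len(X), len(y)))
    if not 0 < n_valid < len(X):
        raise ValueError(
            "n_valid=%d leaves an empty split for %d rows" % (n_valid, len(X))
        )
    return (X[n_valid:], y[n_valid:]), (X[:n_valid], y[:n_valid])


# Toy stand-ins for the padded sequences and the SENTIMENT column.
X_toy = [[0, 3], [0, 5], [1, 2], [4, 4], [2, 1]]
y_toy = [0, 0, 1, 1, 1]

(train_X, train_y), (valid_X, valid_y) = split_train_valid(X_toy, y_toy, 2)
print(train_X.shape, train_y.shape)  # (3, 2) (3,)
print(valid_X.shape, valid_y.shape)  # (2, 2) (2,)
```

With only five rows, split_train_valid(X_toy, y_toy, 64) raises a ValueError that names the row count, which is easier to act on than the IndexError raised deep inside fit.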

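Old, incorrect suggestion
I guess you should increase the vocabulary size by one to handle out-of-vocabulary tokens.
That is, change the initialization of the Embedding layer: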

model.add(Embedding(vocabulary_size + 1, embedding_size, input_length=max_words))

According to the docs, "input_dim: int > 0. Size of the vocabulary, i.e. maximum integer index + 1".
You can check the maximum value with max(X_train) (edited).
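For a padded 2-D NumPy array, the largest index can be read with the array's own max method. The sketch below compares it with the Embedding input_dim; index_range and fits_embedding are hypothetical helper names, and the sample rows are illustrative values in the style of the printed output, not the real data.

```python
import numpy as np


def index_range(padded):
    """Return (smallest, largest) token index in a padded sequence array."""
    padded = np.asarray(padded)
    return int(padded.min()), int(padded.max())


def fits_embedding(padded, input_dim):
    """True if every index is a valid row of an Embedding with input_dim rows."""
    lowest, highest = index_range(padded)
    return lowest >= 0 and highest < input_dim


# Illustrative rows only; 0 is the padding value.
sample = np.array([[0, 0, 3, 10, 5],
                   [0, 46, 47, 89, 2],
                   [0, 0, 30, 309, 310]])

print(index_range(sample))           # (0, 310)
print(fits_embedding(sample, 1000))  # True
print(fits_embedding(sample, 310))   # False
```

As far as I know, Tokenizer(num_words=n) only emits indices below n, so with vocabulary_size = 1000 the check would be expected to pass without the + 1, which fits with this suggestion not being the cause.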

Hope this helps!

[Comments]:

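Good suggestion, I hadn't thought of that. I think my problem has to do with tokenizing the data. I ran the code in a Jupyter notebook and added its output above. Maybe that will help.
Running max(df['TEXT'].values) returns the string yes what do you want to do, which is the longest string in the set.
@ex080 I'll take a closer look tomorrow.
@ex080 I just improved my answer; I need some additional data.
I'll provide the data as soon as I get home today.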