Answers: TensorFlow sample sizes for epochs

【Question title】: tensorflow sample sizes for epochs
【Posted】: 2020-09-24 06:00:26
【Question description】:

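I have a dataset of 50,000 items: reviews and their sentiment (positive or negative).

I assign 90% to the training set and the rest to the test set.

My question: if I run 5 epochs on the training set, shouldn't each epoch load 9000 samples rather than 1407?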

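# imports assumed from the names used below (not shown in the original post);
# preprocessing is assumed to be sklearn.preprocessing
import numpy as np
from sklearn import preprocessing
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# preprocessed_reviews and sentiment are loaded earlier (not shown)

# to divide train & test sets
test_sample_size = int(0.1*len(preprocessed_reviews))  # 10% of data as the test set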

# for sentiment
sentiment = [1 if x=='positive' else 0 for x in sentiment]

# separate data to train & test sets
X_test, X_train = (np.array(preprocessed_reviews[:test_sample_size]), 
                   np.array(preprocessed_reviews[test_sample_size:])
)

y_test, y_train = (np.array(sentiment[:test_sample_size]), 
                   np.array(sentiment[test_sample_size:])
)

tokenizer = Tokenizer(oov_token='<OOV>')  # for the unknown words
tokenizer.fit_on_texts(X_train)

vocab_count = len(tokenizer.word_index) + 1  # +1 is for padding


training_sequences = tokenizer.texts_to_sequences(X_train)  # tokenizer.word_index to see indexes
training_padded = pad_sequences(training_sequences, padding='post')  # pad sequences with 0s 
training_normal = preprocessing.normalize(training_padded)  # normalize data

testing_sequences = tokenizer.texts_to_sequences(X_test)  
testing_padded = pad_sequences(testing_sequences, padding='post')  
testing_normal = preprocessing.normalize(testing_padded)  


input_length = len(training_normal[0])  # length of all sequences


# build a model
model = keras.models.Sequential()
model.add(keras.layers.Embedding(input_dim=vocab_count, output_dim=2, input_length=input_length))
model.add(keras.layers.GlobalAveragePooling1D())
model.add(keras.layers.Dense(63, activation='relu'))  # hidden layer
model.add(keras.layers.Dense(16, activation='relu'))  # hidden layer
model.add(keras.layers.Dense(1, activation='sigmoid'))  # output layer

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

model.fit(training_normal, y_train, epochs=5)

Output:

Epoch 1/5
1407/1407 [==============================] - 9s 7ms/step - loss: 0.6932 - accuracy: 0.4992
Epoch 2/5
1407/1407 [==============================] - 9s 6ms/step - loss: 0.6932 - accuracy: 0.5030
Epoch 3/5
1407/1407 [==============================] - 9s 6ms/step - loss: 0.6932 - accuracy: 0.4987
Epoch 4/5
1407/1407 [==============================] - 9s 6ms/step - loss: 0.6932 - accuracy: 0.5024
Epoch 5/5
1407/1407 [==============================] - 9s 6ms/step - loss: 0.6932 - accuracy: 0.5020

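Sorry, I'm new to TensorFlow. I hope someone can help!

【Question discussion】:

Each epoch trains on the entire training dataset, which is 45,000 samples in your case. With a batch size of 32 (the default), 1407 is the total number of batches per epoch.
@GirishDattatrayHegde Oh, I see. Thanks for the explanation.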

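Tags: python tensorflow machine-learning keras

【Solution 1】:

If you have about 50,000 data points split 90/10 (train/test), then about 45,000 are training data and the remaining 5,000 are used for testing. When you call the fit method, Keras uses a default batch_size of 32 (you can always change it to 64, 128, ...). So the number 1407 is not a sample count: it tells you the model needs 1407 forward and backward passes, one per batch, to complete one full epoch, because 45,000 / 32 = 1406.25, which rounds up to 1407 batches (1407 * 32 ≈ 45,000).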
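
The arithmetic can be checked with a minimal sketch. The helper name `steps_per_epoch` is hypothetical (it is not a Keras function), and the sketch assumes the final, smaller batch is kept rather than dropped, which is what the 1407 in the progress bar suggests.

```python
import math


def steps_per_epoch(num_samples: int, batch_size: int = 32) -> int:
    """Number of batches needed to see every sample once (hypothetical helper)."""
    # The last batch may hold fewer than batch_size samples, so round up.
    return math.ceil(num_samples / batch_size)


num_train = 45_000  # 90% of the 50,000 reviews

for batch_size in (32, 64, 128):
    print(batch_size, steps_per_epoch(num_train, batch_size))
```

With the default batch size of 32 this gives 1407, matching the progress bar. Passing a larger value, for example `model.fit(training_normal, y_train, epochs=5, batch_size=64)`, should lower the step count per epoch accordingly, while each epoch still covers all 45,000 training samples.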

【Discussion】: