TensorFlow: How to build an efficient NLP pipeline using tf Dataset

【Question title】: Tensorflow : How to build efficient NLP pipeline using tf Dataset
【Posted】: 2020-03-29 08:50:25
【Question description】:

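I am working with TensorFlow and trying to build an efficient training and inference pipeline with the tf.data API, but I am running into errors.

For example, a simple RNN network looks like this: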

import tensorflow as tf
import numpy as np
# hyper parameters
vocab_size          = 20
word_embedding_dim  = 100
batch_size          = 2



tf.reset_default_graph()
# placeholders
sentences             = tf.placeholder(tf.int32, [None,None], name='sentences')
targets               = tf.placeholder(tf.int32, [None, None], name='labels' )
keep_prob             = tf.placeholder(tf.float32, [1,], name='dropout')
keep_prob             = tf.cast(keep_prob.shape[0],tf.float32)


# embedding
word_embedding         = tf.get_variable(name='word_embedding_',
                                             shape=[vocab_size, word_embedding_dim],
                                             dtype=tf.float32,
                                             initializer = tf.contrib.layers.xavier_initializer())
embedding_lookup = tf.nn.embedding_lookup(word_embedding, sentences)



#  bilstm model
with tf.variable_scope('forward'):
    fr_cell = tf.contrib.rnn.LSTMCell(num_units = 15)
    dropout_fr = tf.contrib.rnn.DropoutWrapper(fr_cell, output_keep_prob = 1. - keep_prob)

with tf.variable_scope('backward'):
    bw_cell = tf.contrib.rnn.LSTMCell(num_units = 15)
    dropout_bw = tf.contrib.rnn.DropoutWrapper(bw_cell, output_keep_prob = 1. - keep_prob)

with tf.variable_scope('bi-lstm') as scope:
    model,last_state = tf.nn.bidirectional_dynamic_rnn(dropout_fr,
                                                       dropout_bw,
                                                       inputs=embedding_lookup,
                                                       dtype=tf.float32)

logits             = tf.transpose(tf.concat(model, 2), [1, 0, 2])[-1]
linear_projection  = tf.layers.dense(logits, 5)



#loss
cross_entropy = tf.nn.sigmoid_cross_entropy_with_logits(logits = linear_projection, labels = tf.cast(targets,tf.float32))
loss = tf.reduce_mean(tf.reduce_sum(cross_entropy, axis=1))
optimizer = tf.train.AdamOptimizer(learning_rate = 0.001).minimize(loss)

And the dummy data is:

dummy_data    = [[1,3,4,5,5,12],[1,3,4,4,12,0],[12,4,12,0,0,0],[1,3,4,5,5,12]]
dummpy_labels = [[1,0,0,0,0],[0,1,0,1,0],[1,0,0,0,0],[0,1,0,1,0]]

This is how I currently train the network, slicing the data and padding the sequences by hand:

#  pad and slice 


def get_train_data(batch_size, slice_no):

    batch_data_j = np.array(dummy_data[slice_no * batch_size:(slice_no + 1) * batch_size])
    batch_labels = np.array(dummpy_labels[slice_no * batch_size:(slice_no + 1) * batch_size])

    # length of the longest sequence in this batch
    max_sequence = max(list(map(len, batch_data_j)))

    # right-pad every shorter sequence with zeros up to max_sequence
    padded_sequence = [i + [0] * (max_sequence - len(i)) if len(i) < max_sequence else i for i in batch_data_j]
    return padded_sequence, batch_labels




# dropout 0.2 during training and 0.0 during inference
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    iteration = len(dummy_data) // batch_size

    for iter_ in range(iteration):

        sentences_, labels_ = get_train_data(2,iter_)
        loss_,_ = sess.run([loss,optimizer], feed_dict= {sentences: sentences_, targets: labels_, keep_prob : 0.2})
        print(loss_)

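Now I want to use a tf.data pipeline to build efficient training and inference. I went through some tutorials but could not find a good answer.

I tried using tf.data like this: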

dataset = tf.data.Dataset.from_tensor_slices((sentences,targets,keep_prob))
dataset = dataset.batch(batch_size)

iterator = tf.data.Iterator.from_structure(dataset.output_types)
iterator_initializer_ = iterator.make_initializer(dataset, name='initializer')
sentec, labels, drop_  = iterator.get_next()



def initialize_iterator(sess, sentences_, labels_, drops_):

        feed_dict = {sentences: sentences_, targets: labels_, keep_prob : [np.random.randint(0,2,[1,]).astype(np.float32)]}

        return sess.run(iterator_initializer_, feed_dict = feed_dict)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    iteration = len(dummy_data) // batch_size

    for iter_ in range(iteration):
        initialize_iterator(sess, dummy_data, dummpy_labels, [0.0])
        los, _ = sess.run([loss, optimizer])
        print(los)

But I get an error.

So what is an efficient pipeline for training an RNN with the Dataset API, including encoding and padding the sequences?

【Comments】:

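What do you mean by "I get an error"? And why do there seem to be two questions here: do you want to make your code more efficient (whatever that means), or fix the error?
@AlexanderCécile I know the old feed_dict approach; I am looking for an efficient way to train this model with the tf.data API.
Sorry? I'm not sure how that relates to my comment.
@AlexanderCécile I mean that I don't just want to fix that error. I am looking for an efficient way to train this network with the tf.data API; the old approach is the one I showed in the example.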

Tags: python tensorflow keras deep-learning tensorflow-datasets

【Solution 1】:

I suggest preparing your data in another format, for example CSV or TFRecord. You can then use tf.data.experimental.make_csv_dataset or tf.data.TFRecordDataset to read the data directly into a tf.data object.

The TensorFlow documentation has tutorials on this topic.

If you use TFRecord, one example (a tensorflow.Example protocol buffer in text format) looks like this:

features {
  feature {
    key: "sentences"
    value {
      int64_list {
        value: "0"
        value: "55"
        value: "128"
      }
    }
  }
  feature {
    key: "targets"
    value {
      int64_list {
        value: "10001"
        value: "10002"
      }
    }
  }
}
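
As a rough illustration of how such a record might be produced, the sketch below builds a tf.train.Example with the same two feature keys and serializes it. The helper name make_example is hypothetical, and the values come from the question's dummy data rather than from the record shown above.

```python
import tensorflow as tf


def make_example(sentence_ids, target_ids):
    """Pack one sentence and its labels into a tf.train.Example (hypothetical helper)."""
    feature = {
        # Variable-length list of token ids.
        "sentences": tf.train.Feature(
            int64_list=tf.train.Int64List(value=sentence_ids)),
        # Label vector for the same sentence.
        "targets": tf.train.Feature(
            int64_list=tf.train.Int64List(value=target_ids)),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))


# First sentence and label vector from the question's dummy data.
example = make_example([1, 3, 4, 5, 5, 12], [1, 0, 0, 0, 0])

# Bytes ready to be written with tf.io.TFRecordWriter.
serialized = example.SerializeToString()

# Printing the message shows the protocol buffer text format.
print(example)
```

Each serialized string becomes one record in the TFRecord file, so sentences of different lengths can be stored without padding them first.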

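I would treat keep_prob and batch_size as model configuration parameters. You do not need to embed them in the examples.

Once your training and evaluation examples have been created and serialized in the TFRecord format above, building the data pipeline is straightforward.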

dataset = tf.data.TFRecordDataset(filenames = [your_tf_record_file])
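
The line above only yields serialized records, so they still need to be parsed and padded. One possible way to do that is sketched below, assuming TensorFlow 2.x with eager execution. The temporary file path, the helper names, and the fixed label length of 5 (taken from the question's dummy labels) are assumptions, and padded_batch is a suggestion for the padding step the question asks about, not something the answer prescribes.

```python
import os
import tempfile

import tensorflow as tf

# Dummy data from the question, with the trailing zero padding removed so
# that the sentences really have different lengths.
sentences = [[1, 3, 4, 5, 5, 12], [1, 3, 4, 4, 12], [12, 4, 12], [1, 3, 4, 5, 5, 12]]
labels = [[1, 0, 0, 0, 0], [0, 1, 0, 1, 0], [1, 0, 0, 0, 0], [0, 1, 0, 1, 0]]


def int64_feature(values):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=values))


# Write a small TFRecord file to a temporary location (hypothetical path).
path = os.path.join(tempfile.mkdtemp(), "train.tfrecord")
with tf.io.TFRecordWriter(path) as writer:
    for s, t in zip(sentences, labels):
        example = tf.train.Example(features=tf.train.Features(feature={
            "sentences": int64_feature(s),
            "targets": int64_feature(t),
        }))
        writer.write(example.SerializeToString())

# "sentences" is variable length, "targets" always holds 5 values.
feature_spec = {
    "sentences": tf.io.VarLenFeature(tf.int64),
    "targets": tf.io.FixedLenFeature([5], tf.int64),
}


def parse(serialized):
    parsed = tf.io.parse_single_example(serialized, feature_spec)
    # VarLenFeature parses to a SparseTensor; convert it to a dense 1-D tensor.
    tokens = tf.sparse.to_dense(parsed["sentences"])
    return tokens, parsed["targets"]


dataset = (
    tf.data.TFRecordDataset([path])
    .map(parse)
    # Pad each batch with zeros up to the longest sentence in that batch.
    .padded_batch(2, padded_shapes=([None], [5]))
)

batches = [(tokens.numpy(), targets.numpy()) for tokens, targets in dataset]
for tokens, targets in batches:
    print(tokens.shape, targets.shape)
```

Because padding happens per batch, each batch is only as wide as its longest sentence, which replaces the manual slicing and padding in get_train_data.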

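On top of the dataset you can build your input_fn, and then continue with the tf.estimator API or the Keras API. Tutorials for both are available in the TensorFlow documentation.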
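
A minimal sketch of the Keras route follows, assuming TensorFlow 2.4 or later. To stay self-contained it feeds the question's dummy data through tf.data.Dataset.from_generator instead of a TFRecord file. The layer choices are an approximation of the question's bidirectional LSTM, not a line-by-line port, and the constant names are hypothetical. The dropout rate and batch size are plain configuration values here, not part of the examples.

```python
import tensorflow as tf

# Model configuration (values from the question); not stored in the examples.
VOCAB_SIZE, EMBEDDING_DIM, NUM_UNITS, NUM_LABELS = 20, 100, 15, 5
BATCH_SIZE, DROPOUT_RATE = 2, 0.2

# Dummy data from the question without the trailing zero padding.
sentences = [[1, 3, 4, 5, 5, 12], [1, 3, 4, 4, 12], [12, 4, 12], [1, 3, 4, 5, 5, 12]]
labels = [[1, 0, 0, 0, 0], [0, 1, 0, 1, 0], [1, 0, 0, 0, 0], [0, 1, 0, 1, 0]]


def input_fn(batch_size):
    """Return a dataset of (padded token ids, float label vectors)."""
    def generator():
        for tokens, targets in zip(sentences, labels):
            yield tokens, targets

    dataset = tf.data.Dataset.from_generator(
        generator,
        output_signature=(
            tf.TensorSpec(shape=[None], dtype=tf.int32),
            tf.TensorSpec(shape=[NUM_LABELS], dtype=tf.int32),
        ),
    )
    dataset = dataset.map(lambda x, y: (x, tf.cast(y, tf.float32)))
    # Zero-pad every batch to the length of its longest sentence.
    return dataset.padded_batch(batch_size, padded_shapes=([None], [NUM_LABELS]))


model = tf.keras.Sequential([
    # mask_zero=True makes the LSTM skip the zero padding added by padded_batch.
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBEDDING_DIM, mask_zero=True),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(NUM_UNITS)),
    # Dropout is only active during training; Keras disables it at inference.
    tf.keras.layers.Dropout(DROPOUT_RATE),
    tf.keras.layers.Dense(NUM_LABELS),
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
)

history = model.fit(input_fn(BATCH_SIZE), epochs=1, verbose=0)
predictions = model.predict(input_fn(BATCH_SIZE), verbose=0)
print(history.history["loss"], predictions.shape)
```

The same input_fn can be pointed at a tf.data.TFRecordDataset once the records exist; only the source of the dataset changes, while the padding and the model stay the same.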

【Comments】: