Padding and Masking a Batch Dataset: Answers

【Question title】: Padding and Masking a batch dataset
【Posted】: 2020-08-06 07:50:45
【Question】:

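When representing several natural-language strings, the number of characters in each string may differ. The result can then be stored in a tf.RaggedTensor, where the length of the innermost dimension depends on the number of characters in each string: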

import tensorflow as tf

rtensor = tf.ragged.constant([
    [1, 2],
    [3, 4, 5],
    [6]
])
rtensor
# <tf.RaggedTensor [[1, 2], [3, 4, 5], [6]]>

In turn, calling the to_tensor method converts the RaggedTensor into a regular tf.Tensor, padding the shorter rows in the process:

batch_size=3
max_length=8
tensor = rtensor.to_tensor(default_value=0, shape=(batch_size, max_length))
#<tf.Tensor: shape=(3, 8), dtype=int32, numpy=
#array([[1, 2, 0, 0, 0, 0, 0, 0],
#       [3, 4, 5, 0, 0, 0, 0, 0],
#       [6, 0, 0, 0, 0, 0, 0, 0]], dtype=int32)>

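Now, is there a way to generate an additional tensor that shows which entries are original data and which are padding? For the example above, it would be: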

<tf.Tensor: shape=(3, 8), dtype=int32, numpy=
array([[1, 1, 0, 0, 0, 0, 0, 0],
       [1, 1, 1, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0, 0, 0]], dtype=int32)>

【Question comments】:

tf.math.not_equal(tensor, 0)?

Tags: numpy tensorflow tensorflow2.0 tensorflow-datasets

【Solution 1】:

As thusv89 suggested, you can simply check for nonzero values. It can be as simple as casting to boolean and back:

import tensorflow as tf

rtensor = tf.ragged.constant([[1, 2],
                              [3, 4, 5],
                              [6]])
batch_size = 3
max_length = 8
tensor = rtensor.to_tensor(default_value=0, shape=(batch_size, max_length))
mask = tf.dtypes.cast(tf.dtypes.cast(tensor, tf.bool), tensor.dtype)
print(mask.numpy())
# [[1 1 0 0 0 0 0 0]
#  [1 1 1 0 0 0 0 0]
#  [1 0 0 0 0 0 0 0]]

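The only possible drawback is that your original data may itself contain 0 values, which would then be masked out as if they were padding. If you know your data is always non-negative, you can pad with a different default value, such as -1, when converting to a tensor: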

tensor = rtensor.to_tensor(default_value=-1, shape=(batch_size, max_length))
mask = tf.dtypes.cast(tensor >= 0, tensor.dtype)
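To make the drawback concrete, here is a minimal sketch comparing the two masks on data that contains a genuine 0. The input values and the shape (3, 4) are made up for illustration, and the names nonzero_mask and sentinel_mask are hypothetical:

```python
import tensorflow as tf

# Hypothetical data: the first row contains a real 0 that is not padding.
rtensor = tf.ragged.constant([[1, 0],
                              [3, 4, 5],
                              [6]])

# Pad with 0, then mask by "nonzero": the real 0 is treated as padding.
zero_padded = rtensor.to_tensor(default_value=0, shape=(3, 4))
nonzero_mask = tf.cast(tf.cast(zero_padded, tf.bool), zero_padded.dtype)

# Pad with -1, then mask by ">= 0": the real 0 is kept.
# This only works because the data is assumed to be non-negative.
sentinel_padded = rtensor.to_tensor(default_value=-1, shape=(3, 4))
sentinel_mask = tf.cast(sentinel_padded >= 0, sentinel_padded.dtype)

print(nonzero_mask.numpy())
print(sentinel_mask.numpy())
```

The two masks differ only at row 0, column 1, which is where the real 0 sits.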

However, if you want the mask to work for any values your data may contain, you can instead apply tf.ones_like to the ragged tensor:

rtensor_ones = tf.ones_like(rtensor)
mask = rtensor_ones.to_tensor(default_value=0, shape=(batch_size, max_length))

This way mask will be 1 exactly where rtensor has a value, regardless of what that value is.
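Another value-independent option, which is not part of the original answer, is to build the mask from the row lengths with tf.sequence_mask. The following is a sketch that assumes a 2-D ragged tensor with a single ragged dimension, as in the question:

```python
import tensorflow as tf

rtensor = tf.ragged.constant([[1, 2],
                              [3, 4, 5],
                              [6]])
max_length = 8

# Number of real values in each row.
lengths = rtensor.row_lengths()

# Mark the first `length` positions of each row with 1 and the rest with 0.
mask = tf.sequence_mask(lengths, maxlen=max_length, dtype=tf.int32)
print(mask.numpy())
```

Because the mask is derived from the lengths alone, it never inspects the stored values, so zeros or negative numbers in the data do not affect it.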

【Discussion】:

Thanks @jdehesa and @thosehv89. This solution is very useful, since I have a large dataset to tokenize.