Tensorflow TextVectorization layer: How to define a custom standardize function?

【Question title】: Tensorflow TextVectorization layer: How to define a custom standardize function?
【Posted】: 2021-03-30 21:47:54
【Question description】:

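I am trying to create a custom standardization function for the TextVectorization layer in Tensorflow 2.1, but I seem to be getting something fundamentally wrong.

I have the following text data: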

import numpy as np

my_array = np.array([
    "I am a sentence.",
    "I am another sentence!"
])

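My goal

I basically want to lowercase the text, remove punctuation, and remove some words. The default standardization of the TextVectorization layer (LOWER_AND_STRIP_PUNCTUATION) lowercases the text and strips punctuation, but it cannot remove whole words.

(If you know a way of doing this, that would of course also be much appreciated as an alternative to my approach described below.)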

A working example

First, here is an example of a working custom standardization function from the tensorflow documentation:

import re
import string
import tensorflow as tf

def custom_standardization(input_data):
  lowercase = tf.strings.lower(input_data)
  stripped_html = tf.strings.regex_replace(lowercase, '<br />', ' ')
  return tf.strings.regex_replace(stripped_html,
                                  '[%s]' % re.escape(string.punctuation), '')

When I pass it to TextVectorization and adapt the layer to my_array, it works fine:

vectorize_layer_1 = TextVectorization(
    output_mode='int',
    standardize=custom_standardization,
    )

vectorize_layer_1.adapt(my_array)  # no error

The custom function that does not work

However, my own custom standardization keeps raising an error. Here is my code:

import numpy as np
import tensorflow as tf
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
from tensorflow.keras.preprocessing.text import text_to_word_sequence

my_array = np.array([
    "I am a sentence",
    "I am another sentence"
])

# these words should be removed
bad_words = ["i", "am"]

def remove_words(tokens):
    return [word for word in tokens if word not in bad_words]

# this is the normalization function I want to apply
def my_custom_normalize(my_array):
    tokenized = [text_to_word_sequence(str(sentence)) for sentence in my_array]
    clean_texts = [" ".join(remove_words(tokenized_string))
                     for tokenized_string
                     in tokenized]
    clean_tensor = tf.convert_to_tensor(clean_texts)
    return clean_tensor
    
my_vectorize_layer = TextVectorization(
    output_mode='int',
    standardize=my_custom_normalize,
    )

However, as soon as I try to adapt the layer, I keep running into an error:

my_vectorize_layer.adapt(my_array)  # raises error

InvalidArgumentError: Tried to squeeze dim index 1 for tensor with 1 dimensions. [Op:Squeeze]

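I really don't understand why. The documentation says:

When using a custom callable for standardize, the data received by the callable will be exactly as passed to this layer. The callable should return a tensor of the same shape as the input.

I thought this might be the cause of the error. But when I look at the shapes, everything seems to be correct: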

my_result = my_custom_normalize(my_array)
my_result.shape  # returns TensorShape([2])
working_result = custom_standardization(my_array)
working_result.shape # returns TensorShape([2])

I am really lost. What am I doing wrong? Am I not supposed to use list comprehensions?
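One possible explanation, offered as an assumption about the layer's internals rather than a confirmed diagnosis: during adapt the layer may hand the callable a tensor of shape (batch, 1) instead of the (batch,) array that was checked by hand above. A function that iterates over the rows and rebuilds a flat Python list would then return shape (batch,), and a later squeeze on axis 1 would fail. The sketch below reproduces that mismatch with plain tensors only; the names flattening_normalize and shape_preserving_normalize and the (2, 1) input are illustrative, not taken from the post.

```python
import tensorflow as tf

# Hypothetical stand-in for what the layer might pass to `standardize`:
# the same two sentences, but with an extra trailing axis, shape (2, 1).
batch = tf.constant([["I am a sentence"], ["I am another sentence"]])

def flattening_normalize(data):
    # Looping over rows and rebuilding a flat list drops the trailing axis.
    rows = [row.numpy()[0].decode("utf-8").lower() for row in data]
    return tf.convert_to_tensor(rows)

def shape_preserving_normalize(data):
    # Element-wise tf.strings ops keep whatever shape they are given.
    return tf.strings.lower(data)

flat = flattening_normalize(batch)
kept = shape_preserving_normalize(batch)
print(batch.shape, flat.shape, kept.shape)

# Squeezing axis 1 only works if that axis still exists.
try:
    tf.squeeze(flat, axis=1)
    squeeze_failed = False
except (tf.errors.InvalidArgumentError, ValueError):
    squeeze_failed = True
print("squeeze on flattened output failed:", squeeze_failed)
```

If this is indeed what happens, checking the function against my_array directly would not reveal the problem, because that array has no second axis to lose.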

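【Comments】:

I think it would be better to remove the words using a regular expression with tf.strings.regex_replace. I have never used TextVectorization in keras, but looking at the source code, this seems to be the line that causes the error: github.com/tensorflow/tensorflow/blob/v2.1.0/tensorflow/python/…
Just a thought. Try replacing the body of my_custom_normalize with: return tf.strings.regex_replace(my_array, "(?i)i|am", "")
In practice there will be many more words to remove (>200), so putting them all into a regular expression would be cumbersome...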
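A long word list does not have to be typed into the pattern by hand. As a minimal sketch (the helper name remove_bad_words and the sample sentences are illustrative assumptions, not part of the original post), the pattern can be assembled from the Python list and applied with tf.strings.regex_replace, which keeps the shape of its input:

```python
import re
import string
import tensorflow as tf

# Hypothetical word list; in practice it could hold a few hundred entries.
bad_words = ["i", "am"]

# Build one alternation such as r"\b(?:i|am)\b". re.escape guards against
# regex metacharacters in the words; \b keeps "am" from matching inside "same".
bad_words_pattern = r"\b(?:%s)\b" % "|".join(re.escape(w) for w in bad_words)
punctuation_pattern = "[%s]" % re.escape(string.punctuation)

def remove_bad_words(input_data):
    text = tf.strings.lower(input_data)
    text = tf.strings.regex_replace(text, punctuation_pattern, "")
    text = tf.strings.regex_replace(text, bad_words_pattern, "")
    # Collapse the gaps left behind by removed words and trim the ends.
    text = tf.strings.regex_replace(text, r"\s+", " ")
    return tf.strings.strip(text)

cleaned = remove_bad_words(
    tf.constant(["I am a sentence.", "I am the same sentence!"])
)
print(cleaned.numpy())
```

A function like this could then be passed as standardize=remove_bad_words. Whether a single alternation of several hundred words performs acceptably has not been measured here.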

Tags: python numpy tensorflow keras text

【Solution 1】:

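import re
import string
import tensorflow as tf

# stopwords_eng is assumed to be a list of lowercase stop words defined elsewhere.
def custom_standardization(input_data):
    lowercase = tf.strings.lower(input_data)
    stripped_html = tf.strings.regex_replace(lowercase, "<br />", " ")
    # Replace numbers (including decimals and exponents) with a space.
    stripped_html = tf.strings.regex_replace(stripped_html, r'\d+(?:\.\d*)?(?:[eE][+-]?\d+)?', ' ')
    # Replace @mentions with a space.
    stripped_html = tf.strings.regex_replace(stripped_html, r'@([A-Za-z0-9_]+)', ' ')
    # Remove each stop word that is surrounded by single spaces.
    for i in stopwords_eng:
        stripped_html = tf.strings.regex_replace(stripped_html, f' {i} ', " ")
    return tf.strings.regex_replace(
        stripped_html, "[%s]" % re.escape(string.punctuation), ""
    )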
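One caveat with the space-delimited pattern: a stop word at the very start or end of a string, or one followed directly by punctuation, has no space on one side and is left in place. With a stop-word list of ["i", "am"], for example, "I am a sentence." would come out as "i a sentence". The following is a hedged variant of the same loop, not the answerer's code: it strips punctuation first and pads each string with spaces. The name padded_standardization and the two-word list are assumptions made for illustration.

```python
import re
import string
import tensorflow as tf

# Hypothetical stop-word list standing in for `stopwords_eng`.
stopwords_eng = ["i", "am"]

def padded_standardization(input_data):
    text = tf.strings.lower(input_data)
    # Strip punctuation first so that "am." is seen as "am".
    text = tf.strings.regex_replace(
        text, "[%s]" % re.escape(string.punctuation), ""
    )
    # Pad with spaces so that ' word ' also matches at the start and end.
    text = tf.strings.join([" ", text, " "])
    for word in stopwords_eng:
        text = tf.strings.regex_replace(text, " %s " % re.escape(word), " ")
    # Remove the padding again.
    return tf.strings.strip(text)

result = padded_standardization(
    tf.constant(["I am a sentence.", "I am another sentence!"])
)
print(result.numpy())
```

The same stop word repeated back to back (for example "i i") can still slip through, because the two matches would share a space. A word-boundary pattern using \b is one way around that.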

【Discussion】: