Forcing Tensorflow's Tokenizer to include the "next line" character

【Question title】: Forcing Tensorflow's Tokenizer to include "Next line" char
【Posted】: 2019-10-28 12:48:05
【Question description】:

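I am trying to model Persian poetry with TensorFlow. For that, I need '\n' to be one of my tokens. However, when I use the Tokenizer, the newline is not included. Is it possible to make tf.keras.preprocessing.text.Tokenizer include '\n'?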

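import tensorflow as tf

# `link` is the directory that holds hafez.txt (defined elsewhere)
data = open(link + "/hafez.txt").readlines()
data = data[2:]  # remove the first two lines
data = ''.join(data)
corpus = data.lower().split("\n")
# put '\n' back at the end of every line; a bare `c += '\n'` inside a
# for loop only rebinds the loop variable and leaves the list unchanged
corpus = [c + '\n' for c in corpus]
# build the vocabulary from the list of texts (corpus);
# word_index is a dictionary mapping each token to its index
tokenizer = tf.keras.preprocessing.text.Tokenizer()  # default filters
tokenizer.fit_on_texts(corpus)
print(tokenizer.word_index['\n'])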

Now we see that '\n' is not included:

KeyError                                  Traceback (most recent call last)
in ()
----> 1 tokenizer.word_index['\n']

KeyError: '\n'

However, I need this token later, so that my neural network can hopefully separate the words it generates into lines with '\n'.

【Question discussion】:

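Did you read the documentation you linked? __init__() takes a filters parameter, and \n is in the filters. Redefine the string without it.
Thanks. I think I need to use: filters = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t'. Can I do this with a regular expression, or is there a simpler way?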
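
On the "simpler way" asked about above: one option is to derive the new string from the default instead of retyping it. This is only a sketch; `DEFAULT_FILTERS` is a name introduced here, and its value is copied from the default documented for the `filters` argument, so check it against your installed version.

```python
# Default value of the Tokenizer's `filters` argument as documented by Keras
# (assumption: copied by hand, verify against your installed version).
DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

# Drop only the newline; every other filtered character stays filtered.
filters_without_newline = DEFAULT_FILTERS.replace('\n', '')

print(repr(filters_without_newline))
```

The resulting string can then be passed as `filters=filters_without_newline` when constructing the Tokenizer. No regular expression is needed for removing a single character.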

Tags: python tensorflow tokenize

【Solution 1】:

If you remove '\n' from the filters parameter, I think you will get what you want.

Example:

import tensorflow as tf

corpus = "it was the best of times, it was the worst of times"
# one text per word, each terminated by a newline
corpus = [c + '\n' for c in corpus.split()]

filters_ = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t'  # <= removed '\n'
my_tokenizer = tf.keras.preprocessing.text.Tokenizer(
    filters=filters_,
    char_level=True)  # every character is a token

my_tokenizer.fit_on_texts(corpus)  # the method is fit_on_texts, not fit_on_text
eol_idx = my_tokenizer.word_index['\n']

print(eol_idx)
#  1

【Discussion】:

Thanks. In my case, since I want to use \n as a token, I think I have to keep char_level=False, then simply add an OOV for each '\n' and rerun the tokenizer.
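
For the word-level case raised in this comment, one possible approach (not the method of the answer above) is to surround every '\n' with spaces before tokenizing, so that it becomes a token of its own once '\n' is no longer filtered. The sketch below is plain Python: `isolate_newlines` and `word_sequence` are hypothetical helpers, and `word_sequence` only imitates word-level splitting on a single space as I understand it, so treat its behavior as an assumption and not as the library's code.

```python
# Default Keras filter characters minus '\n' (assumed from the Keras docs).
FILTERS_NO_NEWLINE = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t'


def isolate_newlines(text):
    """Surround every newline with spaces so it can stand as its own token."""
    return text.replace('\n', ' \n ')


def word_sequence(text, filters=FILTERS_NO_NEWLINE, split=' '):
    """Hypothetical stand-in for word-level splitting: lowercase the text,
    turn each filter character into the separator, split, drop empties."""
    table = str.maketrans({ch: split for ch in filters})
    pieces = text.lower().translate(table).split(split)
    return [tok for tok in pieces if tok]


poem = "It was the best of times,\nit was the worst of times\n"
tokens = word_sequence(isolate_newlines(poem))
print(tokens)
```

Under these assumptions, passing the output of `isolate_newlines` for each text to a Tokenizer built with `filters=FILTERS_NO_NEWLINE` and `char_level=False` should give '\n' its own entry in `word_index`, but that is worth confirming on a small sample first.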