【Question title】: How to ignore characters while tokenizing Keras
【Posted】: 2019-01-14 01:10:27
【Question】:

I am trying to train and build a tokenizer with Keras. Here is the snippet of code I am using:

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense

txt1="""What makes this problem difficult is that the sequences can vary in length,
be comprised of a very large vocabulary of input symbols and may require the model 
to learn the long term context or dependencies between symbols in the input sequence."""

# txt1 is used for fitting
tk = Tokenizer(num_words=2000, lower=True, split=" ", char_level=False)
tk.fit_on_texts(txt1)

# convert text to sequences
t = tk.texts_to_sequences(txt1)

# padding to feed the sequences to the Keras model
t = pad_sequences(t, maxlen=10)

When I check which words the Tokenizer has learned, it turns out that it learned only characters, not words.

print(tk.word_index)

Output:

{'e': 1, 't': 2, 'n': 3, 'a': 4, 's': 5, 'o': 6, 'i': 7, 'r': 8, 'l': 9, 'h': 10, 'm': 11, 'c': 12, 'u': 13, 'b': 14, 'd': 15, 'y': 16, 'p': 17, 'f': 18, 'q': 19, 'v': 20, 'g': 21, 'w': 22, 'k': 23, 'x': 24}

Why are there no words?

Also, if I print t, it clearly shows that words are ignored and the text is tokenized character by character:

print(t)  

Output:

[[ 0  0  0 ...  0  0 22]
 [ 0  0  0 ...  0  0 10]
 [ 0  0  0 ...  0  0  4]
 ...
 [ 0  0  0 ...  0  0 12]
 [ 0  0  0 ...  0  0  1]
 [ 0  0  0 ...  0  0  0]]

【Comments】:

    Tags: python keras nlp tokenize


    【Solution 1】:

    I found the error. If the text is passed like this:

    txt1=["""What makes this problem difficult is that the sequences can vary in length,
    be comprised of a very large vocabulary of input symbols and may require the model 
    to learn the long term context or dependencies between symbols in the input sequence."""]
    

    with square brackets, it works fine. Here is the new output of t:

    print(t)
    
    
    [[30 31 32 33 34  5  2  1  4 35]]
    

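    This means the function expects a list of texts, not a single string. A bare string is iterated character by character, so every character is treated as its own text.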
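A minimal sketch of why the brackets matter, written in plain Python so it runs without Keras. `toy_word_index` is a hypothetical, simplified stand-in for `Tokenizer.fit_on_texts` and not the Keras implementation. It only assumes that the texts are visited with a `for text in texts` loop and that indices are assigned by descending frequency starting at 1.

```python
from collections import Counter


def toy_word_index(texts):
    """Simplified stand-in for Tokenizer.fit_on_texts (hypothetical helper)."""
    counts = Counter()
    for text in texts:  # a bare str yields one character per iteration
        counts.update(text.lower().split())
    # Most frequent token gets index 1; ties keep first-seen order.
    ranked = sorted(counts, key=lambda word: -counts[word])
    return {word: i for i, word in enumerate(ranked, start=1)}


sentence = "the model reads the input sequence"
as_string = toy_word_index(sentence)   # iterates over characters
as_list = toy_word_index([sentence])   # iterates over one document

print(as_string)
print(as_list)
```

Passing the string directly gives a vocabulary of single letters, which matches the `word_index` shown in the question. Wrapping it in a list gives a vocabulary of words.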

    【Comments】:

      【Solution 2】:

      Try this:

      from keras.preprocessing.text import Tokenizer
      # adjacent string literals inside parentheses are joined into one string
      txt1 = ('What makes this problem difficult is that the sequences can vary in length, '
              'be comprised of a very large vocabulary of input symbols and may require the model '
              'to learn the long term context or dependencies between symbols in the input sequence.')
      
      t = Tokenizer()
      t.fit_on_texts(txt1)
      # summarize what was learned
      print(t.word_counts)
      print(t.document_count)
      print(t.word_index)
      print(t.word_docs)
      

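      Copy, paste, and run it. I assume the problem is, first, the quotes around the input text (you have three quotes). Second, you don't have to call t = tk.texts_to_sequences(txt1); do this instead: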

      encoded_txt = t.texts_to_matrix(txt1, mode='count')
      print(encoded_txt)
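To make clear what a count-mode matrix contains, here is a rough plain-Python sketch. `toy_count_matrix` is a hypothetical helper, not the Keras code. It assumes one row per text, one column per word index, and column 0 left unused, which is how I understand `texts_to_matrix(..., mode='count')` to lay out its result.

```python
def toy_count_matrix(texts, word_index):
    """Hypothetical stand-in: rows are texts, columns are word indices."""
    width = len(word_index) + 1  # column 0 is reserved and stays zero
    matrix = []
    for text in texts:
        row = [0] * width
        for word in text.lower().split():
            if word in word_index:
                row[word_index[word]] += 1
        matrix.append(row)
    return matrix


docs = ["the cat sat", "the cat saw the dog"]
vocab = {"the": 1, "cat": 2, "sat": 3, "saw": 4, "dog": 5}
counts = toy_count_matrix(docs, vocab)

print(counts)
```

Note that the input here is a list of texts. With a single bare string, each character would become its own row.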
      

      Another approach is:

      from keras.preprocessing.text import text_to_word_sequence
      text = txt1
      # estimate the size of the vocabulary
      words = set(text_to_word_sequence(text))
      vocab_size = len(words)
      print(vocab_size)
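For reference, a sketch of roughly what the word splitting does, in plain Python. `toy_word_sequence` is a hypothetical approximation, not the library function. It assumes the documented defaults of `text_to_word_sequence`: lowercase the text, replace each filter character with the split character, split on a space, and drop empty tokens. The filter string below is my recollection of the Keras default and should be checked against the installed version.

```python
# Assumed Keras default filter characters (verify against your version).
ASSUMED_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def toy_word_sequence(text, filters=ASSUMED_FILTERS, split=" "):
    """Hypothetical approximation of text_to_word_sequence."""
    text = text.lower()
    # Turn every filtered character into the split character.
    text = text.translate(str.maketrans({ch: split for ch in filters}))
    return [token for token in text.split(split) if token]


sample = "The model may require the model to learn, and learn again."
tokens = toy_word_sequence(sample)
vocab_size = len(set(tokens))

print(tokens)
print(vocab_size)
```

Unlike `fit_on_texts`, this function takes a single string, which is why passing `txt1` directly works here.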
      

      【Comments】:

      • This is not what was asked for, and it doesn't give me what I need :(