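How to tokenize a list of lists of lists of strings: answers

【Question title】: How to Tokenize a list of lists of lists of strings
【Posted】: 2020-01-21 22:31:09
【Question description】:

I have a text dataset that is a list of lists of lists of strings. I need to tokenize this data so I can feed it into a classification model. I am very familiar with doing this using keras.preprocessing.text.Tokenizer, and I usually use the following code: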

data = [
    [['not'],
     ['ahead'],
     ['um let me think'],
     ['thats not very encouraging if they had a cast of thousands on the other end']],
    [['okay civil liberties tell me your position'],
     ['probably would go ahead']],
    [['oh'],
     ['it up so i dont know where you really go'],
     ['well most of my problem with this latest task'],
     ['its some i kind of dont want to put in the time to do it'],
     ['right so im saying ive got a lot of other things to do']],
]

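from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()
tokenizer.fit_on_texts(data)  # fails on the nested lists (see the traceback below)
sequences = tokenizer.texts_to_sequences(data)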

When I run this code on my data, I get the following error:

2 frames
<ipython-input-44-1da804f42cc8> in main()
     12     # tokenize and vectorize text data to prepare for embedding
     13     tokenizer = Tokenizer()
---> 14     tokenizer.fit_on_texts(new_corpus)
     15     sequences = tokenizer.texts_to_sequences(new_corpus)
     16     word_index = tokenizer.word_index

/usr/local/lib/python3.6/dist-packages/keras_preprocessing/text.py in fit_on_texts(self, texts)
    213                 if self.lower:
    214                     if isinstance(text, list):
--> 215                         text = [text_elem.lower() for text_elem in text]
    216                     else:
    217                         text = text.lower()

/usr/local/lib/python3.6/dist-packages/keras_preprocessing/text.py in <listcomp>(.0)
    213                 if self.lower:
    214                     if isinstance(text, list):
--> 215                         text = [text_elem.lower() for text_elem in text]
    216                     else:
    217                         text = text.lower()

AttributeError: 'list' object has no attribute 'lower'

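This makes sense to me: the Tokenizer expects a string but receives a list. Normally I would flatten my list structure before passing it to the Tokenizer.

However, I can't do that here, because the nested list structure is essential to my modeling.

So, can I tokenize my data while preserving the list structure? I want to treat the whole thing as my corpus and get a unique integer token for each word across all of the lists.

It should look something like this (the tokenization here was done by hand, so apologies for any typos):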

data = [
    [[0],
     [1],
     [2, 3, 4, 5],
     [6, 0, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]],
    [[20, 21, 22, 23, 3, 24, 25],
     [26, 27, 28, 29]],
    [[30],
     [31, 32, 33, 34, 35, 36, 37, 38, 39, 40],
     [41, 42, 43, 44, 45, 46, 47, 48, 49],
     [50, 51, 34, 52, 14, 35, 53, 54, 55, 56, 17, 57, 58, 59, 31],
     [60, 61, 62, 63, 64, 65, 12, 66, 14, 67, 68, 59, 31]],
]

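【Question discussion】:

The tokenizer can only tokenize a list of lists, so it is as simple as converting your list of lists of lists into a list of lists. Edit: I just read that you need to preserve the structure. Unfortunately, the Tokenizer cannot handle arbitrary structure. If you want to keep the structure, you will have to convert each word to a token manually.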
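One way to read "convert each word to a token manually" is the sketch below, which does not use Keras at all. It is only an illustration under a few assumptions: `is_leaf`, `build_vocab`, and `encode` are hypothetical helper names, ids start at 1 on the assumption that 0 should stay free for padding (as the Keras Tokenizer does), and text is split on whitespace only, which assumes punctuation-free input like the sample data.

```python
def is_leaf(node):
    """A leaf is a list that contains only strings, e.g. ['um let me think']."""
    return all(isinstance(item, str) for item in node)


def build_vocab(node, vocab=None):
    """Give each distinct word an integer id in first-seen order, starting at 1."""
    if vocab is None:
        vocab = {}
    if is_leaf(node):
        for text in node:
            for word in text.lower().split():
                vocab.setdefault(word, len(vocab) + 1)
    else:
        for child in node:
            build_vocab(child, vocab)
    return vocab


def encode(node, vocab):
    """Replace each leaf list of strings with a list of ids, keeping outer nesting."""
    if is_leaf(node):
        return [vocab[word] for text in node for word in text.lower().split()]
    return [encode(child, vocab) for child in node]


# Shortened sample in the same shape as the question's data
data = [
    [["not"], ["ahead"], ["um let me think"]],
    [["probably would go ahead"]],
]
vocab = build_vocab(data)
encoded = encode(data, vocab)
print(encoded)  # [[[1], [2], [3, 4, 5, 6]], [[7, 8, 9, 2]]]
```

Because both helpers recurse until they reach a list of strings, the same code should work for deeper or shallower nesting than the three levels in the question.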

Tags: python tensorflow keras nlp token

【Solution 1】:

You can do the following to preserve the structure while still indexing the words:

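from keras.preprocessing.text import Tokenizer

# Flatten to a plain list of strings so the Tokenizer can build its vocabulary
tok_data = [y[0] for x in data for y in x]

tokenizer = Tokenizer()
tokenizer.fit_on_texts(tok_data)

# Rebuild the outer structure, replacing each string with its integer sequence
sequences = []
for x in data:
    tmp = []
    for y in x:
        # y is a one-element list such as ['um let me think']
        tmp.append(tokenizer.texts_to_sequences(y)[0])
    sequences.append(tmp)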
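The loop above is written for exactly three levels of nesting. If the depth varies, a recursive variant along the following lines might help. This is a sketch, not part of the original answer: `iter_texts` and `map_leaves` are hypothetical helpers, and `toy_encode` is a stand-in encoder used only so the example runs without Keras.

```python
def iter_texts(node):
    """Yield every string in a nested list, depth first (useful for fitting)."""
    if isinstance(node, str):
        yield node
    else:
        for child in node:
            yield from iter_texts(child)


def map_leaves(node, encode_text):
    """Apply encode_text to every string, keeping the list nesting unchanged."""
    if isinstance(node, str):
        return encode_text(node)
    return [map_leaves(child, encode_text) for child in node]


# Stand-in encoder with a fixed toy vocabulary (assumption, for illustration only)
toy_index = {"a": 1, "b": 2, "c": 3}


def toy_encode(text):
    return [toy_index[word] for word in text.split()]


nested = [[["a b"], ["c"]], [["b"]]]
flat_texts = list(iter_texts(nested))    # ['a b', 'c', 'b']
result = map_leaves(nested, toy_encode)
print(result)  # [[[[1, 2]], [[3]]], [[[2]]]]
```

With a Keras tokenizer, one could presumably fit on `list(iter_texts(data))` and pass `lambda s: tokenizer.texts_to_sequences([s])[0]` as `encode_text`. Note that this variant replaces each string with its list of ids, so a one-element list like `['not']` becomes `[[id]]`, one level deeper than the output of the loop above.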

【Discussion】: