How do I create a Keras Embedding layer from a pre-trained word embedding dataset? Answers

【Question title】: How do I create a Keras Embedding layer from a pre-trained word embedding dataset?
【Posted】: 2018-07-18 12:38:21
【Question】:

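How do I load pre-trained word embeddings into a Keras Embedding layer?

I downloaded glove.6B.50d.txt (from the glove.6B.zip archive at https://nlp.stanford.edu/projects/glove/), but I'm not sure how to add it to a Keras Embedding layer. See: https://keras.io/layers/embeddings/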

【Question comments】:

Here is how to incorporate a GENSIM model into Keras: stackoverflow.com/a/62747179/10375049

Tags: python tensorflow keras word2vec word-embedding

【Solution 1】:

There is a great blog post describing how to create an embedding layer from pre-trained word vector embeddings:

https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html

The code for the post above can be found here:

https://github.com/keras-team/keras/blob/master/examples/pretrained_word_embeddings.py

Another good blog post on the same topic: https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/

【Comments】:

I'm under the impression that passing weights to keras.layers.Embedding in this form is deprecated, if you check this (keras.io/layers/embeddings) and this (github.com/tensorflow/tensorflow/issues/14392).

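【Solution 2】:

You need to pass an embeddingMatrix to the Embedding layer, as follows:

Embedding(vocabLen, embDim, weights=[embeddingMatrix], trainable=isTrainable)

vocabLen: the number of tokens in your vocabulary
embDim: the dimensionality of the embedding vectors (50 in your example)
embeddingMatrix: the embedding matrix built from glove.6B.50d.txt
isTrainable: whether you want the embeddings to be trainable or the layer frozen

glove.6B.50d.txt is a list of whitespace-separated values: a word token followed by 50 embedding values, e.g. the 0.418 0.24968 -0.41242 ...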
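
To make the file format concrete, here is a minimal sketch of parsing one such line into a token and a NumPy vector. The helper name parse_glove_line is hypothetical, and the sample line is shortened to three values instead of 50.

```python
import numpy as np


def parse_glove_line(line):
    """Split one GloVe text line into (token, vector). Hypothetical helper."""
    parts = line.rstrip().split(" ")
    token = parts[0]                                   # first field is the word
    vector = np.asarray(parts[1:], dtype=np.float32)   # remaining fields are floats
    return token, vector


# Shortened sample line (a real glove.6B.50d.txt line carries 50 values).
token, vector = parse_glove_line("the 0.418 0.24968 -0.41242\n")
print(token, vector.shape)
```

The length of the parsed vector is what ends up as embDim, so it should be the same for every line of a given file.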

Creating a pretrainedEmbeddingLayer from the GloVe file:

import numpy as np
from keras.layers import Embedding
from keras.models import Sequential

# Prepare Glove File
def readGloveFile(gloveFile):
    # GloVe text files are UTF-8; be explicit so the platform default is not used
    with open(gloveFile, 'r', encoding='utf-8') as f:
        wordToGlove = {}  # map from a token (word) to a Glove embedding vector
        wordToIndex = {}  # map from a token to an index
        indexToWord = {}  # map from an index to a token

        for line in f:
            record = line.strip().split()
            token = record[0]  # take the token (word) from the text line
            wordToGlove[token] = np.array(record[1:], dtype=np.float64)  # associate the Glove embedding vector with that token (word)

        tokens = sorted(wordToGlove.keys())
        for idx, tok in enumerate(tokens):
            kerasIdx = idx + 1  # 0 is reserved for masking in Keras
            wordToIndex[tok] = kerasIdx  # associate an index with a token (word)
            indexToWord[kerasIdx] = tok  # associate a token (word) with an index. Note: inverse of dictionary above

    return wordToIndex, indexToWord, wordToGlove

# Create Pretrained Keras Embedding Layer
def createPretrainedEmbeddingLayer(wordToGlove, wordToIndex, isTrainable):
    vocabLen = len(wordToIndex) + 1  # adding 1 to account for masking
    embDim = next(iter(wordToGlove.values())).shape[0]  # works with any glove dimensions (e.g. 50)

    embeddingMatrix = np.zeros((vocabLen, embDim))  # initialize with zeros; row 0 stays all zeros
    for word, index in wordToIndex.items():
        embeddingMatrix[index, :] = wordToGlove[word]  # create embedding: word index to Glove word embedding

    embeddingLayer = Embedding(vocabLen, embDim, weights=[embeddingMatrix], trainable=isTrainable)
    return embeddingLayer

# usage
wordToIndex, indexToWord, wordToGlove = readGloveFile("/path/to/glove.6B.50d.txt")
pretrainedEmbeddingLayer = createPretrainedEmbeddingLayer(wordToGlove, wordToIndex, False)
model = Sequential()
model.add(pretrainedEmbeddingLayer)
...
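
The layer built this way expects integer indices as input, not raw text. As a rough sketch of that step, the snippet below maps a sentence to indices and performs the row lookup directly in NumPy, which is the same lookup an Embedding layer does internally. The helper sentenceToIndices, the toy dictionary, the 2-dimensional vectors, and the choice to map unknown words to 0 are all assumptions made for illustration, not part of the answer's code.

```python
import numpy as np

# Toy stand-ins for the objects built from the GloVe file (assumed values).
wordToIndex = {"cat": 1, "sat": 2, "the": 3}
embeddingMatrix = np.zeros((len(wordToIndex) + 1, 2))  # row 0 is the padding row
embeddingMatrix[1] = [0.1, 0.2]
embeddingMatrix[2] = [0.3, 0.4]
embeddingMatrix[3] = [0.5, 0.6]


def sentenceToIndices(sentence, wordToIndex, maxLen):
    """Map a whitespace-tokenized sentence to a fixed-length list of indices.

    Unknown words and padding both map to 0 (an assumption of this sketch).
    """
    indices = [wordToIndex.get(tok, 0) for tok in sentence.lower().split()]
    indices = indices[:maxLen]                       # truncate long sentences
    return indices + [0] * (maxLen - len(indices))   # pad short ones with 0


indices = sentenceToIndices("The cat sat down", wordToIndex, maxLen=6)
vectors = embeddingMatrix[indices]  # one embedding row per position
print(indices, vectors.shape)
```

Because index 0 is left as an all-zero row, padded positions and out-of-vocabulary words contribute zero vectors here; a separate index for unknown words would be a reasonable alternative.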

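【Comments】:

Can I use word embeddings as the vector representation of words in the output layer?
I'm under the impression that passing weights to keras.layers.Embedding in this form is deprecated, if you check this (keras.io/layers/embeddings) and this (github.com/tensorflow/tensorflow/issues/14392).
Damn, things change too fast! I think the latest versions should use embeddings_initializer=Constant(embeddingMatrix).
Note that for some versions of Keras there is a particularly nasty bug when passing Constant to embeddings_initializer; see here for details.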

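【Solution 3】:

A few years ago I wrote a utility package called embfile for working with "embedding files" (though I only published it in 2020). The use case I wanted to cover is creating a pre-trained embedding matrix to initialize an Embedding layer, and doing so by loading only the word vectors I need, as fast as possible.

It supports several formats:

.txt (with or without a "header row")
.bin, the Google Word2Vec format
.vvm, a custom format I use (it is just a TAR file with the vocabulary, vectors, and metadata in separate files, so the vocabulary can be read entirely in a fraction of a second and the vectors can be accessed randomly).

The package is extensively documented and tested. There are also examples that show how to use it with Keras.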

import embfile

with embfile.open(EMBEDDING_FILE_PATH) as f:
    emb_matrix, word2index, missing_words = embfile.build_matrix(
        f,
        words=vocab,     # this could also be a word2index dictionary
        start_index=1,   # leave the first row as zeros
    )

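This function also handles the initialization of words that are not in the file's vocabulary. By default, it fits a normal distribution on the vectors it found and uses it to generate new random vectors (this is what AllenNLP does). I'm not sure this feature is still relevant: nowadays you can generate embeddings for unknown words with FastText or similar.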
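
The out-of-vocabulary initialization described above can be sketched roughly as follows. This is not embfile's actual implementation: the function name build_matrix_with_oov, its parameters, and the use of a single scalar mean and standard deviation over all found values are assumptions made for illustration.

```python
import numpy as np


def build_matrix_with_oov(vocab, word_to_vector, dim, seed=0):
    """Build an embedding matrix; sample missing words from a fitted normal.

    Hypothetical helper. Assumes at least one word of `vocab` is present
    in `word_to_vector`, otherwise there is nothing to fit on.
    """
    rng = np.random.default_rng(seed)
    matrix = np.zeros((len(vocab) + 1, dim), dtype=np.float32)  # row 0 stays zero

    found = [w for w in vocab if w in word_to_vector]
    missing = [w for w in vocab if w not in word_to_vector]

    # Fit a normal distribution (scalar mean/std) on the vectors that were found.
    found_vectors = np.stack([word_to_vector[w] for w in found])
    mean, std = found_vectors.mean(), found_vectors.std()

    word2index = {}
    for i, word in enumerate(vocab, start=1):
        word2index[word] = i
        if word in word_to_vector:
            matrix[i] = word_to_vector[word]
        else:
            matrix[i] = rng.normal(mean, std, size=dim)  # random vector for OOV word
    return matrix, word2index, missing


pretrained = {"cat": np.array([0.1, 0.2, 0.3]), "dog": np.array([0.4, 0.5, 0.6])}
matrix, word2index, missing = build_matrix_with_oov(
    ["cat", "dog", "zyzzyva"], pretrained, dim=3
)
print(matrix.shape, missing)
```

Sampling from the fitted distribution keeps the random vectors on roughly the same scale as the pre-trained ones, which is the motivation for this kind of initialization.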

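Keep in mind that txt and bin files are essentially sequential files and require a full scan (unless you find all the words you are looking for before the end). That is why I use vvm files, which offer random access to the vectors. Indexing the sequential files would be enough to solve the problem, but embfile doesn't have this feature. You can, however, convert sequential files to vvm (which is similar to creating an index and packing everything into a single file).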
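
The "stop before the end" case can be illustrated with a small sketch that scans a text-format embedding file line by line and breaks as soon as every requested word has been found. The function load_selected_vectors is hypothetical (it is not part of embfile), and an in-memory io.StringIO stands in for a real file.

```python
import io

import numpy as np


def load_selected_vectors(lines, wanted):
    """Scan lines sequentially, keeping only vectors for the wanted words.

    Returns the vectors found and the number of lines actually read.
    Hypothetical helper for illustration.
    """
    remaining = set(wanted)
    vectors = {}
    lines_read = 0
    if not remaining:
        return vectors, lines_read
    for line in lines:
        lines_read += 1
        token, *values = line.rstrip().split(" ")
        if token in remaining:
            vectors[token] = np.asarray(values, dtype=np.float32)
            remaining.discard(token)
            if not remaining:   # everything found: no need to scan further
                break
    return vectors, lines_read


# Tiny stand-in for a text embedding file (2-dimensional vectors).
fake_file = io.StringIO("the 0.1 0.2\ncat 0.3 0.4\ndog 0.5 0.6\nsat 0.7 0.8\n")
vectors, lines_read = load_selected_vectors(fake_file, ["the", "cat"])
print(sorted(vectors), lines_read)
```

If even one requested word is absent from the file, the loop still has to read every line, which is why random access or an index helps for large files.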

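【Comments】:

【Solution 4】:

I was looking for something similar. I found this blog post, which answers the question. It properly explains how to create the embedding_matrix and pass it to the Embedding() layer.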

GloVe Embeddings for deep learning in Keras.

【Comments】: