Introduction

https://machinelearningmastery.com/what-are-word-embeddings/

https://www.zhihu.com/question/32275069

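Word embedding is the collective name for a set of language modeling and representation learning techniques in natural language processing (NLP). Conceptually, it embeds a high-dimensional space, with one dimension per word in the vocabulary, into a continuous vector space of much lower dimension, so that each word or phrase is mapped to a vector of real numbers.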

One of the benefits of using dense and low-dimensional vectors is computational: the majority of neural network toolkits do not play well with very high-dimensional, sparse vectors. … The main benefit of the dense representations is generalization power: if we believe some features may provide similar clues, it is worthwhile to provide a representation that is able to capture these similarities.

Algorithms

1. Embedding Layer

It requires that the document text be cleaned and that each word be one-hot encoded. The size of the vector space is specified as part of the model, such as 50, 100, or 300 dimensions.

This approach of learning an embedding layer requires a lot of training data and can be slow, but will learn an embedding both targeted to the specific text data and the NLP task.
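The relationship between the one-hot input and the embedding layer can be sketched in a few lines. This is a minimal illustration, not a trainable layer: the toy vocabulary, the dimension of 3, and the helper names `one_hot`, `matvec`, and `embed` are all assumptions made for the example. It shows that multiplying a one-hot vector by the weight matrix simply selects one row, so only that row takes part in the computation for a given word.

```python
import random

def one_hot(index, size):
    """Return a one-hot list with 1.0 at position `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec

def matvec(vector, matrix):
    """Multiply a row vector of length V by a V x D matrix."""
    dim = len(matrix[0])
    return [sum(vector[i] * matrix[i][j] for i in range(len(vector)))
            for j in range(dim)]

# Toy vocabulary and size (assumptions); real models use e.g. 50, 100, or 300.
vocab = ["boy", "girl", "king", "queen"]
word_to_index = {word: i for i, word in enumerate(vocab)}
embedding_dim = 3

# Randomly initialised weights; training would adjust these values.
rng = random.Random(0)
weights = [[rng.uniform(-1, 1) for _ in range(embedding_dim)] for _ in vocab]

def embed(word):
    """Look up the row of the weight matrix that belongs to `word`."""
    return weights[word_to_index[word]]

# The full matrix product and the direct row lookup give the same vector.
via_product = matvec(one_hot(word_to_index["girl"], len(vocab)), weights)
via_lookup = embed("girl")
```

Because the product reduces to a row lookup, practical implementations usually store word indices rather than explicit one-hot vectors.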

Every word needs its own one-hot vector, which is computationally expensive, and the relationships between words are not represented.
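A small sketch makes the missing-relationship point concrete. The vocabulary and the helper names `one_hot` and `cosine` are assumptions for illustration. Any two distinct one-hot vectors are orthogonal, so their cosine similarity is zero no matter how related the words are.

```python
import math

def one_hot(index, size):
    """Return a one-hot list with 1.0 at position `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

vocab = ["boy", "girl", "king", "queen"]  # toy vocabulary (assumption)
vectors = {word: one_hot(i, len(vocab)) for i, word in enumerate(vocab)}

# Related and unrelated pairs look identical: both similarities are zero.
sim_boy_girl = cosine(vectors["boy"], vectors["girl"])
sim_boy_queen = cosine(vectors["boy"], vectors["queen"])
```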


As the following picture shows, the word ‘girl’ does not contribute to the training of any other word in the first layer.

[Figure: Word Embedding]

2. Word2Vec

paper: Linguistic Regularities in Continuous Space Word Representations, 2013.

It is good at capturing syntactic and semantic regularities in language.
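The kind of regularity meant here is often illustrated with vector arithmetic on word pairs. The sketch below uses hand-made two-dimensional vectors, not trained embeddings: the values, the axis meanings, and the helper names `cosine` and `analogy` are all assumptions chosen so that the arithmetic is easy to follow.

```python
import math

# Hand-made vectors, NOT learned from data (assumption for illustration).
# Axis 0 loosely stands for "royalty", axis 1 for "gender".
toy_vectors = {
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.0],
}

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c, vectors):
    """Return the word closest to vec(b) - vec(a) + vec(c), excluding a, b, c."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [word for word in vectors if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine(vectors[word], target))

# "man" is to "king" as "woman" is to ...?
result = analogy("man", "king", "woman", toy_vectors)
```

With these toy values the offset from "man" to "king" is the same as the offset from "woman" to "queen", which is the pattern that learned embeddings are reported to approximate.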

Two different learning models were introduced that can be used as part of the Word2Vec approach to learn the word embedding.

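  • Continuous Bag-of-Words, or CBOW model. # Learns the embedding by predicting a word from its known surrounding words.
  • Continuous Skip-Gram Model. # Learns the embedding by predicting the surrounding words from the current word.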
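The difference between the two models is easiest to see in the training examples each one consumes. The following sketch only builds those examples and does not train anything; the function names, the sample sentence, and the window size are assumptions for illustration.

```python
def context_windows(tokens, window):
    """Yield (center, context_words) for every position in `tokens`."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield center, left + right

def cbow_examples(tokens, window=2):
    """CBOW: the surrounding words are the input, the center word is the target."""
    return [(context, center)
            for center, context in context_windows(tokens, window)]

def skipgram_examples(tokens, window=2):
    """Skip-gram: the center word is the input, each surrounding word is a target."""
    return [(center, ctx)
            for center, context in context_windows(tokens, window)
            for ctx in context]

tokens = "the quick brown fox jumps".split()
cbow = cbow_examples(tokens, window=1)
skipgram = skipgram_examples(tokens, window=1)

# cbow[1]      is (["the", "brown"], "quick")
# skipgram[:3] is [("the", "quick"), ("quick", "the"), ("quick", "brown")]
```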

3. GloVe

paper: GloVe: Global Vectors for Word Representation, 2014.

It combines global statistics (e.g. Latent Semantic Analysis (LSA)) with local context learning (word2vec), which makes it more effective.

GloVe stands for Global Vectors for Word Representation. It is an extension to word2vec and can learn word vectors more efficiently.

Classical vector space model representations of words were developed using matrix factorization techniques such as Latent Semantic Analysis (LSA) that do a good job of using global text statistics but are not as good as the learned methods like word2vec at capturing meaning and demonstrating it on tasks like calculating analogies (e.g. the King and Queen example above).

GloVe is an approach that marries the global statistics of matrix factorization techniques like LSA with the local context-based learning in word2vec.

Rather than using a window to define local context, GloVe constructs an explicit word-context or word co-occurrence matrix using statistics across the whole text corpus. The result is a learning model that may result in generally better word embeddings.
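A word co-occurrence matrix of this kind can be sketched with a counter over the whole corpus. This is only the counting step, not GloVe itself, and the training objective is not reproduced. How "co-occurrence" is defined is an assumption here: the sketch counts pairs of words that appear within a fixed distance of each other in the same sentence. The function name and the three-sentence corpus are also assumptions.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count ordered word pairs that appear within `window` tokens of each other."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

# Toy corpus (assumption); a real corpus would have millions of words.
corpus = [
    "ice is cold",
    "steam is hot",
    "ice is solid",
]
counts = cooccurrence_counts(corpus, window=2)

# Counts are aggregated over all sentences, so they reflect corpus-wide statistics:
# ("ice", "is") occurs twice, ("ice", "hot") never occurs.
```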

How to use the word embedding

1. Learn an Embedding

You may choose to learn a word embedding for your problem.

This will require a large amount of text data to ensure that useful embeddings are learned, such as millions or billions of words.

You have two main options when training your word embedding:

  1. Learn it Standalone, where a model is trained to learn the embedding, which is saved and used as a part of another model for your task later. This is a good approach if you would like to use the same embedding in multiple models.
  2. Learn Jointly, where the embedding is learned as part of a large task-specific model. This is a good approach if you only intend to use the embedding on one task.

2. Reuse an Embedding

It’s common for researchers to use pre-trained word embeddings. For example, both word2vec and GloVe word embeddings are available for free download.

These can be used on your project instead of training your own embeddings from scratch.

You can either use them directly as they are, or update them further on top of the pre-trained values.
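Reusing an embedding usually starts with reading the downloaded vectors into memory. The sketch below assumes a plain-text layout with one word per line followed by its space-separated values. An in-memory string stands in for the downloaded file, its numbers are made up for illustration, and the function name `load_text_vectors` is hypothetical.

```python
import io

def load_text_vectors(lines):
    """Parse lines of the form 'word v1 v2 ... vn' into a dict of float lists."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

# Stand-in for a downloaded file; the values are NOT real pre-trained vectors.
fake_file = io.StringIO(
    "king 0.5 0.1 -0.3\n"
    "queen 0.4 0.2 -0.2\n"
)
pretrained = load_text_vectors(fake_file)

# Build the embedding matrix for the task vocabulary.
# Words missing from the pre-trained set get a zero vector (one possible choice).
vocab = ["king", "queen", "zebra"]
dim = 3
embedding_matrix = [pretrained.get(word, [0.0] * dim) for word in vocab]
```

Using the vectors directly corresponds to keeping `embedding_matrix` fixed while the rest of the model trains. Updating them corresponds to treating the matrix as the initial value of a trainable embedding layer.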

Articles

Papers

Projects

Word2Vec algorithm

GloVe: Global Vectors for Word Representation (https://nlp.stanford.edu/projects/glove/)
