Tutorial on Word Vectors Using GloVe and Word2Vec

2018-05-04 10:02:53

 

Some important reference pages first:

Reference page: https://github.com/IliaGavrilov/NeuralMachineTranslationBidirectionalLSTM/blob/master/1_Bidirectional_LSTM_Eng_to_French.ipynb

GloVe project page: https://nlp.stanford.edu/projects/glove/

Word2Vec project page: https://code.google.com/archive/p/word2vec/

Pre-trained word2vec model: https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing

Gensim tutorial: https://radimrehurek.com/gensim/models/word2vec.html

 

=================================== 

=====    For GloVe

===================================

1. Download one of the pre-trained models from the GloVe project page and unzip the files.

1 Wikipedia 2014 + Gigaword 5 (6B tokens, 400K vocab, uncased, 50d, 100d, 200d, & 300d vectors, 822 MB download): glove.6B.zip
2 Common Crawl (42B tokens, 1.9M vocab, uncased, 300d vectors, 1.75 GB download): glove.42B.300d.zip
3 Common Crawl (840B tokens, 2.2M vocab, cased, 300d vectors, 2.03 GB download): glove.840B.300d.zip
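The unzipped files are plain text: each line holds one token followed by its vector components, separated by single spaces. A minimal sketch of parsing one such line (the sample line below is made up for illustration, not copied from the real files):

```python
import numpy as np

def parse_glove_line(line):
    """Split one GloVe text line into (word, vector)."""
    parts = line.rstrip().split(' ')
    word = parts[0]
    vector = np.array(parts[1:], dtype=np.float32)
    return word, vector

# A made-up 4-dimensional example; real glove.6B files are 50/100/200/300-d.
word, vec = parse_glove_line('king -0.49 0.32 0.05 0.20\n')
print(word, vec.shape)
```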

 

2. Install the needed packages: 

numpy and bcolz (pickle, re, and collections are part of the Python standard library and need no installation) 

 

3. Run the following demo to test the results (it extracts the feature vector of a single given word). 

Code: 

import pickle

import bcolz
import numpy as np

# with open('./glove.840B.300d.txt', 'r', encoding="utf8") as f:
with open('./glove.6B.200d.txt', 'r', encoding='utf8') as f:
    lines = [line.split() for line in f]

print('==>> begin to load Glove pre-trained models.')
glove_words = [elem[0] for elem in lines]
# Yes: the dict comprehension {elem: idx for ...} builds the same mapping as
# setting glove_words_idx[elem] = idx inside a loop.
glove_words_idx = {elem: idx for idx, elem in enumerate(glove_words)}
glove_vecs = np.stack([np.array(elem[1:], dtype=np.float32) for elem in lines])

print('==>> save into .pkl files.')
pickle.dump(glove_words, open('./glove.6B.200d.txt' + '_glove_words.pkl', 'wb'))
pickle.dump(glove_words_idx, open('./glove.6B.200d.txt' + '_glove_words_idx.pkl', 'wb'))

## save the vector array with bcolz (compressed, on-disk column storage).
def save_array(fname, arr):
    c = bcolz.carray(arr, rootdir=fname, mode='w')
    c.flush()

save_array('./glove.6B.200d.txt' + '_glove_vecs' + '.dat', glove_vecs)

def load_glove(loc):
    return (bcolz.open(loc + '_glove_vecs.dat')[:],
            pickle.load(open(loc + '_glove_words.pkl', 'rb')),
            pickle.load(open(loc + '_glove_words_idx.pkl', 'rb')))


###############################################
print('==>> Loading the glove.6B.200d.txt files.')
en_vecs, en_wv_word, en_wv_idx = load_glove('./glove.6B.200d.txt')
en_w2v = {w: en_vecs[en_wv_idx[w]] for w in en_wv_word}
n_en_vec, dim_en_vec = en_vecs.shape

print('==>> shown one demo: "King"')
demo_vector = en_w2v['king']
print(demo_vector)
print("==>> Done !")

Results: 

wangxiao@AHU$ python tutorial_Glove_word2vec.py 
==>> begin to load Glove pre-trained models.
==>> save into .pkl files.
==>> Loading the glove.6B.200d.txt files.
==>> shown one demo: "King" 
[-0.49346 -0.14768 0.32166001 0.056899 0.052572 0.20192 -0.13506 -0.030793 0.15614 -0.23004 -0.66376001 -0.27316001 0.10391 0.57334 -0.032355 -0.32765999 -0.27160001 0.32918999
0.41305 -0.18085 1.51670003 2.16490006 -0.10278 0.098019 -0.018946 0.027292 -0.79479998 0.36631 -0.33151001 0.28839999 0.10436 -0.19166 0.27326 -0.17519 -0.14985999 -0.072333 
-0.54370999 -0.29728001 0.081491 -0.42673001 -0.36406001 -0.52034998 0.18455 0.44121 -0.32196 0.39172 0.11952 0.36978999 0.29229 -0.42954001 0.46653 -0.067243 0.31215999 -0.17216 
0.48874 0.28029999 -0.17577 -0.35100999 0.020792 0.15974 0.21927001 -0.32499 0.086022 0.38927001 -0.65638 -0.67400998 -0.41896001 1.27090001 0.20857 0.28314999 0.58238 -0.14944001 
0.3989 0.52680999 0.35714 -0.39100999 -0.55372 -0.56642002 -0.15762 -0.48004001 0.40448001 0.057518 -1.01569998 0.21754999 0.073296 0.15237001 -0.38361999 -0.75308001 -0.0060254 -0.26232001 
-0.54101998 -0.34347001 0.11113 0.47685 -0.73229998 0.77596998 0.015216 -0.66327 -0.21144 -0.42964 -0.72689998 -0.067968 0.50601 0.039817 -0.27584001 -0.34794 -0.0474 0.50734001 
-0.30777001 0.11594 -0.19211 0.3107 -0.60075003 0.22044 -0.36265001 -0.59442002 -1.20459998 0.10619 -0.60277998 0.21573 -0.35361999 0.55473 0.58094001 0.077259 1.0776 -0.1867 
-1.51680005 0.32418001 0.83332998 0.17366 1.12320006 0.10863 0.55888999 0.30799001 0.084318 -0.43178001 -0.042287 -0.054615 0.054712 -0.80914003 -0.24429999 -0.076909 0.55216002 -0.71895999 
0.83319002 0.020735 0.020472 -0.40279001 -0.28874001 0.23758 0.12576 -0.15165 -0.69419998 -0.25174001 0.29591 0.40290001 -1.0618 0.19847 -0.63463002 -0.70842999 0.067943 0.57366002  
0.041122 0.17452 0.19430999 -0.28641 -1.13629997 0.45116001 -0.066518 0.82615 -0.45451999 -0.85652 0.18105 -0.24187 0.20152999 0.72298002 0.17415 -0.87327999 0.69814998 0.024706 
0.26174 -0.0087155 -0.39348999 0.13801 -0.39298999 -0.23057 -0.22611 -0.14407 0.010511 -0.47389001 -0.15645 0.28601 -0.21772 -0.49535 0.022209 -0.23575 -0.22469001 -0.011578 0.52867001 -0.062309 ]
==>> Done !
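Once a word-to-vector dictionary like en_w2v above is loaded, the vectors support similarity queries and the classic "king - man + woman ≈ queen" analogy. The sketch below illustrates the math with a tiny made-up vocabulary (the 3-d vectors are invented so the analogy works exactly; real GloVe vectors only approximate it), so it runs without the GloVe files:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest(w2v, query, exclude=()):
    """Return the vocabulary word whose vector is closest to `query`."""
    best_word, best_sim = None, -2.0
    for word, vec in w2v.items():
        if word in exclude:
            continue
        sim = cosine(query, vec)
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

# Toy 3-d vectors, invented for this illustration.
w2v = {
    'king':  np.array([0.9, 0.8, 0.1], dtype=np.float32),
    'queen': np.array([0.9, 0.1, 0.8], dtype=np.float32),
    'man':   np.array([0.1, 0.8, 0.1], dtype=np.float32),
    'woman': np.array([0.1, 0.1, 0.8], dtype=np.float32),
    'apple': np.array([0.0, 0.0, 1.0], dtype=np.float32),
}

# king - man + woman should land near queen.
query = w2v['king'] - w2v['man'] + w2v['woman']
print(nearest(w2v, query, exclude=('king', 'man', 'woman')))
```

With real GloVe vectors, the same `nearest` lookup over the full 400K vocabulary returns the analogy answer approximately rather than exactly, and excluding the query words matters because they are often the closest vectors.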
