1 SVD Based Methods
1.1 Word-Document Matrix
1.2 Window-Based Co-occurrence Matrix
In this method, we count the number of times each word appears inside a window of a particular size around the word of interest. We calculate this count for all the words in the corpus.
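The counting procedure above can be sketched as follows. The toy corpus and window size here are illustrative assumptions, not taken from the notes:

```python
import numpy as np

def cooccurrence_matrix(corpus, window=1):
    """Count how often each word appears within `window` words of each other word."""
    vocab = sorted({w for sent in corpus for w in sent})
    index = {w: i for i, w in enumerate(vocab)}
    M = np.zeros((len(vocab), len(vocab)), dtype=np.int64)
    for sent in corpus:
        for i, word in enumerate(sent):
            # look at neighbors within `window` positions, skipping the word itself
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if i != j:
                    M[index[word], index[sent[j]]] += 1
    return vocab, M

# made-up two-sentence corpus
vocab, M = cooccurrence_matrix([["i", "like", "deep", "learning"],
                                ["i", "like", "nlp"]], window=1)
```

Because the window is symmetric, the resulting matrix is symmetric; the entry for ("i", "like") is 2, since that pair co-occurs in both sentences.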
1.3 Advantages: Both of these methods give us word vectors that are more than sufficient to encode semantic and syntactic information.
1.4 Shortcomings:
★ The dimensions of the matrix change very often (new words are added very frequently and the corpus changes in size)
★ The matrix is extremely sparse since most words do not co-occur
★ The matrix is very high dimensional in general
★ Quadratic cost to train (perform SVD)
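To make the SVD step concrete, here is a minimal sketch of reducing a co-occurrence matrix to k-dimensional word vectors; the small matrix M and the choice k = 2 are made-up examples:

```python
import numpy as np

# tiny made-up co-occurrence matrix for a 3-word vocabulary
M = np.array([[0., 2., 1.],
              [2., 0., 1.],
              [1., 1., 0.]])

U, s, Vt = np.linalg.svd(M)      # full SVD: M = U @ diag(s) @ Vt
k = 2
word_vectors = U[:, :k] * s[:k]  # keep only the top-k singular directions
```

Each row of `word_vectors` is then a k-dimensional embedding of one vocabulary word. Computing this SVD is the quadratic-cost step noted above.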
2 Iteration Based Methods
2.1 CBOW Model
▴ Key idea: predicting a center word from the surrounding context
▴ Unknowns: two matrices, V ∈ R^(n×|V|) and U ∈ R^(|V|×n)
▴ Notation for CBOW Model:
w_i: word i from vocabulary V
V ∈ R^(n×|V|): input word matrix
v_i: the input vector representation of word w_i
U ∈ R^(|V|×n): output word matrix
u_i: the output vector representation of word w_i
▴ Steps:
We generate the one-hot word vectors (x^(c−m), …, x^(c−1), x^(c+1), …, x^(c+m)) for the input context of size m.
We get the embedded word vectors for the context: v_(c−m) = V x^(c−m), v_(c−m+1) = V x^(c−m+1), …, v_(c+m) = V x^(c+m).
Average these vectors to get v̂ = (v_(c−m) + v_(c−m+1) + … + v_(c+m)) / 2m.
Generate a score vector z = U v̂.
Turn the scores into probabilities ŷ = softmax(z).
We desire our generated probabilities, ŷ, to match the true probabilities, y, which also happens to be the one-hot vector of the actual word.
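The CBOW forward pass above can be sketched in numpy. The vocabulary size |V|, embedding dimension n, and the context word indices here are made-up examples, and the matrices are randomly initialized rather than trained:

```python
import numpy as np

rng = np.random.default_rng(0)
V_size, n, m = 10, 4, 2              # |V|, embedding dim, context size m
V = rng.normal(size=(n, V_size))     # input word matrix,  n x |V|
U = rng.normal(size=(V_size, n))     # output word matrix, |V| x n

context = [1, 2, 4, 5]               # indices of the 2m context words (illustrative)
# multiplying V by a one-hot vector just selects a column, so we index directly;
# mean over the 2m columns is the average (v_{c-m} + ... + v_{c+m}) / 2m
v_hat = V[:, context].mean(axis=1)
z = U @ v_hat                        # score vector over the vocabulary
y_hat = np.exp(z) / np.exp(z).sum()  # softmax: probabilities over |V|
```

Training would then adjust V and U so that `y_hat` places high probability on the true center word.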
2.2 Skip-Gram Model
▴ Key idea: predicting the surrounding context words given a center word
▴ Steps:
We generate the one-hot input vector x for the center word.
We get the embedded word vector for the center word: v_c = V x.
Since there is no averaging, just set v̂ = v_c.
Generate 2m score vectors, u_(c−m), …, u_(c−1), u_(c+1), …, u_(c+m), using u = U v_c (since there is only one center word, these score vectors are all identical).
Turn each of the scores into probabilities: y = softmax(u).
We desire our generated probability vectors to match the true probabilities y^(c−m), …, y^(c−1), y^(c+1), …, y^(c+m), the one-hot vectors of the actual output.
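The skip-gram forward pass can be sketched the same way; as in the CBOW sketch, the sizes, the center-word index, and the random matrices are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
V_size, n = 10, 4                    # |V| and embedding dim (made-up)
V = rng.normal(size=(n, V_size))     # input word matrix,  n x |V|
U = rng.normal(size=(V_size, n))     # output word matrix, |V| x n

c = 3                                # index of the center word (illustrative)
v_c = V[:, c]                        # v̂ = v_c; no averaging in skip-gram
z = U @ v_c                          # one score vector, shared by all 2m predictions
y_hat = np.exp(z) / np.exp(z).sum()  # softmax over the vocabulary
```

Each of the 2m context-word predictions compares this same `y_hat` against a different one-hot target vector.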