1 SVD Based Methods
1.1 Word-Document Matrix
1.2 Window-Based Co-occurrence Matrix
In this method, we count the number of times each word appears inside a window of a particular size around the word of interest. We calculate this count for all the words in the corpus.
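The counting procedure above can be sketched as follows. The toy corpus and window size here are illustrative assumptions, not taken from the notes:

```python
import numpy as np

def cooccurrence_matrix(corpus, window=1):
    """Count how often each word appears within `window` words of each other word."""
    vocab = sorted({w for sent in corpus for w in sent})
    index = {w: i for i, w in enumerate(vocab)}
    M = np.zeros((len(vocab), len(vocab)), dtype=np.int64)
    for sent in corpus:
        for i, word in enumerate(sent):
            # look at neighbors within `window` positions, skipping the word itself
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if i != j:
                    M[index[word], index[sent[j]]] += 1
    return vocab, M

# made-up two-sentence corpus
vocab, M = cooccurrence_matrix([["i", "like", "deep", "learning"],
                                ["i", "like", "nlp"]], window=1)
```

Because the window is symmetric, the resulting matrix is symmetric; the entry for ("i", "like") is 2, since that pair co-occurs in both sentences.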
1.3 Advantages: Both of these methods give us word vectors that are more than sufficient to encode semantic and syntactic information.
1.4 Shortcomings:
★ The dimensions of the matrix change very often (new words are added very frequently and the corpus changes in size)
★ The matrix is extremely sparse since most words do not co-occur
★ The matrix is very high dimensional in general
★ Quadratic cost to train (perform SVD)
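To make the SVD step concrete, here is a minimal sketch of reducing a co-occurrence matrix to k-dimensional word vectors; the small matrix M and the choice k = 2 are made-up examples:

```python
import numpy as np

# tiny made-up co-occurrence matrix for a 3-word vocabulary
M = np.array([[0., 2., 1.],
              [2., 0., 1.],
              [1., 1., 0.]])

U, s, Vt = np.linalg.svd(M)      # full SVD: M = U @ diag(s) @ Vt
k = 2
word_vectors = U[:, :k] * s[:k]  # keep only the top-k singular directions
```

Each row of `word_vectors` is then a k-dimensional embedding of one vocabulary word. Computing this SVD is the quadratic-cost step noted above.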
2 Iteration Based Methods
2.1 CBOW Model
▴ Key idea: predicting a center word from the surrounding context
▴ Unknowns: two matrices, V ∈ R^(n×|V|) and U ∈ R^(|V|×n)
▴ Notation for CBOW Model:
w_i: word i from vocabulary V
V ∈ R^(n×|V|): input word matrix
v_i: the input vector representation of word w_i
U ∈ R^(|V|×n): output word matrix
u_i: the output vector representation of word w_i
▴ Steps:
We generate the one-hot word vectors (x^(c−m), …, x^(c−1), x^(c+1), …, x^(c+m)) for the input context of size m.
We get the embedded word vectors for the context: v_(c−m) = V x^(c−m), v_(c−m+1) = V x^(c−m+1), …, v_(c+m) = V x^(c+m).
Average these vectors to get v̂ = (v_(c−m) + v_(c−m+1) + … + v_(c+m)) / 2m.
Generate a score vector z = U v̂.
Turn the scores into probabilities ŷ = softmax(z).
We desire our generated probabilities, ŷ, to match the true probabilities, y, which also happens to be the one-hot vector of the actual word.
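The CBOW forward pass above can be sketched in numpy. The vocabulary size |V|, embedding dimension n, and the context word indices here are made-up examples, and the matrices are randomly initialized rather than trained:

```python
import numpy as np

rng = np.random.default_rng(0)
V_size, n, m = 10, 4, 2              # |V|, embedding dim, context size m
V = rng.normal(size=(n, V_size))     # input word matrix,  n x |V|
U = rng.normal(size=(V_size, n))     # output word matrix, |V| x n

context = [1, 2, 4, 5]               # indices of the 2m context words (illustrative)
# multiplying V by a one-hot vector just selects a column, so we index directly;
# mean over the 2m columns is the average (v_{c-m} + ... + v_{c+m}) / 2m
v_hat = V[:, context].mean(axis=1)
z = U @ v_hat                        # score vector over the vocabulary
y_hat = np.exp(z) / np.exp(z).sum()  # softmax: probabilities over |V|
```

Training would then adjust V and U so that `y_hat` places high probability on the true center word.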
2.2 Skip-Gram Model
▴ Key idea: predicting the surrounding context words given a center word
▴ Steps:
We generate the one-hot input vector x for the center word.
We get the embedded word vector for the center word: v_c = V x.
Since there is no averaging, just set v̂ = v_c.
Generate 2m score vectors, u_(c−m), …, u_(c−1), u_(c+1), …, u_(c+m), using u = U v_c (since there is only one center word, these score vectors are all identical).
Turn each of the scores into probabilities: y = softmax(u).
We desire our generated probability vectors to match the true probabilities y^(c−m), …, y^(c−1), y^(c+1), …, y^(c+m), the one-hot vectors of the actual output.
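The skip-gram forward pass can be sketched the same way; as in the CBOW sketch, the sizes, the center-word index, and the random matrices are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
V_size, n = 10, 4                    # |V| and embedding dim (made-up)
V = rng.normal(size=(n, V_size))     # input word matrix,  n x |V|
U = rng.normal(size=(V_size, n))     # output word matrix, |V| x n

c = 3                                # index of the center word (illustrative)
v_c = V[:, c]                        # v̂ = v_c; no averaging in skip-gram
z = U @ v_c                          # one score vector, shared by all 2m predictions
y_hat = np.exp(z) / np.exp(z).sum()  # softmax over the vocabulary
```

Each of the 2m context-word predictions compares this same `y_hat` against a different one-hot target vector.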