Why do the output layer weights become word vectors in Word2Vec? [D]

I'm trying to understand the intuition behind Word2Vec training using a neural network.

In Word2Vec (CBOW or Skip-gram), we often hear that the weight matrices learned during training contain the vector representations (embeddings) of words. However, I don't understand why the weights of the hidden-to-output layer (or output weight matrix) end up representing semantic features of words.
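
To make the setup concrete, here is a minimal sketch of the forward pass I have in mind, assuming Skip-gram with a full softmax and no bias terms. The names `W_in`, `W_out`, `predict_context` and the toy vocabulary are made up for illustration and are not from any library. It only shows that each output score is a dot product between the hidden vector and one row of the output matrix, so every word owns one row there as well; it does not explain why those rows end up meaningful.

```python
import math
import random

random.seed(0)

vocab = ["king", "queen", "apple", "banana"]  # toy vocabulary (hypothetical)
V, d = len(vocab), 3                          # vocabulary size, embedding size

# Input-to-hidden weights: one d-dimensional row per word (the "input" vectors).
W_in = [[random.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(V)]
# Hidden-to-output weights: also one d-dimensional row per word (the "output" vectors).
W_out = [[random.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(V)]


def predict_context(center_idx):
    """Return softmax probabilities over the vocabulary for one center word."""
    h = W_in[center_idx]  # the hidden layer is just the center word's input row
    # Each output score is a dot product between h and that word's output row.
    scores = [sum(u_k * h_k for u_k, h_k in zip(u, h)) for u in W_out]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = predict_context(vocab.index("king"))
```

Here `probs[i]` is the model's (untrained) probability that `vocab[i]` appears in the context of "king".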

Why do these weights become meaningful vector representations instead of just being parameters used to make predictions?

I've gone through multiple YouTube videos and blog posts, and even asked ChatGPT several times, but I still haven't found an explanation that truly clicks for me. Most resources state that the weights become embeddings, but not why this happens, either intuitively or mathematically.

Could someone provide a clear intuition or mathematical explanation of why the output-layer weights end up encoding semantic information about words?

Any good resources that explain this particularly well would also be appreciated.

submitted by /u/aaryantiwari26