Word2Vec
What it is
Word2Vec is a shallow, two-layer neural network that learns dense word embeddings from large unlabeled text corpora. Introduced by Mikolov et al. (2013), it comes with two training objectives, skip-gram and CBOW (Continuous Bag of Words), both efficient enough to train on billion-token corpora.
[illustrate: Diagrams of skip-gram architecture (target word → context words) and CBOW architecture (context words → target word), with embedding layer highlighted]
How it works
Skip-gram: Given a target word, predict surrounding context words within a window. The model learns to maximize the probability of observed contexts.
CBOW: Given the context words in a window, predict the target word at the center; the context embeddings are averaged into a single input vector.
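The two objectives share the same sliding-window view of the corpus. Here is a minimal sketch of how (target, context) training pairs are generated; the window size, tokenizer, and function name are illustrative choices, not from the original papers:

def training_pairs(tokens, window=2):
    # For each position, collect up to `window` words on each side.
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        # Skip-gram predicts each context word from the target;
        # CBOW predicts the target from the (averaged) context.
        yield target, context

tokens = "The cat jumped quickly".split()
for target, context in training_pairs(tokens):
    print(target, "->", context)  # e.g. jumped -> ['The', 'cat', 'quickly']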
Both use:
- Embedding layer: maps tokens to dense vectors
- Negative sampling or hierarchical softmax: avoids an expensive full softmax over the vocabulary (a minimal sketch of negative sampling follows this list)
- Optimization: stochastic gradient descent with a linearly decaying learning rate
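Negative sampling turns each training pair into a handful of binary classifications instead of a softmax over the whole vocabulary. A minimal NumPy sketch of the skip-gram negative-sampling (SGNS) loss for one pair; the names and dimensions are illustrative:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(v_target, u_context, u_negatives):
    # v_target: input embedding of the target word
    # u_context: output embedding of one observed context word
    # u_negatives: output embeddings of k words drawn from a noise distribution
    pos = np.log(sigmoid(u_context @ v_target))             # pull observed pair together
    neg = np.sum(np.log(sigmoid(-u_negatives @ v_target)))  # push noise pairs apart
    return -(pos + neg)  # minimized with SGD; cost is O(k), not O(vocabulary)

rng = np.random.default_rng(0)
d, k = 100, 5  # embedding dimension, negatives per positive pair
print(sgns_loss(rng.normal(size=d), rng.normal(size=d), rng.normal(size=(k, d))))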
Skip-gram works well even with less training data and represents rare words better; CBOW trains several times faster and is slightly better for frequent words.
Example
After training on a corpus:
# Skip-gram: given "jumped", predict ["The", "cat", "quickly"]
# CBOW: given ["The", "cat", "quickly"], predict "jumped"
vector("king").similarity(vector("queen")) = 0.85
vector("Paris") - vector("France") + vector("Japan") ≈ vector("Tokyo")
Variants and history
Word2Vec appeared in 2013 as two papers on arXiv. Gensim's open-source implementation made it widely accessible. Subsequent work built on the idea: fastText added character n-grams to capture subword information; GloVe combined global co-occurrence statistics with local context; contextual models (ELMo, BERT) learn position-dependent embeddings. Word2Vec remains the conceptual foundation for modern embeddings.
When to use it
Use Word2Vec for:
- Quick semantic similarity without retraining
- Lightweight transfer learning
- Exploration of word relationships
- Baseline for downstream tasks
- Interpretable vector representations
Word2Vec is fast and interpretable but context-agnostic: every sense of a polysemous word shares a single vector. For tasks requiring context sensitivity, syntactic precision, or domain adaptation, consider contextual embeddings or fine-tuned models.