Word2Vec

What it is

Word2Vec is a two-layer neural network that learns dense word embeddings from large unlabeled text corpora. Published by Mikolov et al. (2013), it introduced two training objectives—skip-gram and CBOW (Continuous Bag of Words)—that are computationally efficient enough to train on billion-token corpora.

[illustrate: Diagrams of skip-gram architecture (target word → context words) and CBOW architecture (context words → target word), with embedding layer highlighted]

How it works

Skip-gram: Given a target word, predict surrounding context words within a window. The model learns to maximize the probability of observed contexts.
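
In symbols: for a training corpus w_1, …, w_T and symmetric window size c, skip-gram maximizes the average log-probability

  \frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\; j \ne 0} \log p(w_{t+j} \mid w_t)

where p(w_{t+j} | w_t) is a softmax over the output embeddings (in practice approximated as described below).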

CBOW: Given context words in a window, predict the target word at the center.
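
Concretely, the two objectives differ only in how training pairs are built from the corpus. A minimal Python sketch (a toy illustration, not the reference implementation):

# Build skip-gram and CBOW training pairs for a symmetric window of size 2.
sentence = ["The", "cat", "jumped", "quickly"]
window = 2

for i, target in enumerate(sentence):
    lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
    context = [sentence[j] for j in range(lo, hi) if j != i]
    skipgram_pairs = [(target, c) for c in context]  # one (input, label) pair per context word
    cbow_pair = (context, target)                    # all context words jointly predict the target
    print(skipgram_pairs, cbow_pair)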

Both use:

  • Embedding layer: maps tokens to dense vectors
  • Negative sampling or hierarchical softmax: avoids the expensive full softmax over the vocabulary (see the sketch after this list)
  • Optimization: stochastic gradient descent with a learning rate that decays linearly over training
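
The sketch below shows one skip-gram update with negative sampling in NumPy. It is schematic: real implementations sample negatives from the unigram distribution raised to the 3/4 power (here, uniform for brevity), and the hyperparameters are illustrative.

import numpy as np

rng = np.random.default_rng(0)
V, D = 1000, 50                              # vocabulary size, embedding dimension
W_in = rng.normal(scale=0.1, size=(V, D))    # target-word (input) embeddings
W_out = rng.normal(scale=0.1, size=(V, D))   # context-word (output) embeddings

def sgns_step(target, context, k=5, lr=0.025):
    # Maximize log sigmoid(u_ctx . v_tgt) plus, for k negatives, log sigmoid(-u_neg . v_tgt).
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    negatives = rng.integers(0, V, size=k)   # simplified: uniform instead of unigram^0.75
    v_t = W_in[target].copy()
    grad_in = np.zeros(D)
    for c, label in [(context, 1.0)] + [(int(n), 0.0) for n in negatives]:
        g = sigmoid(W_out[c] @ v_t) - label  # gradient of the logistic loss w.r.t. the score
        grad_in += g * W_out[c]              # accumulate before W_out[c] is updated
        W_out[c] -= lr * g * v_t
    W_in[target] -= lr * grad_in

sgns_step(target=3, context=7)               # one step on the pair (word 3, context word 7)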

Skip-gram represents rare words better and works well even with smaller training sets; CBOW trains several times faster and is slightly more accurate for frequent words.
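
With gensim (parameter names per gensim 4.x), the choice between the two objectives is a single flag; this toy corpus is only for illustration:

from gensim.models import Word2Vec

sentences = [["the", "cat", "jumped", "quickly"],
             ["the", "dog", "barked", "loudly"]]   # toy corpus

# sg=1 selects skip-gram, sg=0 (the default) selects CBOW;
# negative=5 uses negative sampling, hs=1 would use hierarchical softmax instead.
skipgram = Word2Vec(sentences, vector_size=50, window=2, sg=1, negative=5, min_count=1)
cbow = Word2Vec(sentences, vector_size=50, window=2, sg=0, negative=5, min_count=1)

print(skipgram.wv.most_similar("cat", topn=3))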

Example

After training on a corpus:

# Skip-gram: given "jumped", predict ["The", "cat", "quickly"]
# CBOW: given ["The", "cat", "quickly"], predict "jumped"

vector("king").similarity(vector("queen")) = 0.85
vector("Paris") - vector("France") + vector("Japan") ≈ vector("Tokyo")

Variants and history

Word2Vec appeared in 2013 as two papers on arXiv. Gensim’s open-source implementation made it widely accessible. Subsequent work refined the architecture: fastText added character n-grams for subword information; GloVe combined global co-occurrence statistics with local context; contextual models (ELMo, BERT) learned position-dependent embeddings. Word2Vec remains the conceptual foundation for modern embeddings.

When to use it

Use Word2Vec for:

  • Quick semantic similarity without retraining
  • Lightweight transfer learning
  • Exploration of word relationships
  • Baseline for downstream tasks
  • Interpretable vector representations

Word2Vec is fast and interpretable but context-agnostic: each word receives a single vector, so homonyms and polysemous words (e.g., "bank") conflate their senses. For tasks requiring syntactic precision or domain adaptation, consider contextual embeddings or fine-tuned models.

See also