Word Embedding
What it is
A word embedding is a learned dense vector representation of a word in continuous space, typically 50–300 dimensions. Unlike one-hot or sparse representations, embeddings encode semantic meaning so that words with similar contexts occupy nearby positions. The distance or similarity between vectors reflects linguistic relationships.
[illustrate: 2D projection of word embeddings showing “king” near “queen”, “man” near “woman”, “paris” near “france”]
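To make the contrast with one-hot vectors concrete, here is a minimal sketch using a toy vocabulary and made-up 4-dimensional vectors (real embeddings are learned from data and much higher-dimensional):

import numpy as np

# Toy vocabulary with hand-picked 4-dimensional vectors; real embeddings
# are learned and typically use 50-300 dimensions.
vocab = {"king": 0, "queen": 1, "spaceship": 2}
embeddings = np.array([
    [0.8, 0.1, 0.7, 0.2],   # "king"
    [0.7, 0.2, 0.8, 0.1],   # "queen"
    [0.0, 0.9, 0.1, 0.8],   # "spaceship"
])

def one_hot(word):
    # Sparse representation: a single 1 in a vocabulary-sized vector.
    v = np.zeros(len(vocab))
    v[vocab[word]] = 1.0
    return v

def embed(word):
    # Dense representation: a row of the embedding matrix.
    return embeddings[vocab[word]]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(one_hot("king"), one_hot("queen")))    # 0.0 -- all one-hot vectors are orthogonal
print(cosine(embed("king"), embed("queen")))        # ~0.98 -- similar words, nearby vectors
print(cosine(embed("king"), embed("spaceship")))    # ~0.24 -- dissimilar words, distant vectors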
How it works
Embeddings are learned by training neural networks on large corpora with unsupervised objectives:
- Skip-gram: predict context words from target word
- CBOW: predict target from context words
- Co-occurrence: factorize term co-occurrence matrices
- Language modeling: predict next token in sequence
The learned weights of the embedding layer become the word vectors. Training encodes two principles: (1) words that appear in similar contexts end up with similar vectors, and (2) very frequent words are downweighted through subsampling (or contrasted against randomly drawn words via negative sampling) so they do not dominate training; a minimal training sketch follows.
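To illustrate the skip-gram objective with negative sampling, the sketch below trains toy vectors in plain NumPy; the corpus, dimensions, and hyperparameters are placeholders, and a real system would use an optimized library such as gensim or fastText.

import numpy as np

rng = np.random.default_rng(0)

# Toy corpus; a real model would train on billions of tokens.
corpus = "the king rules the land the queen rules the land".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 8                     # vocabulary size, embedding dimension

W_in = rng.normal(0, 0.1, (V, D))        # target embeddings -- these become the word vectors
W_out = rng.normal(0, 0.1, (V, D))       # context embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

lr, window, k = 0.05, 2, 3               # learning rate, context window, negative samples

for epoch in range(200):
    for pos, word in enumerate(corpus):
        t = idx[word]
        for off in range(-window, window + 1):
            if off == 0 or not 0 <= pos + off < len(corpus):
                continue
            # Skip-gram: the target word predicts each context word;
            # each true pair is contrasted with k randomly drawn negatives.
            pairs = [(idx[corpus[pos + off]], 1.0)]
            pairs += [(int(rng.integers(V)), 0.0) for _ in range(k)]
            for j, label in pairs:
                grad = sigmoid(W_in[t] @ W_out[j]) - label   # logistic-loss gradient
                g_in = grad * W_out[j]
                W_out[j] -= lr * grad * W_in[t]
                W_in[t] -= lr * g_in

word_vectors = W_in                      # the learned embedding layer is kept; W_out is discarded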
Example
After training on Wikipedia, vectors exhibit additive compositionality and graded similarity:
vector("king") - vector("man") + vector("woman") ≈ vector("queen")cosine_similarity(vector("dog"), vector("puppy"))= 0.89cosine_similarity(vector("dog"), vector("spaceship"))= 0.12
Variants and history
Word embeddings emerged in the early 2010s alongside neural language models. Word2Vec (Mikolov et al., 2013) popularized efficient skip-gram training. GloVe (Pennington et al., 2014) combined global matrix factorization with local context windows. fastText (Bojanowski et al., 2017) extended embeddings to character n-grams for subword information and out-of-vocabulary handling. Contextual models such as BERT and GPT go further, producing token-specific embeddings that depend on the surrounding text.
When to use it
Use word embeddings for:
- Semantic similarity and clustering
- Cold-start recommendations
- Nearest-neighbour exploration
- Input to downstream neural models
- Lightweight semantic search
Embeddings are fast and interpretable but context-agnostic: every occurrence of a word maps to the same vector, so different senses (e.g. "bank" as river vs. finance) are conflated. For context-sensitive tasks (word-sense disambiguation, syntax-heavy problems), prefer contextual models.
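As a sketch of the nearest-neighbour and lightweight-search uses listed above, the helper below ranks a vocabulary by cosine similarity to a query vector; `vectors`, `words`, and `word_index` stand in for a trained embedding matrix, its vocabulary list, and a word-to-row lookup.

import numpy as np

def nearest(query_vec, vectors, words, topn=5):
    # Rank all words by cosine similarity to the query vector.
    sims = (vectors @ query_vec) / (
        np.linalg.norm(vectors, axis=1) * np.linalg.norm(query_vec) + 1e-9
    )
    best = np.argsort(-sims)[:topn]
    return [(words[i], float(sims[i])) for i in best]

# Illustrative usage (assumes `vectors`, `words`, and `word_index` exist):
# nearest(vectors[word_index["dog"]], vectors, words)
# returns "dog" itself first, followed by related words such as "puppy".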