Word Embedding

What it is

A word embedding is a learned dense vector representation of a word in continuous space, typically 50–300 dimensions. Unlike one-hot or sparse representations, embeddings encode semantic meaning: words that appear in similar contexts occupy nearby positions, so distance or cosine similarity between vectors serves as a proxy for linguistic relatedness.

[illustrate: 2D projection of word embeddings showing “king” near “queen”, “man” near “woman”, “paris” near “france”]
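
A minimal NumPy sketch of the underlying data structure: an embedding layer is just a V × d lookup table, in contrast to a sparse V-dimensional one-hot encoding. The vocabulary and vector values below are made up for illustration; in practice the matrix rows come from training.

    import numpy as np

    vocab = ["king", "queen", "man", "woman"]
    V, d = len(vocab), 4                 # vocabulary size, embedding dimension

    # One-hot: sparse, V-dimensional, carries no notion of similarity
    one_hot = np.eye(V)                  # one_hot[i] represents vocab[i]

    # Dense embedding matrix: each row is a d-dimensional vector
    # (random here; a trained model would supply these weights)
    rng = np.random.default_rng(0)
    E = rng.normal(size=(V, d))

    def embed(word):
        """Look up the dense vector for a word (an O(1) row read)."""
        return E[vocab.index(word)]

    print(embed("king"))                 # a 4-d dense vector
    print(one_hot[vocab.index("king")])  # the sparse indicator row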

How it works

Embeddings are learned by training neural networks on large text corpora with self-supervised objectives, where the training signal comes from the text itself:

  • Skip-gram: predict context words from the target word
  • CBOW: predict the target word from its context
  • Co-occurrence: factorize global term co-occurrence matrices (as in GloVe)
  • Language modeling: predict the next token in a sequence

The learned weights of the embedding layer become the word vectors. Training is guided by two practical principles: (1) words that appear in similar contexts should receive similar vectors, and (2) very frequent words are downweighted through subsampling, while techniques such as negative sampling keep training tractable, so rare words still receive a useful learning signal.
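
As a concrete illustration of the skip-gram objective with negative sampling, here is a minimal NumPy sketch. The toy corpus, hyperparameters, and initialization are invented for demonstration and are not tuned; this is not the reference word2vec implementation.

    import numpy as np

    # Toy corpus and vocabulary (illustrative only)
    corpus = "the king rules the land and the queen rules the land".split()
    vocab = sorted(set(corpus))
    idx = {w: i for i, w in enumerate(vocab)}
    V, d, lr, window, k = len(vocab), 8, 0.05, 2, 3

    rng = np.random.default_rng(0)
    W_in = rng.normal(scale=0.1, size=(V, d))   # target vectors: the embeddings
    W_out = rng.normal(scale=0.1, size=(V, d))  # context ("output") vectors

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    for epoch in range(200):
        for pos, word in enumerate(corpus):
            t = idx[word]
            lo, hi = max(0, pos - window), min(len(corpus), pos + window + 1)
            for cpos in range(lo, hi):
                if cpos == pos:
                    continue
                # One positive (target, context) pair plus k random negatives.
                # (A negative may occasionally collide with the true context;
                # real implementations resample.)
                pairs = [(idx[corpus[cpos]], 1.0)]
                pairs += [(int(rng.integers(V)), 0.0) for _ in range(k)]
                for o, label in pairs:
                    v_out = W_out[o].copy()
                    grad = sigmoid(W_in[t] @ v_out) - label  # logistic-loss gradient
                    W_out[o] -= lr * grad * W_in[t]
                    W_in[t] -= lr * grad * v_out

    # Rows of W_in are now the learned word vectors
    print(W_in[idx["king"]])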

Example

After training on Wikipedia, vectors exhibit compositional structure; the sketch after this list reproduces these checks with pretrained vectors:

  • vector("king") - vector("man") + vector("woman") ≈ vector("queen")
  • cosine_similarity(vector("dog"), vector("puppy")) = 0.89
  • cosine_similarity(vector("dog"), vector("spaceship")) = 0.12

Variants and history

Dense word representations trace back to early neural language models (Bengio et al., 2003), but became mainstream in the early 2010s. Word2Vec (Mikolov et al., 2013) popularized efficient skip-gram training. GloVe (Pennington et al., 2014) combined global matrix factorization with local context windows. fastText (Bojanowski et al., 2017) extended embeddings to character n-grams, providing subword information and out-of-vocabulary handling. Contextual models such as ELMo (Peters et al., 2018), BERT, and GPT go further, producing token-specific embeddings that depend on the surrounding text.

When to use it

Use word embeddings for:

  • Semantic similarity and clustering
  • Cold-start recommendations
  • Nearest-neighbour exploration
  • Input to downstream neural models
  • Lightweight semantic search (a minimal sketch follows this list)
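
As a sketch of the nearest-neighbour and lightweight-search use cases, the snippet below runs a brute-force cosine search over an embedding matrix. The toy vocabulary and vectors are invented; in practice the rows would come from a trained model such as word2vec or GloVe.

    import numpy as np

    # Toy vocabulary with made-up 3-d vectors
    vocab = ["dog", "puppy", "car", "spaceship"]
    E = np.array([[0.90, 0.10, 0.00],
                  [0.85, 0.15, 0.05],
                  [0.10, 0.90, 0.20],
                  [0.00, 0.20, 0.95]])
    E /= np.linalg.norm(E, axis=1, keepdims=True)  # unit-normalize rows once

    def nearest(word, topn=2):
        """Brute-force cosine search: dot products against all rows."""
        sims = E @ E[vocab.index(word)]
        order = np.argsort(-sims)
        return [(vocab[i], round(float(sims[i]), 3))
                for i in order if vocab[i] != word][:topn]

    print(nearest("dog"))  # "puppy" ranks first

For large vocabularies, the brute-force scan is typically replaced by an approximate nearest-neighbour index.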

Static embeddings are fast and cheap to serve but context-agnostic: each word gets a single vector regardless of sense, so "bank" the institution and "bank" of a river share one representation. For context-sensitive tasks (word-sense disambiguation, syntax-heavy problems), prefer contextual models.

See also