Text Classification
What it is
Text classification assigns one or more categorical labels to documents or sentences. Common tasks include sentiment analysis (positive/negative), topic classification (sports/politics/tech), spam detection, and intent recognition (for chatbots). Classification is one of the most common NLP tasks, and modern neural models solve it efficiently and accurately.
[illustrate: Text document → BERT encoder → classification head → probability distribution over classes; example showing sentiment scores]
How it works
- Input: Text document or sentence
- Encoding: Represent text as vector(s)
  - Bag-of-words
  - TF-IDF
  - Word embedding average
  - BERT [CLS] token
- Classification:
  - Linear layer: embedding → logits
  - Softmax: logits → probabilities
  - Argmax or threshold: probabilities → classes
- Output: Class labels (often with confidence scores)
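The classification step above can be sketched in a few lines of NumPy. This is a minimal illustration with random stand-in weights (a trained model would learn W and b); the dimensions are typical for a BERT-base encoder but otherwise arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: a 768-dim encoder output (e.g. a BERT [CLS] vector)
# and 4 classes. Weights are random stand-ins for trained parameters.
hidden_dim, num_classes = 768, 4
cls_embedding = rng.normal(size=hidden_dim)
W = rng.normal(size=(num_classes, hidden_dim)) * 0.02
b = np.zeros(num_classes)

# Linear layer: embedding -> logits
logits = W @ cls_embedding + b

# Softmax: logits -> probabilities (subtract the max for stability)
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# Argmax: probabilities -> predicted class index
predicted = int(np.argmax(probs))
```

The same head works for binary and multi-class tasks; only `num_classes` changes.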
Example
# Sentiment analysis (binary: positive/negative)
Text: "This movie was amazing!"
BERT encoding → [CLS] token embedding
Classification head:
logits = W_class @ [CLS] + b
probs = softmax(logits)
output: positive (prob=0.95)
# Multi-class: topic classification
Text: "The Lakers won the championship"
Classes: [sports, politics, tech, entertainment]
Output: sports (prob=0.92)
# Multi-label: genre tagging
Text: "A romantic comedy with sci-fi elements"
Output: [romance (0.88), comedy (0.85), sci-fi (0.72)]
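For multi-label outputs like the genre example, the head uses an independent sigmoid per class plus a threshold, rather than a softmax (which would force the probabilities to sum to 1 and allow only one label). A minimal sketch, with hypothetical logits standing in for a trained model's output:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

classes = ["romance", "comedy", "sci-fi", "horror"]

# Hypothetical per-class logits a trained multi-label head might emit
logits = np.array([2.0, 1.7, 1.0, -2.5])

# Independent probability per label; each class decided on its own
probs = sigmoid(logits)

# Threshold (0.5 here) instead of argmax: any number of labels can fire
labels = [c for c, p in zip(classes, probs) if p >= 0.5]
```

Here `labels` comes out as the three genres above the threshold, while "horror" is excluded.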
Variants and history
Text classification is a foundational NLP task; statistical approaches emerged in the 1990s. Early methods used Naive Bayes and SVMs with bag-of-words features. Neural text classification (CNNs and RNNs, 2014+) improved over hand-crafted features. Transfer learning with BERT (2018+) achieved strong results with minimal task-specific training. Variants include hierarchical classification (category hierarchy), zero-shot classification (unseen classes), and few-shot learning (small training sets).
When to use it
Use text classification for:
- Sentiment analysis (reviews, social media)
- Topic classification (news routing)
- Intent recognition (chatbots, assistants)
- Spam/abuse detection
- Language identification
- Content moderation
Classification is efficient and reliable. Most production systems fine-tune BERT or prompt an instruction-tuned LLM. For a simple baseline, or when interpretability matters, logistic regression on TF-IDF features is still effective.
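The TF-IDF features behind that baseline can be computed by hand to see what the classifier consumes. This is a minimal sketch over a toy corpus (in practice one would use a library vectorizer such as scikit-learn's TfidfVectorizer, which adds smoothing and normalization):

```python
import math
from collections import Counter

# Toy corpus; each document becomes one TF-IDF vector over the vocabulary
docs = [
    "the movie was great great fun",
    "the movie was boring",
    "politics news of the day",
]

tokenized = [d.split() for d in docs]
vocab = sorted({w for toks in tokenized for w in toks})
# Document frequency: in how many documents each word appears
df = Counter(w for toks in tokenized for w in set(toks))
N = len(docs)

def tfidf(tokens):
    """Term frequency * inverse document frequency, unsmoothed."""
    tf = Counter(tokens)
    return [tf[w] / len(tokens) * math.log(N / df[w]) for w in vocab]

features = [tfidf(toks) for toks in tokenized]
```

Note how a word that occurs in every document ("the") gets idf = log(N/N) = 0 and contributes nothing, while a distinctive word like "great" gets a positive weight: exactly the property that makes TF-IDF plus logistic regression a strong, interpretable baseline.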