Text Classification
What it is
Text classification assigns one or more categorical labels to documents or sentences. Common tasks include sentiment analysis (positive/negative), topic classification (sports/politics/tech), spam detection, and intent recognition (for chatbots). Classification is one of the most common NLP tasks, and modern neural models solve it efficiently and accurately.
[illustrate: Text document → BERT encoder → classification head → probability distribution over classes; example showing sentiment scores]
How it works
- Input: Text document or sentence
- Encoding: Represent text as vector(s)
  - Bag-of-words
  - TF-IDF
  - Word embedding average
  - BERT [CLS] token
- Classification:
  - Linear layer: embedding → logits
  - Softmax: logits → probabilities
  - Argmax or threshold: probabilities → classes
- Output: Class labels (often with confidence scores)
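The classification step above can be sketched in a few lines of NumPy. This is a minimal illustration with random stand-in weights (a trained model would learn W and b); the dimensions are typical for a BERT-base encoder but otherwise arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: a 768-dim encoder output (e.g. a BERT [CLS] vector)
# and 4 classes. Weights are random stand-ins for trained parameters.
hidden_dim, num_classes = 768, 4
cls_embedding = rng.normal(size=hidden_dim)
W = rng.normal(size=(num_classes, hidden_dim)) * 0.02
b = np.zeros(num_classes)

# Linear layer: embedding -> logits
logits = W @ cls_embedding + b

# Softmax: logits -> probabilities (subtract the max for stability)
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# Argmax: probabilities -> predicted class index
predicted = int(np.argmax(probs))
```

The same head works for binary and multi-class tasks; only `num_classes` changes.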
Example
# Sentiment analysis (binary: positive/negative)
Text: "This movie was amazing!"
BERT encoding → [CLS] token embedding
Classification head:
logits = W_class @ [CLS] + b
probs = softmax(logits)
output: positive (prob=0.95)
# Multi-class: topic classification
Text: "The Lakers won the championship"
Classes: [sports, politics, tech, entertainment]
Output: sports (prob=0.92)
# Multi-label: genre tagging
Text: "A romantic comedy with sci-fi elements"
Output: [romance (0.88), comedy (0.85), sci-fi (0.72)]
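For multi-label outputs like the genre example, the head uses an independent sigmoid per class plus a threshold, rather than a softmax (which would force the probabilities to sum to 1 and allow only one label). A minimal sketch, with hypothetical logits standing in for a trained model's output:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

classes = ["romance", "comedy", "sci-fi", "horror"]

# Hypothetical per-class logits a trained multi-label head might emit
logits = np.array([2.0, 1.7, 1.0, -2.5])

# Independent probability per label; each class decided on its own
probs = sigmoid(logits)

# Threshold (0.5 here) instead of argmax: any number of labels can fire
labels = [c for c, p in zip(classes, probs) if p >= 0.5]
```

Here `labels` comes out as the three genres above the threshold, while "horror" is excluded.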
Variants and history
Text classification is a foundational NLP task; statistical approaches emerged in the 1990s. Early methods used Naive Bayes and SVMs with bag-of-words features. Neural text classification (CNNs and RNNs, 2014+) improved over hand-crafted features. Transfer learning with BERT (2018+) achieved strong results with minimal task-specific training. Variants include hierarchical classification (category hierarchy), zero-shot classification (unseen classes), and few-shot learning (small training sets).
When to use it
Use text classification for:
- Sentiment analysis (reviews, social media)
- Topic classification (news routing)
- Intent recognition (chatbots, assistants)
- Spam/abuse detection
- Language identification
- Content moderation
Classification is efficient and reliable. Most production systems fine-tune BERT or prompt an instruction-tuned LLM. For a simple baseline, or when interpretability matters, logistic regression on TF-IDF features is still effective.
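The TF-IDF features behind that baseline can be computed by hand to see what the classifier consumes. This is a minimal sketch over a toy corpus (in practice one would use a library vectorizer such as scikit-learn's TfidfVectorizer, which adds smoothing and normalization):

```python
import math
from collections import Counter

# Toy corpus; each document becomes one TF-IDF vector over the vocabulary
docs = [
    "the movie was great great fun",
    "the movie was boring",
    "politics news of the day",
]

tokenized = [d.split() for d in docs]
vocab = sorted({w for toks in tokenized for w in toks})
# Document frequency: in how many documents each word appears
df = Counter(w for toks in tokenized for w in set(toks))
N = len(docs)

def tfidf(tokens):
    """Term frequency * inverse document frequency, unsmoothed."""
    tf = Counter(tokens)
    return [tf[w] / len(tokens) * math.log(N / df[w]) for w in vocab]

features = [tfidf(toks) for toks in tokenized]
```

Note how a word that occurs in every document ("the") gets idf = log(N/N) = 0 and contributes nothing, while a distinctive word like "great" gets a positive weight: exactly the property that makes TF-IDF plus logistic regression a strong, interpretable baseline.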