Sequence-to-Sequence

What it is

Sequence-to-Sequence (Seq2Seq) is an encoder-decoder neural architecture that maps a variable-length input sequence to a variable-length output sequence. Seq2Seq models handle tasks where the input and output differ in length and structure: machine translation, summarization, paraphrasing, dialogue, and code-to-documentation generation.

[illustrate: Encoder processing input sequence; decoder generating output token-by-token; attention connections between encoder and decoder]

How it works

  1. Encoder:

    • Processes input sequence (e.g., text to translate)
    • Produces context vector(s) or representations
    • Often bidirectional (RNN encoders read the input left-to-right and right-to-left; Transformer encoders attend over the whole input)
  2. Decoder:

    • Generates output sequence token-by-token
    • Causal self-attention (each position can attend only to past output tokens)
    • Cross-attention to encoder outputs (allows focusing on relevant input parts)
  3. Training:

    • Teacher forcing: during training, the decoder is fed the true previous target tokens as input
    • At inference, the decoder is fed its own previously generated tokens (autoregressive generation)
  4. Loss: Cross-entropy loss on each target token (a minimal training sketch follows this list)
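
A minimal PyTorch sketch of one training step, assuming toy vocabulary sizes and random token ids in place of a real parallel corpus (positional encodings omitted for brevity): the encoder and decoder come from nn.Transformer, the decoder input is the shifted target (teacher forcing), the causal mask restricts decoder self-attention to past positions, and the loss is per-token cross-entropy.

# Sketch: one seq2seq training step with teacher forcing (toy data)
import torch
import torch.nn as nn

SRC_VOCAB, TGT_VOCAB, D_MODEL = 1000, 1000, 128   # placeholder sizes
src_embed = nn.Embedding(SRC_VOCAB, D_MODEL)
tgt_embed = nn.Embedding(TGT_VOCAB, D_MODEL)
transformer = nn.Transformer(d_model=D_MODEL, nhead=4,
                             num_encoder_layers=2, num_decoder_layers=2,
                             batch_first=True)
out_proj = nn.Linear(D_MODEL, TGT_VOCAB)
loss_fn = nn.CrossEntropyLoss()

src = torch.randint(0, SRC_VOCAB, (8, 20))   # batch of 8 source sequences, length 20
tgt = torch.randint(0, TGT_VOCAB, (8, 15))   # batch of 8 target sequences, length 15

# Teacher forcing: the decoder input is the target shifted right by one position,
# and the model is trained to predict the next target token at every position.
tgt_in, tgt_out = tgt[:, :-1], tgt[:, 1:]
causal_mask = nn.Transformer.generate_square_subsequent_mask(tgt_in.size(1))

hidden = transformer(src_embed(src), tgt_embed(tgt_in), tgt_mask=causal_mask)
logits = out_proj(hidden)                                  # (8, 14, TGT_VOCAB)
loss = loss_fn(logits.reshape(-1, TGT_VOCAB), tgt_out.reshape(-1))
loss.backward()                                            # then step an optimizer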

Example

# Machine translation (English → French)
Input: "Hello, how are you?"
Encoder: Process input, produce representations
Decoder: Generate French token-by-token
  Step 1: <start> → "Bonjour"
  Step 2: <start>, "Bonjour" → ","
  Step 3: ... → "comment"
  Step 4: ... → "allez-vous?"
Output: "Bonjour, comment allez-vous?"

# Summarization
Input: "[long document]"
Output: "[short summary]"

Variants and history

Seq2Seq originated with the RNN encoder-decoder (Cho et al., 2014; Sutskever et al., 2014). LSTM-based seq2seq improved over simpler RNNs. The attention mechanism (Bahdanau et al., 2014) removed the bottleneck of compressing the entire input into a single vector. Transformer seq2seq (Vaswani et al., 2017) enabled parallel training across sequence positions. Pretrained seq2seq models (T5, BART, 2019–2020) are fine-tuned on diverse tasks. Decoder-only models (GPT-style) have partially replaced traditional seq2seq.

When to use it

Use seq2seq when:

  • Input and output structure differ (translation, summarization)
  • Variable-length sequences are natural
  • Attention over input is beneficial
  • Pre-trained models (T5, BART) are available for your language pair/task
  • Fine-tuning on task data is feasible

Modern seq2seq practice relies on pre-trained models and fine-tuning; training from scratch is expensive. Decoder-only models (GPT-style) are now competitive for many seq2seq tasks.
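
A sketch of a single fine-tuning step with a pretrained seq2seq model; the t5-small checkpoint and the example sentence pair are placeholders. Passing labels makes the model compute the teacher-forced cross-entropy loss internally.

# Sketch: one fine-tuning step with a pretrained seq2seq model
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

inputs = tokenizer("translate English to French: Hello, how are you?",
                   return_tensors="pt")
labels = tokenizer("Bonjour, comment allez-vous ?",
                   return_tensors="pt").input_ids

loss = model(**inputs, labels=labels).loss    # teacher-forced cross-entropy
loss.backward()                               # then step an optimizer over many batches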

See also