Span Query
What it is
Span queries are Lucene query primitives that match documents by the positions of tokens within a field. Each match is a span, a start and end position, and span queries compose into a query algebra that expresses positional constraints more flexibly than phrase queries can.
How it works
Span query types include:
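- SpanTermQuery: matches every occurrence of a single term, at any position
- SpanNearQuery: matches spans from several sub-queries that lie within a specified distance (slop) of one another, optionally in order
- SpanFirstQuery: matches spans that end within the first N positions of a field
- SpanNotQuery: matches spans from one query that do not overlap spans from another
- SpanOrQuery: matches the union of the spans of its sub-queries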
- SpanPayloadCheckQuery: matches spans whose positions carry specified payloads (per-position metadata)
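At query time, each span query reads term positions from the index, enumerates candidate spans document by document, and keeps those that satisfy its positional constraint. Composite queries consume the spans produced by their sub-queries.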
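The evaluation steps can be sketched with a toy in-memory model. The Python below is not Lucene's API: `build_positions`, `span_term`, `span_or`, `span_first`, and `span_not` are hypothetical helpers, spans are assumed to be half-open (start, end) position pairs, and a single token list stands in for the index.

```python
from collections import defaultdict

def build_positions(tokens):
    """Map each term to the sorted list of positions where it occurs."""
    index = defaultdict(list)
    for pos, tok in enumerate(tokens):
        index[tok].append(pos)
    return index

def span_term(index, term):
    # A term occurrence at position p is modelled as the half-open span (p, p + 1).
    return [(p, p + 1) for p in index.get(term, [])]

def span_or(*span_lists):
    # Union of all spans, de-duplicated and sorted by start, then end.
    return sorted(set().union(*span_lists))

def span_first(spans, end):
    # Keep spans that end at or before position `end`.
    return [s for s in spans if s[1] <= end]

def span_not(include, exclude):
    # Keep include-spans that overlap none of the exclude-spans.
    def overlaps(a, b):
        return a[0] < b[1] and b[0] < a[1]
    return [s for s in include if not any(overlaps(s, x) for x in exclude)]

tokens = "open source search and open source retrieval".split()
index = build_positions(tokens)

opens = span_term(index, "open")      # [(0, 1), (4, 5)]
early = span_first(opens, 2)          # [(0, 1)]
either = span_or(span_term(index, "search"),
                 span_term(index, "retrieval"))   # [(2, 3), (6, 7)]

# Hand-written span covering "open source retrieval" (positions 4..6),
# used here only to show exclusion.
phrase = [(4, 7)]
kept = span_not(span_term(index, "source"), phrase)   # [(1, 2)]
```

Each helper takes span lists and returns a span list, which is what lets the operators nest freely. A real engine would stream spans lazily per document instead of materializing lists.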
[illustrate: SpanNearQuery composition for phrase-like matching with gap constraints]
Example
SpanNearQuery([SpanTermQuery("information"),
               SpanTermQuery("retrieval")],
              slop=0, inOrder=true)
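Matches documents in which “information” at position n is immediately followed by “retrieval” at position n+1, the same result as the phrase “information retrieval”. A larger slop would allow intervening tokens.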
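To make the slop and ordering parameters concrete, here is a small Python model of proximity matching. It is an illustrative sketch, not Lucene's implementation: `span_term` and `span_near` are hypothetical helpers, slop is assumed to be the total number of positions between consecutive sub-spans, and the exhaustive enumeration is chosen for clarity rather than efficiency.

```python
from itertools import product

def span_term(tokens, term):
    # Each occurrence at position i becomes the half-open span (i, i + 1).
    return [(i, i + 1) for i, tok in enumerate(tokens) if tok == term]

def span_near(clauses, slop, in_order=True):
    """Pick one span per clause; accept when the gaps between them total <= slop."""
    matches = set()
    for combo in product(*clauses):
        # Ordered matching keeps clause order; unordered sorts spans by position.
        ordered = list(combo) if in_order else sorted(combo)
        gaps = [b[0] - a[1] for a, b in zip(ordered, ordered[1:])]
        # A negative gap means the spans overlap or appear in the wrong order.
        if all(g >= 0 for g in gaps) and sum(gaps) <= slop:
            matches.add((ordered[0][0], ordered[-1][1]))
    return sorted(matches)

tokens = "information retrieval and information about retrieval".split()
info = span_term(tokens, "information")   # [(0, 1), (3, 4)]
retr = span_term(tokens, "retrieval")     # [(1, 2), (5, 6)]

exact = span_near([info, retr], slop=0)   # [(0, 2)]
loose = span_near([info, retr], slop=1)   # [(0, 2), (3, 6)]

# Reversed text matches only when ordering is not required.
reversed_tokens = "retrieval information".split()
clauses = [span_term(reversed_tokens, "information"),
           span_term(reversed_tokens, "retrieval")]
strict = span_near(clauses, slop=0, in_order=True)     # []
relaxed = span_near(clauses, slop=0, in_order=False)   # [(0, 2)]
```

With slop 0 only the adjacent pair at positions 0 and 1 matches. Slop 1 also admits “information about retrieval”, where one token separates the two terms.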
Variants and history
Span queries were introduced in Lucene to provide flexible positional matching beyond what phrase queries allow. Other IR systems, such as Galago, offer similar position-based query algebras. Together these make possible research-level positional queries that standard Boolean retrieval cannot express.
When to use it
Use span queries in research and advanced IR applications that need precise positional constraints. They are more expressive than phrase queries but more complex to construct. They require an index that stores term positions, and they add query evaluation overhead. They are rarely written directly in production code and are used mostly by researchers and advanced users.