Span Query

What it is

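Span queries are a family of Lucene query classes that match documents by the positions of terms within a field. Each span query produces spans, that is, (start, end) ranges of token positions, and span queries can be nested inside one another. Together they form a query algebra that composes positional constraints more flexibly than phrase queries.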

How it works

Span query types include:

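  • SpanTermQuery: matches every occurrence of a single term, producing a span of length one at each position
  • SpanNearQuery: matches spans from several sub-queries that lie within a maximum distance (slop) of one another, optionally required to appear in order
  • SpanFirstQuery: matches spans that end within the first N positions of a field
  • SpanNotQuery: matches spans of one query that do not overlap any span of another query
  • SpanOrQuery: matches the union of the spans of its sub-queries
  • SpanPayloadCheckQuery: matches spans whose payloads (per-position metadata) equal a given list of payloads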
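
The operators listed above can be pictured as functions over lists of (start, end) spans. The following is a minimal Python sketch of that idea, not Lucene's API: the function names term_spans, span_or, span_first and span_not are hypothetical, spans use an exclusive end position, and the document is assumed to be a plain list of tokens.

```python
# Toy model of a span algebra. All names here are illustrative assumptions,
# not Lucene classes. A span is a (start, end) tuple with an exclusive end.

def term_spans(tokens, term):
    """One span of length one for every occurrence of `term`."""
    return [(i, i + 1) for i, tok in enumerate(tokens) if tok == term]

def span_or(*span_lists):
    """Union of the spans of all clauses, sorted by start and then end."""
    return sorted(set().union(*span_lists))

def span_first(spans, end):
    """Keep spans that end at or before position `end`."""
    return [s for s in spans if s[1] <= end]

def span_not(include, exclude):
    """Keep `include` spans that overlap no `exclude` span."""
    def overlaps(a, b):
        return a[0] < b[1] and b[0] < a[1]
    return [s for s in include if not any(overlaps(s, x) for x in exclude)]

tokens = "new york is not new jersey".split()
new = term_spans(tokens, "new")          # [(0, 1), (4, 5)]
first_two = span_first(new, 2)           # [(0, 1)]
either = span_or(term_spans(tokens, "york"),
                 term_spans(tokens, "jersey"))  # [(1, 2), (5, 6)]
# (0, 2) stands for the span a phrase-like match of "new york" would cover.
not_new_york = span_not(new, [(0, 2)])   # [(4, 5)]
```

Because every function takes spans and returns spans, the results can be fed into one another, which is the property that makes the operators composable.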

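At query time, each span query reads term positions from the index, enumerates candidate spans document by document, and keeps only the spans that satisfy its positional constraints. Composite queries consume the spans produced by their sub-queries.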

[illustrate: SpanNearQuery composition for phrase-like matching with gap constraints]
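
As a rough illustration of near matching with gap constraints, the sketch below combines one span per clause and accepts the combination when the number of uncovered positions between the sub-spans is at most the slop. It is a simplified brute-force model under stated assumptions: term_spans and near_spans are hypothetical names, sub-spans are not allowed to overlap, and a real engine would walk postings lazily rather than enumerate every combination.

```python
from itertools import product

# Toy model of near matching. Names are illustrative assumptions, not Lucene
# classes. A span is a (start, end) tuple with an exclusive end.

def term_spans(tokens, term):
    """One span of length one for every occurrence of `term`."""
    return [(i, i + 1) for i, tok in enumerate(tokens) if tok == term]

def near_spans(clause_spans, slop, in_order):
    """Return enclosing spans for combinations whose total gap is <= slop."""
    matches = set()
    for combo in product(*clause_spans):
        # In-order matching keeps clause order; otherwise sort by position.
        ordered = list(combo) if in_order else sorted(combo)
        pairs = list(zip(ordered, ordered[1:]))
        # Each sub-span must start at or after the end of the previous one.
        if any(nxt[0] < prev[1] for prev, nxt in pairs):
            continue
        # Total gap: positions between consecutive sub-spans.
        gap = sum(nxt[0] - prev[1] for prev, nxt in pairs)
        if gap <= slop:
            matches.add((ordered[0][0], ordered[-1][1]))
    return sorted(matches)

tokens = "modern information retrieval and information about retrieval".split()
info = term_spans(tokens, "information")   # [(1, 2), (4, 5)]
retr = term_spans(tokens, "retrieval")     # [(2, 3), (6, 7)]

exact = near_spans([info, retr], slop=0, in_order=True)    # [(1, 3)]
gapped = near_spans([info, retr], slop=1, in_order=True)   # [(1, 3), (4, 7)]
either_order = near_spans([info, retr], slop=1, in_order=False)
# either_order == [(1, 3), (2, 5), (4, 7)]
```

With a slop of 0 and in-order matching only the adjacent pair survives, which is the phrase-like case. Raising the slop to 1 also admits "information about retrieval", and dropping the order requirement additionally admits "retrieval and information".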

Example

SpanNearQuery([SpanTermQuery("information"),
               SpanTermQuery("retrieval")],
              slop=0, inOrder=true)

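Matches documents in which “retrieval” immediately follows “information”, that is, the two terms occupy adjacent positions (n, n+1). With slop=0 and inOrder=true the query behaves like the phrase query “information retrieval”.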

Variants and history

Span queries were introduced in Lucene to provide flexible positional matching beyond phrase queries. Other IR systems, such as Galago, offer comparable position-based query operators. Span queries enable research-level positional queries that standard boolean retrieval cannot express.

When to use it

Use span queries in research and advanced IR applications that need precise positional constraints. They are more expressive than phrase queries but more complex to construct. They require an index that stores term positions, and they add query evaluation overhead. They are rarely written directly in production code and are used mostly by researchers and advanced users.

See also