
ColBERT v2

colbert-ir/colbertv2.0

published Jun 2023 · updated Apr 2024

ColBERT v2 is a retrieval model that uses contextualized late interaction for fast and accurate passage search over large text collections.

status: coming soon
API providers: 0
downloads / mo: 9.3M
license: MIT

specs

Task: Passage Retrieval
Architecture: BERT with late interaction (MaxSim)
Training Data: MS MARCO Passage Ranking
Speed: Tens of milliseconds per query (with PLAID engine)

about this model

ColBERTv2 is a retrieval model that uses contextualized late interaction over BERT to enable fast and accurate search over large text collections. It encodes each passage into a matrix of token-level embeddings and each query into another matrix, then efficiently scores fine-grained similarity between them using scalable vector-similarity (MaxSim) operators. This approach allows ColBERTv2 to surpass the quality of single-vector representation models while scaling efficiently.
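The MaxSim operator described above can be sketched in a few lines of plain Python. This is a toy illustration, not the colbert-ir implementation: the 3-dimensional vectors are hypothetical stand-ins for the per-token embeddings the BERT encoder would produce, and similarity is taken as the dot product of L2-normalized vectors (cosine similarity).

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm so the dot product equals cosine similarity."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def maxsim_score(query_embs, passage_embs):
    """Late-interaction score: for each query token embedding, take its maximum
    similarity over all passage token embeddings, then sum over query tokens."""
    return sum(max(dot(q, d) for d in passage_embs) for q in query_embs)

# Hypothetical toy "token embeddings" (not real model output).
query = [normalize(v) for v in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])]
passage_a = [normalize(v) for v in ([1.0, 0.0, 0.0], [0.0, 0.0, 1.0])]
passage_b = [normalize(v) for v in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])]

print(maxsim_score(query, passage_a))  # 1.0: only the first query token finds a match
print(maxsim_score(query, passage_b))  # 2.0: both query tokens find a match
```

Because each passage's token embeddings are fixed at indexing time, only the query matrix has to be encoded at search time; the scoring itself is a cheap max-then-sum over similarities.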

Key Strengths

  • Fine-grained contextual late interaction for high retrieval accuracy.
  • Pre-computed document representations enable offline indexing and fast query processing.
  • ColBERTv2 reduces the space footprint of late interaction models by 6–10x compared to vanilla late interaction.
  • PLAID engine speeds up vanilla ColBERTv2 search latency by up to 7x on GPU and 45x on CPU while maintaining state-of-the-art retrieval quality.
  • Achieves latency of tens of milliseconds on GPU and tens to a few hundred milliseconds on CPU at large scale (up to 140M passages).
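The space savings come from residual compression: each token embedding is stored as the ID of its nearest centroid plus a coarsely quantized residual (the difference between the embedding and that centroid). The sketch below assumes simple uniform buckets over a fixed range [-0.5, 0.5] for the residual values; that bucketing scheme and all function names are hypothetical simplifications, and the actual ColBERTv2 implementation derives its bucket boundaries from the data.

```python
def nearest_centroid(v, centroids):
    """Index of the centroid with the highest dot product with v."""
    return max(range(len(centroids)),
               key=lambda i: sum(a * b for a, b in zip(v, centroids[i])))

def quantize(x, bits=2, limit=0.5):
    """Map one residual value to one of 2**bits uniform buckets over [-limit, limit].
    Uniform buckets are an assumption made for this sketch."""
    levels = 2 ** bits
    step = 2 * limit / levels
    idx = int((x + limit) // step)
    return min(max(idx, 0), levels - 1)  # clamp values outside the range

def dequantize(idx, bits=2, limit=0.5):
    """Reconstruct a residual value as the midpoint of its bucket."""
    levels = 2 ** bits
    step = 2 * limit / levels
    return -limit + (idx + 0.5) * step

def compress(v, centroids, bits=2):
    """Encode v as (centroid ID, list of per-dimension bucket indices)."""
    c = nearest_centroid(v, centroids)
    residual = [a - b for a, b in zip(v, centroids[c])]
    return c, [quantize(r, bits) for r in residual]

def decompress(code, centroids, bits=2):
    """Approximate the original vector as centroid + dequantized residual."""
    c, buckets = code
    return [b + dequantize(q, bits) for b, q in zip(centroids[c], buckets)]

centroids = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical centroids
code = compress([0.9, 0.3], centroids)
print(code)                          # (0, [1, 3])
print(decompress(code, centroids))   # [0.875, 0.375]
```

For scale, assuming 128-dimensional embeddings stored as 16-bit floats (256 bytes each) versus 2 bits per dimension (32 bytes) plus a 4-byte centroid ID (36 bytes), the ratio is about 7x, within the 6–10x range cited above.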

Benchmark Results

ColBERTv2 has been evaluated on the LoTTE benchmark test set, yielding the following Success@5 scores:

Topic         Search Queries   Forum Queries
Writing       80.1             76.3
Recreation    72.3             70.8
Science       56.7             46.1
Technology    66.1             53.6
Lifestyle     84.7             76.9
Pooled        71.6             63.4
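Success@5 counts a query as solved if at least one relevant passage appears among the top 5 retrieved results, and the table reports the percentage of solved queries. A minimal sketch of the metric follows; the passage IDs and rankings are hypothetical and do not come from LoTTE.

```python
def success_at_k(ranked_ids, relevant_ids, k=5):
    """1.0 if any relevant passage appears among the top-k results, else 0.0."""
    return 1.0 if any(pid in relevant_ids for pid in ranked_ids[:k]) else 0.0

def mean_success_at_k(runs, k=5):
    """Average Success@k over (ranking, relevant set) pairs, as a percentage."""
    scores = [success_at_k(ranked, relevant, k) for ranked, relevant in runs]
    return 100.0 * sum(scores) / len(scores)

# Hypothetical rankings for four queries.
runs = [
    (["p3", "p7", "p1", "p9", "p2", "p8"], {"p1"}),  # relevant at rank 3: hit
    (["p4", "p5", "p6", "p0", "p2", "p1"], {"p1"}),  # relevant at rank 6: miss
    (["p2", "p1"], {"p2", "p5"}),                    # relevant at rank 1: hit
    (["p9", "p8", "p7", "p6", "p5"], {"p5"}),        # relevant at rank 5: hit
]
print(mean_success_at_k(runs))  # 75.0
```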

The original ColBERT model was shown to be competitive with existing BERT-based models and to outperform every non-BERT baseline, while executing two orders of magnitude faster and requiring four orders of magnitude fewer FLOPs per query. The ColBERT architecture has also been adapted for open-domain question answering (ColBERT-QA), attaining state-of-the-art extractive OpenQA performance on Natural Questions, SQuAD, and TriviaQA.


Figure: ColBERT's late interaction architecture, showing query and passage token embeddings and MaxSim scoring.


FAQ

What is ColBERT v2 best used for?

It is best for fast, accurate passage retrieval in search and open-domain QA tasks, especially over large corpora.

How fast is ColBERT v2 at search time?

With the PLAID engine, it achieves latencies of tens of milliseconds on GPU and tens to a few hundred milliseconds on CPU at scale.
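Part of PLAID's speedup comes from scoring candidate passages using only the centroid IDs of their tokens before any residuals are decompressed. The sketch below illustrates only that centroid-interaction idea under simplifying assumptions: it omits PLAID's centroid pruning, inverted-list lookup, and final exact re-scoring, and all function names and data are hypothetical.

```python
def centroid_scores(query_embs, centroids):
    """Similarity of every query token to every centroid (computed once per query)."""
    return [[sum(a * b for a, b in zip(q, c)) for c in centroids] for q in query_embs]

def approx_score(qc, passage_codes):
    """Approximate MaxSim: each passage token is replaced by its centroid's score."""
    return sum(max(row[c] for c in passage_codes) for row in qc)

def candidate_passages(query_embs, centroids, index, keep):
    """Rank passages by centroid-only scores; keep the top `keep` for exact scoring."""
    qc = centroid_scores(query_embs, centroids)
    ranked = sorted(index, key=lambda pid: approx_score(qc, index[pid]), reverse=True)
    return ranked[:keep]

# Hypothetical index: passage ID -> centroid IDs of its tokens.
centroids = [[1.0, 0.0], [0.0, 1.0]]
index = {"a": [0, 1], "b": [1], "c": [1, 1]}
print(candidate_passages([[1.0, 0.0]], centroids, index, keep=2))  # ['a', 'b']
```

Since the query-to-centroid table is small, each passage's approximate score reduces to table lookups, which is much cheaper than full MaxSim over decompressed embeddings.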

What compression does ColBERT v2 use?

It uses residual compression: each token embedding is stored as a centroid ID plus a residual quantized to 2 bits per dimension, reducing storage by 6–10x compared to vanilla late interaction.

How do I call ColBERT v2 via the gigarouter API?

Once the model is live, call the OpenAI-compatible endpoint with your API key, sending the query and any optional parameters as described in the gigarouter documentation.

What input format does ColBERT v2 expect?

At inference time, it accepts plain-text queries; internally, it encodes them into token-level embeddings for late-interaction scoring.
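Before encoding, the ColBERT papers describe prepending a special [Q] marker to each query and padding it with [MASK] tokens up to a fixed length (32 by default), which acts as a soft form of query expansion. The sketch below shows only that formatting step; the whitespace split is a hypothetical stand-in for BERT's WordPiece tokenizer, and BERT's own [CLS]/[SEP] tokens are omitted.

```python
QUERY_MAXLEN = 32  # default query length described in the ColBERT papers

def format_query(text, maxlen=QUERY_MAXLEN):
    """Prepend the [Q] marker, truncate to maxlen, then pad with [MASK] tokens.
    Lowercasing plus whitespace splitting stands in for real WordPiece tokenization."""
    tokens = ["[Q]"] + text.lower().split()
    tokens = tokens[:maxlen]
    return tokens + ["[MASK]"] * (maxlen - len(tokens))

print(format_query("What is late interaction", maxlen=8))
# ['[Q]', 'what', 'is', 'late', 'interaction', '[MASK]', '[MASK]', '[MASK]']
```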

not yet live

We're benchmarking and onboarding ColBERT v2 as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.
