
ModernBERT Base

answerdotai/ModernBERT-base

published Dec 2024 · updated Jan 2025

ModernBERT Base is a fill-mask model that modernizes the BERT architecture with Rotary Positional Embeddings, local-global alternating attention, and a native 8,192 token context length, pre-trained on 2 trillion tokens of English and code data.

status: coming soon
API providers: 0
downloads / mo: 10.3M
license: apache-2.0

specs

Task: Fill-Mask
Architecture: Encoder-only Transformer with RoPE, Local-Global Alternating Attention, GeGLU activations
Parameters: 149 million
Context Length: 8,192 tokens
License: Apache 2.0

about this model

ModernBERT-base is a fill-mask model, a modernized bidirectional encoder-only Transformer (BERT-style) pre-trained on 2 trillion tokens of English and code data with a native context length of up to 8,192 tokens. It leverages recent architectural improvements including Rotary Positional Embeddings (RoPE) for long-context support, local-global alternating attention for efficiency on long inputs, and unpadding with Flash Attention for efficient inference.
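
To make the local-global alternating pattern concrete, the sketch below decides, per layer, whether one token may attend to another. The schedule (every third layer global) and the 128-token window are illustrative assumptions rather than values stated on this page, and the function names are hypothetical.

```python
def can_attend(layer: int, i: int, j: int,
               global_every: int = 3, window: int = 128) -> bool:
    """Return True if query position i may attend to key position j in `layer`.

    Assumed schedule: layers whose index is a multiple of `global_every`
    use full bidirectional attention; all other layers restrict attention
    to a sliding window centered on the query.
    """
    if layer % global_every == 0:
        return True  # global layer: every token sees every token
    return abs(i - j) <= window // 2  # local layer: half the window per side


def attended_pairs(layer: int, seq_len: int, **kw) -> int:
    """Count allowed (query, key) pairs in one layer, as a rough cost proxy."""
    return sum(can_attend(layer, i, j, **kw)
               for i in range(seq_len) for j in range(seq_len))
```

Under these assumptions, a local layer compares each query with at most 129 keys regardless of input length, while a global layer compares it with every key, which is where the savings on long inputs would come from.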

Key Strengths

ModernBERT-base (149M parameters) achieves strong results across natural language understanding, general retrieval, long-context retrieval, and code retrieval tasks. On GLUE, it scores 88.4, surpassing similarly-sized encoder models. For general retrieval (BEIR, DPR setting), it achieves 41.6, and for code retrieval, it sets new state-of-the-art results on CodeSearchNet (56.4) and StackQA (73.6). In multi-vector retrieval (ColBERT setting) on long-context out-of-domain data (MLDR_OOD), it reaches 80.2, significantly outperforming prior models.

Evaluation Results

Task | Metric | ModernBERT-base
--- | --- | ---
GLUE | NLU | 88.4
BEIR (DPR) | Retrieval | 41.6
MLDR_ID (DPR) | Long-context retrieval | 44.0
CodeSearchNet | Code retrieval | 56.4
StackQA | Code retrieval | 73.6
BEIR (ColBERT) | Multi-vector retrieval | 51.3
MLDR_OOD (ColBERT) | Long-context multi-vector retrieval | 80.2

Architecture and Training

The model uses a Pre-Norm Transformer with GeGLU activations. It was pre-trained at sequence lengths of up to 1,024 tokens and then extended to 8,192 tokens, with training run on 8x H100 GPUs. Because the training data is primarily English and code, performance may be lower for other languages. The model is released under the Apache 2.0 license.
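
As a rough illustration of the GeGLU activation named above, the sketch below applies it to a vector that has already been projected to twice the hidden width. Which half is passed through GELU and which half acts as the linear gate is a convention assumed here, and the function names are hypothetical; the model's real layers operate on tensors, not Python lists.

```python
import math


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def geglu(projected: list[float]) -> list[float]:
    """Apply GeGLU to an already-projected vector of even length 2*d.

    The first half goes through GELU and is multiplied elementwise by
    the second half (the linear gate), giving a vector of length d.
    """
    if len(projected) % 2:
        raise ValueError("projected vector must have even length")
    d = len(projected) // 2
    activated, gate = projected[:d], projected[d:]
    return [gelu(a) * g for a, g in zip(activated, gate)]
```

The gating means each output unit is a product of two learned projections, which is the difference from a plain GELU feed-forward layer.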

FAQ

What is ModernBERT Base best used for?

It excels at retrieval, classification, and semantic search, especially on long documents and code, due to its 8,192 token context and training on text and code.

How does ModernBERT Base compare to BERT-base in size and speed?

ModernBERT Base has 149M parameters, similar in size to BERT-base, but it is faster and more memory-efficient thanks to Flash Attention, unpadding, and modern architectural choices.

What is the license for ModernBERT Base?

It is released under the Apache 2.0 license.

What is the input format to use with this model?

Input is a text string containing a [MASK] token. Token type IDs are not used. Use the fill-mask pipeline or AutoModelForMaskedLM from Hugging Face transformers v4.48.0+.
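
A minimal local sketch of that input format, assuming transformers v4.48.0 or later is installed and the weights can be downloaded from the Hugging Face Hub. The helper names are hypothetical, and requiring exactly one [MASK] is a simplification made here to keep the return shape uniform, not a limit of the model.

```python
MODEL_ID = "answerdotai/ModernBERT-base"
MASK_TOKEN = "[MASK]"


def check_single_mask(text: str) -> str:
    """Return text unchanged if it contains exactly one [MASK], else raise."""
    n = text.count(MASK_TOKEN)
    if n != 1:
        raise ValueError(f"expected exactly one {MASK_TOKEN}, found {n}")
    return text


def predict_mask(text: str, top_k: int = 5) -> list[tuple[str, float]]:
    """Return (token, score) candidates for the masked position.

    transformers is imported lazily so the check above works without it.
    The first call downloads the model weights.
    """
    from transformers import pipeline  # needs transformers >= 4.48.0

    fill = pipeline("fill-mask", model=MODEL_ID)
    results = fill(check_single_mask(text), top_k=top_k)
    return [(r["token_str"], r["score"]) for r in results]
```

Calling `predict_mask("The capital of France is [MASK].")` would return the top five candidate tokens with their scores, highest first.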

How can I call ModernBERT Base via the gigarouter API?

Once the model is live, call the gigarouter OpenAI-compatible endpoint with your API key, set the model to "answerdotai/ModernBERT-base", and send a prompt that contains [MASK].
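
The sketch below shows what such a request might look like, built but not sent. The base URL and route are placeholders, since this page does not state them, and the `model`/`prompt` body and Bearer header follow the general OpenAI-style convention rather than a documented gigarouter schema; treat every name here as an assumption.

```python
import json
import urllib.request

# Hypothetical placeholders: the real base URL and route are not given here.
BASE_URL = "https://api.example.invalid/v1"
ROUTE = "/completions"


def build_request(prompt: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for a [MASK] prompt."""
    if "[MASK]" not in prompt:
        raise ValueError("prompt must contain a [MASK] token")
    body = json.dumps({"model": "answerdotai/ModernBERT-base", "prompt": prompt})
    return urllib.request.Request(
        BASE_URL + ROUTE,
        data=body.encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Once the real endpoint is published, the returned object could be passed to `urllib.request.urlopen`; the response format is not described on this page, so parsing is left out.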

not yet live

We're benchmarking and onboarding ModernBERT Base as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.