ModernBERT Base

answerdotai/ModernBERT-base

published Dec 2024 · updated Jan 2025

ModernBERT Base is a fill-mask model that modernizes the BERT architecture with Rotary Positional Embeddings, local-global alternating attention, and a native 8,192 token context length, pre-trained on 2 trillion tokens of English and code data.

status

coming soon

API providers

downloads / mo

10.3M

license

apache-2.0

specs

Task	Fill-Mask
Architecture	Encoder-only Transformer with RoPE, Local-Global Alternating Attention, GeGLU activations
Parameters	149 million
Context Length	8,192 tokens
License	Apache 2.0

about this model

ModernBERT-base is a fill-mask model, a modernized bidirectional encoder-only Transformer (BERT-style) pre-trained on 2 trillion tokens of English and code data with a native context length of up to 8,192 tokens. It leverages recent architectural improvements including Rotary Positional Embeddings (RoPE) for long-context support, local-global alternating attention for efficiency on long inputs, and unpadding with Flash Attention for efficient inference.

Key Strengths

ModernBERT-base (149M parameters) achieves strong results across natural language understanding, general retrieval, long-context retrieval, and code retrieval tasks. On GLUE, it scores 88.4, surpassing similarly-sized encoder models. For general retrieval (BEIR, DPR setting), it achieves 41.6, and for code retrieval, it sets new state-of-the-art results on CodeSearchNet (56.4) and StackQA (73.6). In multi-vector retrieval (ColBERT setting) on long-context out-of-domain data (MLDR_OOD), it reaches 80.2, significantly outperforming prior models.

Evaluation Results

Task	Metric	ModernBERT-base
GLUE	NLU	88.4
BEIR (DPR)	Retrieval	41.6
MLDR_ID (DPR)	Long-context retrieval	44.0
CodeSearchNet	Code retrieval	56.4
StackQA	Code retrieval	73.6
BEIR (ColBERT)	Multi-vector retrieval	51.3
MLDR_OOD (ColBERT)	Long-context multi-vector retrieval	80.2

Architecture and Training

The model uses a Pre-Norm Transformer with GeGLU activations, was pre-trained up to 1,024 tokens then extended to 8,192 tokens, and trained on 8x H100 GPUs. Training data is primarily English and code, so performance may be lower for other languages. The model is released under Apache 2.0 license.

best for

·Semantic search and retrieval over long documents
·Text classification and natural language understanding
·Code retrieval and hybrid text-code search
·Fill-mask for masked language modeling tasks

FAQ

What is ModernBERT Base best used for?

It excels at retrieval, classification, and semantic search, especially on long documents and code, due to its 8,192 token context and training on text and code.

How does ModernBERT Base compare to BERT-base in size and speed?

ModernBERT Base has 149M parameters (similar to BERT-base) but is more memory and speed efficient thanks to Flash Attention, unpadding, and modern architectural choices.

What is the license for ModernBERT Base?

It is released under the Apache 2.0 license.

What is the input format to use with this model?

Input is a text string containing a [MASK] token. Token type IDs are not used. Use the fill-mask pipeline or AutoModelForMaskedLM from Hugging Face transformers v4.48.0+.

How can I call ModernBERT Base via the gigarouter API?

Use the gigarouter OpenAI-compatible endpoint with your API key, specifying the model as "answerdotai/ModernBERT-base" and sending a prompt containing [MASK].

not yet live

We're benchmarking and onboarding ModernBERT Base as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.