Nemotron Labs TwoTower 30B A3B

nvidia/Nemotron-Labs-TwoTower-30B-A3B-Base-BF16

published Apr 2026 · updated Jul 2026

Nemotron Labs TwoTower 30B A3B is a text-generation model that uses block-wise autoregressive diffusion, generating text by iteratively denoising blocks of tokens in parallel.

status

coming soon

API providers

downloads / mo

10.3K

license

other

specs

Task	Text Generation
Architecture	Two-Tower Block-Diffusion over Mamba2-Transformer Hybrid Mixture of Experts (MoE)
Parameters	~60B total (30B AR/context tower + 30B diffusion/denoiser tower)
License	NVIDIA Nemotron Open Model License

about this model

Nemotron-Labs-TwoTower-30B-A3B-Base-BF16 is a block-wise autoregressive diffusion language model that generates text by iteratively denoising blocks of tokens in parallel rather than one token at a time. It is built on the NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16 backbone and uses two towers: a frozen causal autoregressive context tower that processes the prompt and committed tokens, and a trainable diffusion/denoiser tower that generates blocks of tokens via mask diffusion with bidirectional in-block attention, layer-aligned cross-attention to the context tower, and context-seeded Mamba states. Both towers are copies of the same 52-layer hybrid Mamba-2/attention/MoE backbone; only the denoiser tower is trained.

Key Capabilities and Architecture

Generation modes: Mask diffusion (block-wise iterative denoising with confidence-based unmasking), Mock-AR (two-tower autoregressive), and standard AR (single tower).
Parameters: ~60B total (30B per tower), with ~3B active parameters per token; each token is routed to 6 of 128 routable experts, plus 2 shared experts.
Training: The denoiser tower was trained on ~2.1T tokens with a masked-diffusion objective; the backbone was pretrained on 25T tokens. Precision: BF16. Software: Megatron-LM.
Context length: Up to 128K tokens.
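
As a rough back-of-the-envelope check on what these figures imply for deployment, the sketch below estimates the weight memory of both towers in BF16. It assumes exactly 60B parameters and 2 bytes per parameter, and it ignores activations, attention and Mamba state caches, and framework overhead, so real requirements will be higher.

```python
# Rough weight-memory estimate; the inputs are approximations taken from this card.
TOTAL_PARAMS = 60e9      # ~60B total (two ~30B towers)
BYTES_PER_PARAM = 2      # BF16 stores each parameter in 2 bytes

weights_gb = TOTAL_PARAMS * BYTES_PER_PARAM / 1e9   # decimal gigabytes
per_tower_gb = weights_gb / 2

print(f"weights: ~{weights_gb:.0f} GB total, ~{per_tower_gb:.0f} GB per tower")
```

Under these assumptions the weights alone come to about 120 GB, which is more than a single 80 GB accelerator can hold.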

Benchmark Performance

At the default operating point (confidence threshold γ=0.8, block size S=16, BF16 on 2×H100 GPUs), the model retains 98.7% of the autoregressive baseline's aggregate benchmark quality and achieves 2.42× the baseline's wall-clock generation throughput.

Task	AR Baseline	TwoTower (Diffusion)
MMLU (5-shot, acc)	78.56	78.24
MMLU-Pro (5-shot, CoT EM)	62.59	60.93
ARC-Challenge (25-shot, acc_norm)	91.72	92.66
WinoGrande (5-shot, acc)	76.09	76.09
RACE (0-shot, acc)	88.90	88.90
HumanEval (0-shot)	79.27	75.58
MBPP-Sanitized (3-shot)	74.71	74.28
GSM8K (8-shot, acc)	92.49	90.14
MATH-500 (4-shot)	84.40	80.60
MMLU Global Lite (5-shot, avg acc)	73.97	73.94
MGSM (8-shot, avg acc)	80.80	80.40
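
The card does not say how the aggregate retention figure is computed. As a sanity check, the sketch below assumes it is the ratio of the unweighted mean scores across the eleven tasks in the table; that aggregation rule is an assumption, not a documented method. It also reports per-task retention, which shows where the diffusion mode gives up the most quality.

```python
# (task, AR baseline, TwoTower diffusion) copied from the table above.
SCORES = [
    ("MMLU", 78.56, 78.24),
    ("MMLU-Pro", 62.59, 60.93),
    ("ARC-Challenge", 91.72, 92.66),
    ("WinoGrande", 76.09, 76.09),
    ("RACE", 88.90, 88.90),
    ("HumanEval", 79.27, 75.58),
    ("MBPP-Sanitized", 74.71, 74.28),
    ("GSM8K", 92.49, 90.14),
    ("MATH-500", 84.40, 80.60),
    ("MMLU Global Lite", 73.97, 73.94),
    ("MGSM", 80.80, 80.40),
]


def aggregate_retention(scores):
    """Percent of the AR baseline's mean score kept by the diffusion mode.

    Assumed aggregation: ratio of unweighted means over all tasks.
    """
    ar_mean = sum(ar for _, ar, _ in scores) / len(scores)
    tt_mean = sum(tt for _, _, tt in scores) / len(scores)
    return 100.0 * (tt_mean / ar_mean)


# Per-task retention, in percent of the AR baseline score.
per_task = {name: 100.0 * (tt / ar) for name, ar, tt in SCORES}
aggregate = aggregate_retention(SCORES)

for name, pct in sorted(per_task.items(), key=lambda item: item[1]):
    print(f"{name:<18} {pct:6.2f}%")
print(f"{'aggregate':<18} {aggregate:6.2f}%")
```

With the rounded values from the table, this aggregation rounds to the 98.7% quoted above. The largest per-task drops are on HumanEval and MATH-500.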

How Mask Diffusion Works

Generation is block-wise autoregressive: the context tower encodes the prompt once, then the denoiser fills one block of block_size positions at a time. Each new block starts with every position set to [MASK]. The denoiser then runs for steps_per_block iterations; in each iteration it computes the diffusion timestep, runs with bidirectional in-block attention and cross-attention to the context cache, and commits the high-confidence positions. Multiple tokens may be committed per step, which is where the speedup over one-token-at-a-time decoding comes from.
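
The loop described above can be sketched for a single block as follows. This is a toy illustration, not the model's implementation: `toy_denoiser` is a hypothetical stand-in that returns random confidences, the timestep formula (fraction of positions still masked) is an assumption, and so are the two fallback rules that guarantee every position is eventually committed. Only the names block_size and steps_per_block and the idea of a confidence threshold come from this card.

```python
import random

MASK = None  # stand-in for the [MASK] token


def toy_denoiser(block, timestep, rng):
    """Hypothetical stand-in for the denoiser tower.

    Returns {position: (token, confidence)} for every masked position.
    The real model would also attend to the context tower's cache.
    """
    return {
        i: (f"tok{i}", rng.random())
        for i, tok in enumerate(block)
        if tok is MASK
    }


def generate_block(block_size=16, steps_per_block=16, threshold=0.8, seed=0):
    rng = random.Random(seed)
    block = [MASK] * block_size              # every position starts masked
    steps_used = 0
    for step in range(steps_per_block):
        masked = [i for i, tok in enumerate(block) if tok is MASK]
        if not masked:
            break                            # block finished early
        steps_used += 1
        timestep = len(masked) / block_size  # assumed: fraction still masked
        guesses = toy_denoiser(block, timestep, rng)
        # Commit every position whose confidence clears the threshold.
        commit = [i for i in masked if guesses[i][1] >= threshold]
        if not commit:
            # Assumed fallback: commit the single most confident position.
            commit = [max(masked, key=lambda i: guesses[i][1])]
        if step == steps_per_block - 1:
            commit = masked                  # assumed: last step commits the rest
        for i in commit:
            block[i] = guesses[i][0]
    return block, steps_used


block, steps_used = generate_block()
print(f"filled {len(block)} positions in {steps_used} denoising steps")
```

Whenever more than one position clears the threshold in a step, the block finishes in fewer denoiser passes than it has tokens. Lowering the threshold commits more tokens per step at the cost of accepting less confident predictions.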

[Figure: Category-level comparison between the Nemotron-3-Nano-30B-A3B autoregressive baseline and Nemotron-Labs-TwoTower Diffusion.]

The model is governed by the NVIDIA Nemotron Open Model License Agreement. It was developed by NVIDIA Corporation (September 2025 – April 2026), with a pre-training data cutoff of June 25, 2025.

best for

·Fast parallel text generation, with up to 2.42× the throughput of its autoregressive baseline
·Code and math reasoning tasks (HumanEval, GSM8K, MATH-500)
·Long-context workloads, with a context length of up to 128K tokens

FAQ

What is the architecture of Nemotron Labs TwoTower?

It uses two towers: a frozen autoregressive context tower and a trained diffusion/denoiser tower, both copies of a 52-layer hybrid Mamba-2/attention/MoE backbone. The denoiser generates blocks of tokens via mask diffusion with bidirectional in-block attention and cross-attention to the context tower.

How does it compare to standard autoregressive models?

At the default operating point, it retains 98.7% of the autoregressive baseline's aggregate benchmark quality while delivering 2.42× the wall-clock generation throughput, because it commits multiple tokens per step.

What license governs use of this model?

It is governed by the NVIDIA Nemotron Open Model License Agreement, which permits commercial use.

What are the input and output formats?

Both input and output are text strings (one-dimensional token sequences), with a maximum context length of 128K tokens.

How do I call this model via the gigarouter API?

The model is not yet live on gigarouter. Once it is, use the OpenAI-compatible gigarouter endpoint with your API key, specifying the model name and your prompt. Refer to the gigarouter documentation for the exact request format.
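
As a minimal sketch of what such a call might look like, the example below builds an OpenAI-style completions request using only the Python standard library. The base URL is a placeholder, and the `/completions` path, the request fields, and the use of the repository name as the model identifier are all assumptions based on the common OpenAI-compatible convention, not on gigarouter documentation. The request is built but not sent.

```python
import json
import urllib.request

# Hypothetical values: replace them with the real ones from the gigarouter docs.
BASE_URL = "https://api.gigarouter.example/v1"   # placeholder, not a real endpoint
API_KEY = "YOUR_API_KEY"
MODEL = "nvidia/Nemotron-Labs-TwoTower-30B-A3B-Base-BF16"  # assumed identifier


def build_completion_request(prompt, max_tokens=128):
    """Build (but do not send) an OpenAI-style text completions request."""
    payload = {"model": MODEL, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{BASE_URL}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_completion_request("The capital of France is")

# To send it once the model is live and the values above are real:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response)["choices"][0]["text"])
```

A plain completions request is used here rather than a chat request because this is a base model; whether the hosted API exposes it that way is also an assumption.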
