skip to content
gigarouter gigarouter
models / text generation · coming soon

Ornith 1.0 9B MTP

protoLabsAI/Ornith-1.0-9B-MTP-GGUF

published Jun 2026 · updated Jul 2026

Ornith 1.0 9B MTP is a text-generation model that uses a multi-token prediction (MTP) head for lossless speculative decoding, achieving up to 1.73x speedup on llama.cpp.

status
coming soon
API providers
0
downloads / mo
16.8K
license
mit

specs

TaskText Generation
ArchitectureQwen3.5-9B hybrid (linear-attention + full-attention) with MTP head
Parameters9B
LicenseMIT

about this model

Ornith-1.0-9B-MTP-GGUF is a text-generation model that bundles a 9B-parameter Qwen3.5-9B hybrid (linear-attention + full-attention) fine-tune with a KL-distilled multi-token prediction (MTP) draft head, enabling lossless self-speculative decoding in llama.cpp without a separate draft model.

Speculative Decoding Performance

On a single RTX A6000 (ctx 8192, flash-attn, greedy), the model achieves the following single-stream decode speedups with the MTP head:

ConfigDecode tok/sAcceptanceSpeedup
Base (no MTP)71.01.00×
MTP n-max 2118.30.7661.67×
MTP n-max 3122.60.6511.73×
MTP n-max 4120.80.5651.70×

Acceptance is quant-stable: across Q4_K_M and Q8_0 at n-max 3, acceptance remains ~0.65. The relative speedup grows with precision (Q8_0’s bandwidth-bound baseline gains 1.73× vs Q4_K_M’s 1.38×). The KL-distilled head achieves 0.765 acceptance on coding prompts and 0.762 on general corpus (vLLM reference: 0.762).

NVFP4 Quantization for Blackwell

A dedicated NVFP4 build (6.6 GB) on Blackwell hardware (RTX 50xx / PRO 6000) delivers the fastest inference in the repository:

FileSizeNo-MTP tok/s+MTP tok/sPlatform
Q4_K_M5.8 GB104.6153.4Ampere A6000
NVFP4-MTP6.6 GB70.784.8Ampere A6000
Q4_K_M5.8 GB205.1239 (216–252)Blackwell
NVFP4-MTP6.6 GB201.5306 (287–330)Blackwell

On Blackwell, NVFP4+MTP is ~28% faster than Q4_K_M+MTP due to near-zero verify-cost overhead on tensor-core GEMMs. Draft acceptance is near-equal between the two files (0.52 vs 0.49). On Ampere and older GPUs, Q4_K_M remains smaller and faster.

Quality and Distribution-Losslessness

Speculative decoding is distribution-lossless: every drafted token is verified against the target, preserving the output distribution. (Small non-bitwise differences can occur at greedy sampling due to floating-point reduction order — both outputs are equally valid.) Quantizing the target to NVFP4 does not degrade the draft head: acceptance remains 0.76 on real text vs 0.762 on BF16. The NVFP4 quant also scores a pass rate of 0.96 on function_call (baseline BF16 0.93) and maintains coherence through 60K context.

best for

FAQ

What speedup does the MTP head provide?

Up to 1.73x over base decode on llama.cpp, measured with Q8_0 and n-max 3. On Q4_K_M speedup is ~1.38x.

How is this different from using a separate draft model?

The MTP head is baked into the single GGUF file, so no separate draft model is needed. It performs lossless self-speculative decoding.

What are the license terms?

MIT license — both the base model and MTP head are MIT. The GGUF builds are a derivative, also MIT.

How can I call this model via the gigarouter API?

Use the OpenAI-compatible endpoint with your API key. The model name is the name shown on gigarouter, pass it as the model parameter.

What quantized formats are available?

GGUF quants: BF16, Q8_0, Q6_K, Q5_K_M, Q4_K_M, IQ4_XS, IQ3_M, IQ2_M, and an NVFP4 variant for Blackwell GPUs.

not yet live

We're benchmarking and onboarding Ornith 1.0 9B MTP as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related text generation models

compare all →