skip to content
gigarouter gigarouter
models / text generation · coming soon

Ornith 1.0 35B AEON Ultimate Uncensored

AEON-7/Ornith-1.0-35B-AEON-Ultimate-Uncensored-NVFP4

published Jun 2026 · updated Jun 2026

Ornith 1.0 35B AEON Ultimate Uncensored is an uncensored 4-bit NVFP4 quantized Mixture-of-Experts text-generation model optimized for agentic coding and reasoning.

status
coming soon
API providers
0
downloads / mo
8.7K
license
mit

specs

TaskText Generation
ArchitectureMixture of Experts (MoE) with GatedDeltaNet and attention
Parameters35B total
QuantizationNVFP4 (4-bit weight-only) / W4A16
LicenseCustom (see model card for terms)

about this model

AEON-7/Ornith-1.0-35B-AEON-Ultimate-Uncensored-NVFP4 is a text-generation model that provides a near-lossless 4-bit quantized version of the state-of-the-art agentic-coding MoE, with refusals removed. It uses weight-only NVFP4 (W4A16) to reduce size to 23.7 GB while preserving attention and SSM components in BF16. The model is validated to be coherent and identical in coding capability to its BF16 parent.

Validation

MetricBF16 parentThis NVFP4
Agentic/coding pass@1 (18-task probe)0.8330.833 — identical
Refusals (diverse harmful prompts)00 — abliteration fully survived FP4
Coherence (benign + harmful, long gen)clean0/10 degenerate

Performance on DGX Spark (GB10)

Plain decode (NVFP4, no speculative decoding):

Concurrencyc=1c=8c=16c=32
aggregate tok/s39221376539

With DFlash speculative decoding (optimal n=6): single-stream decode reaches 73.8 tok/s (1.89× over plain).

Combined gains vs a naive BF16 deploy:

WorkloadBF16 · stock vLLMNVFP4 · stock vLLMNVFP4 + DFlash · AEON
Coding30.8 tok/s · 237 ms38.5 · 70 ms77.1 · 94 ms
Reasoning30.6 · 247 ms38.4 · 77 ms107.0 · 93 ms
Math30.5 · 221 ms38.3 · 72 ms119.0 · 88 ms
Prose30.4 · 193 ms38.3 · 69 ms70.3 · 91 ms
Avg decode30.638.493.3
Prefill3,517 tok/s5,2039,661
Decode throughput by category Time to first token by category

Prefix Caching

Repeated context benefits from prefix caching; cache-hit prefill remains ~100–200 ms while cold prefill grows with context length. Validated coherent and composable with DFlash.

ContextNo cache (re-prefill)Cache hitSpeedup
1.2k422 ms86 ms4.9×
4.9k869 ms154 ms5.6×
9.7k1,716 ms103 ms16.6×
14.6k2,588 ms202 ms12.8×
Prefix caching scaling

This is an uncensored model: safety refusals have been removed, placing full responsibility for appropriate use on the user. It requires a Blackwell GPU (B200 or GB10) and is hosted as a managed API via gigarouter.

best for

FAQ

What hardware is required to run this model?

A Blackwell GPU (B200, B100, or GB10) is required because the NVFP4 quantization relies on Blackwell's hardware support.

Is the model truly uncensored?

Yes, safety refusals have been completely removed through abliteration, and the 4-bit quantization preserved that behavior. It will generate any content instructed, without internal refusal.

How much faster is NVFP4 compared to the BF16 version?

On a DGX Spark, BF16 yields ~30.6 tok/s decode; NVFP4 stock gives ~38.4 tok/s (1.25x) and ~72 ms TTFT (3.2x faster than BF16's ~230 ms).

Can I use speculative decoding to speed up inference?

Yes, DFlash speculative decoding with a Qwen3.6-35B-A3B drafter achieves up to 1.89x single-stream decode (73.8 tok/s) at n=6 speculative tokens, but requires BF16 KV and a capped max-num-seqs of ~16 for stability.

How do I call this model via the gigarouter API?

Use the gigarouter OpenAI-compatible endpoint with your own API key, specifying the model name as "AEON-7/Ornith-1.0-35B-AEON-Ultimate-Uncensored-NVFP4" or a configured alias.

not yet live

We're benchmarking and onboarding Ornith 1.0 35B AEON Ultimate Uncensored as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related text generation models

compare all →