skip to content
gigarouter gigarouter
models / text generation · coming soon

Qwen3.6 27B NVFP4

nvidia/Qwen3.6-27B-NVFP4

published Jun 2026 · updated Jun 2026

Qwen3.6 27B NVFP4 is a text-generation model that is a quantized version of Alibaba's Qwen3.6-27B, optimized for efficient deployment with vLLM on NVIDIA GPUs.

status
coming soon
API providers
0
downloads / mo
94.5K
license
apache-2.0

specs

TaskText Generation
ArchitectureHybrid Attention (Gated DeltaNet and Gated Attention)
Parameters27B
LicenseApache 2.0
Input TypesText, Image, Video
Context LengthUp to 262K tokens

about this model

nvidia/Qwen3.6-27B-NVFP4 is a text-generation model derived from Alibaba's Qwen3.6-27B, quantized to NVFP4 precision using NVIDIA Model Optimizer. It is a 27B-parameter auto-regressive language model with a hybrid attention architecture (Gated DeltaNet and Gated Attention) and supports a context length of up to 262,144 tokens. The model accepts text, image, and video inputs and generates text output. Designed for commercial or non-commercial use under the Apache 2.0 license.

Quantization and Performance

Weights and activations of linear operators within transformer blocks are quantized to the NVFP4 data type, reducing bits per parameter from 16 to 4. This yields approximately 2.5x reduction in disk size and GPU memory requirements while preserving accuracy. The model is optimized for NVIDIA Hopper and Blackwell GPUs and is served through the vLLM inference engine.

Accuracy benchmarks demonstrate that NVFP4 quantization introduces negligible degradation compared to the FP8 baseline, as shown below (benchmarked with temperature=1.0, top_p=0.95, max_num_tokens=81920 unless otherwise noted):

PrecisionMMLU ProGPQA DiamondHLEτ²-Bench TelecomMMMU ProSciCodeAIME 2025AA-LCRIFBench
FP886.186.021.795.274.644.893.168.865.1
NVFP486.385.521.895.474.344.592.768.365.5

Baseline: Qwen3.6-27B-FP8. SciCode uses temperature=0.6; τ²-Bench Telecom uses temperature=0.0, top_p=1.0.

Deployment on Gigarouter

Gigarouter hosts this model as a managed, OpenAI-compatible API. Users send requests directly without managing infrastructure or performing local quantization. The model's optimizations enable efficient inference with minimal latency and memory footprint.

best for

FAQ

What is this model best for?

It is best for AI agent systems, chatbots, RAG pipelines, and long-context reasoning tasks requiring up to 262K tokens of input.

How does this quantized version compare to the original FP8 version?

The NVFP4 quantization reduces memory and disk size by ~2.5x while maintaining accuracy within 0.5–1% on most benchmarks, as shown in the evaluation table.

What are the input and output formats?

Inputs can be text, image (RGB), or video (MP4/WebM). Output is text. The model accepts a string and returns a string via the API.

What license is it under?

The model is governed by the Apache 2.0 license, allowing commercial and non-commercial use.

How do I call this model via the API?

Use the gigarouter OpenAI-compatible endpoint with your API key. Set the model name to "nvidia/Qwen3.6-27B-NVFP4" in your requests.

not yet live

We're benchmarking and onboarding Qwen3.6 27B NVFP4 as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related text generation models

compare all →