Qwen3.6 27B NVFP4
nvidia/Qwen3.6-27B-NVFP4
published Jun 2026 · updated Jun 2026
Qwen3.6 27B NVFP4 is a text-generation model that is a quantized version of Alibaba's Qwen3.6-27B, optimized for efficient deployment with vLLM on NVIDIA GPUs.
specs
| Task | Text Generation |
| Architecture | Hybrid Attention (Gated DeltaNet and Gated Attention) |
| Parameters | 27B |
| License | Apache 2.0 |
| Input Types | Text, Image, Video |
| Context Length | Up to 262K tokens |
about this model
nvidia/Qwen3.6-27B-NVFP4 is a text-generation model derived from Alibaba's Qwen3.6-27B, quantized to NVFP4 precision using NVIDIA Model Optimizer. It is a 27B-parameter auto-regressive language model with a hybrid attention architecture (Gated DeltaNet and Gated Attention) and supports a context length of up to 262,144 tokens. The model accepts text, image, and video inputs and generates text output. Designed for commercial or non-commercial use under the Apache 2.0 license.
Quantization and Performance
Weights and activations of linear operators within transformer blocks are quantized to the NVFP4 data type, reducing bits per parameter from 16 to 4. This yields approximately 2.5x reduction in disk size and GPU memory requirements while preserving accuracy. The model is optimized for NVIDIA Hopper and Blackwell GPUs and is served through the vLLM inference engine.
Accuracy benchmarks demonstrate that NVFP4 quantization introduces negligible degradation compared to the FP8 baseline, as shown below (benchmarked with temperature=1.0, top_p=0.95, max_num_tokens=81920 unless otherwise noted):
| Precision | MMLU Pro | GPQA Diamond | HLE | τ²-Bench Telecom | MMMU Pro | SciCode | AIME 2025 | AA-LCR | IFBench |
|---|---|---|---|---|---|---|---|---|---|
| FP8 | 86.1 | 86.0 | 21.7 | 95.2 | 74.6 | 44.8 | 93.1 | 68.8 | 65.1 |
| NVFP4 | 86.3 | 85.5 | 21.8 | 95.4 | 74.3 | 44.5 | 92.7 | 68.3 | 65.5 |
Baseline: Qwen3.6-27B-FP8. SciCode uses temperature=0.6; τ²-Bench Telecom uses temperature=0.0, top_p=1.0.
Deployment on Gigarouter
Gigarouter hosts this model as a managed, OpenAI-compatible API. Users send requests directly without managing infrastructure or performing local quantization. The model's optimizations enable efficient inference with minimal latency and memory footprint.
best for
- ·AI agent systems with tool-use capabilities
- ·Chatbots and conversational AI
- ·Retrieval-Augmented Generation (RAG) pipelines
- ·Long-context document analysis and reasoning
FAQ
It is best for AI agent systems, chatbots, RAG pipelines, and long-context reasoning tasks requiring up to 262K tokens of input.
The NVFP4 quantization reduces memory and disk size by ~2.5x while maintaining accuracy within 0.5–1% on most benchmarks, as shown in the evaluation table.
Inputs can be text, image (RGB), or video (MP4/WebM). Output is text. The model accepts a string and returns a string via the API.
The model is governed by the Apache 2.0 license, allowing commercial and non-commercial use.
Use the gigarouter OpenAI-compatible endpoint with your API key. Set the model name to "nvidia/Qwen3.6-27B-NVFP4" in your requests.
We're benchmarking and onboarding Qwen3.6 27B NVFP4 as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.