Qwen3.6 27B NVFP4

nvidia/Qwen3.6-27B-NVFP4

published Jun 2026 · updated Jun 2026

Qwen3.6 27B NVFP4 is a text-generation model that is a quantized version of Alibaba's Qwen3.6-27B, optimized for efficient deployment with vLLM on NVIDIA GPUs.

status

coming soon

API providers

downloads / mo

94.5K

license

apache-2.0

specs

Task	Text Generation
Architecture	Hybrid Attention (Gated DeltaNet and Gated Attention)
Parameters	27B
License	Apache 2.0
Input Types	Text, Image, Video
Context Length	Up to 262K tokens

about this model

nvidia/Qwen3.6-27B-NVFP4 is a text-generation model derived from Alibaba's Qwen3.6-27B, quantized to NVFP4 precision using NVIDIA Model Optimizer. It is a 27B-parameter auto-regressive language model with a hybrid attention architecture (Gated DeltaNet and Gated Attention) and supports a context length of up to 262,144 tokens. The model accepts text, image, and video inputs and generates text output. Designed for commercial or non-commercial use under the Apache 2.0 license.

Quantization and Performance

Weights and activations of linear operators within transformer blocks are quantized to the NVFP4 data type, reducing bits per parameter from 16 to 4. This yields approximately 2.5x reduction in disk size and GPU memory requirements while preserving accuracy. The model is optimized for NVIDIA Hopper and Blackwell GPUs and is served through the vLLM inference engine.

Accuracy benchmarks demonstrate that NVFP4 quantization introduces negligible degradation compared to the FP8 baseline, as shown below (benchmarked with temperature=1.0, top_p=0.95, max_num_tokens=81920 unless otherwise noted):

Precision	MMLU Pro	GPQA Diamond	HLE	τ²-Bench Telecom	MMMU Pro	SciCode	AIME 2025	AA-LCR	IFBench
FP8	86.1	86.0	21.7	95.2	74.6	44.8	93.1	68.8	65.1
NVFP4	86.3	85.5	21.8	95.4	74.3	44.5	92.7	68.3	65.5

Baseline: Qwen3.6-27B-FP8. SciCode uses temperature=0.6; τ²-Bench Telecom uses temperature=0.0, top_p=1.0.

Deployment on Gigarouter

Gigarouter hosts this model as a managed, OpenAI-compatible API. Users send requests directly without managing infrastructure or performing local quantization. The model's optimizations enable efficient inference with minimal latency and memory footprint.

best for

·AI agent systems with tool-use capabilities
·Chatbots and conversational AI
·Retrieval-Augmented Generation (RAG) pipelines
·Long-context document analysis and reasoning

FAQ

What is this model best for?

It is best for AI agent systems, chatbots, RAG pipelines, and long-context reasoning tasks requiring up to 262K tokens of input.

How does this quantized version compare to the original FP8 version?

The NVFP4 quantization reduces memory and disk size by ~2.5x while maintaining accuracy within 0.5–1% on most benchmarks, as shown in the evaluation table.

What are the input and output formats?

Inputs can be text, image (RGB), or video (MP4/WebM). Output is text. The model accepts a string and returns a string via the API.

What license is it under?

The model is governed by the Apache 2.0 license, allowing commercial and non-commercial use.

How do I call this model via the API?

Use the gigarouter OpenAI-compatible endpoint with your API key. Set the model name to "nvidia/Qwen3.6-27B-NVFP4" in your requests.

not yet live

We're benchmarking and onboarding Qwen3.6 27B NVFP4 as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related text generation models

tiny-Qwen2ForCausalLM-2.5

9.2M dl/mo

deepseek-v4-gguf

6.4M dl/mo

Qwen3.6-35B-A3B-NVFP4

6.2M dl/mo

gemma-3-270m

5.1M dl/mo