GLM-5.2 NVFP4
nvidia/GLM-5.2-NVFP4
published Jun 2026 · updated Jun 2026
GLM-5.2 NVFP4 is a text-generation model that is a quantized version of ZAI's GLM-5.2 Mixture-of-Experts model, optimized for reasoning and coding with sparse attention and up to 1M context length.
specs
| Task | Text Generation |
| Architecture | GLM-5.2 (GlmMoeDsaForCausalLM) - Mixture-of-Experts with sparse attention |
| Parameters | 753B total, 40B activated |
| License | MIT License |
about this model
Architecture and Quantization
The model is built on ZAI’s GLM-5.2 base (40B activated parameters) and quantized using NVIDIA Model Optimizer v0.46.0. Only the weights and activations of linear operators within MoE expert transformer blocks are compressed to NVFP4; the shared expert remains unquantized. This preserves accuracy while reducing memory and latency.
Benchmark Performance
Accuracy is evaluated across five benchmarks spanning graduate-level reasoning (GPQA Diamond), scientific coding (SciCode), instruction following (IFBench), long-context recall (AA-LCR), and agentic tool use (τ²-Bench Telecom). NVFP4 scores closely match the FP8 baseline:
| Precision | GPQA Diamond | SciCode | IFBench | AA-LCR | τ²-Bench Telecom |
|---|---|---|---|---|---|
| Baseline (FP8) | 89.52 | 49.85 | 74.95 | 69.38 | 97.90 |
| NVFP4 | 89.39 | 49.04 | 75.81 | 70.13 | 98.25 |
Benchmarked with temperature 1.0, top_p 0.95; GPQA Diamond used max_new_tokens 100,000; others used 64,000. AA-LCR measured with SGLang; all others with vLLM.
Inference Compatibility
The model is designed for NVIDIA Blackwell GPUs (B200, B300) and runs on SGLang and vLLM. It uses a modelopt_fp4 quantization scheme and supports tool-call and reasoning parsers for agentic workflows.
Limitations
The base model may reproduce toxic or biased content from web-crawled training data. Responses can be inaccurate, incomplete, or socially undesirable even with neutral prompts. Developers should perform safety testing and mitigation before deployment.
best for
- ·AI agent systems and chatbots requiring long-context reasoning
- ·Scientific coding and instruction-following tasks
- ·Tool-use and agentic workflows in telecom or similar domains
FAQ
It supports a context length of up to 1 million tokens.
NVFP4 achieves nearly identical accuracy on benchmarks like GPQA Diamond (89.39 vs 89.52) and SciCode (49.04 vs 49.85), with slight improvements on IFBench and AA-LCR.
It is optimized for NVIDIA Blackwell GPUs (e.g., B200, B300) and requires Linux with SGLang or vLLM.
It is governed by the MIT License, allowing both commercial and non-commercial use.
Use the gigarouter OpenAI-compatible endpoint with your API key, specifying the model name nvidia/GLM-5.2-NVFP4.
We're benchmarking and onboarding GLM-5.2 NVFP4 as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.