GLM-5.2 NVFP4

nvidia/GLM-5.2-NVFP4

published Jun 2026 · updated Jun 2026

GLM-5.2 NVFP4 is a text-generation model that is a quantized version of ZAI's GLM-5.2 Mixture-of-Experts model, optimized for reasoning and coding with sparse attention and up to 1M context length.

status

coming soon

API providers

downloads / mo

190K

license

mit

specs

Task	Text Generation
Architecture	GLM-5.2 (GlmMoeDsaForCausalLM) - Mixture-of-Experts with sparse attention
Parameters	753B total, 40B activated
License	MIT License

about this model

GLM-5.2 NVFP4 is a text-generation model that combines a 753B-parameter Mixture-of-Experts (MoE) architecture with sparse attention and a 1M-token context window, optimized for reasoning, coding, and long-context tasks through 4-bit NVIDIA FP4 quantization.

Architecture and Quantization

The model is built on ZAI’s GLM-5.2 base (40B activated parameters) and quantized using NVIDIA Model Optimizer v0.46.0. Only the weights and activations of linear operators within MoE expert transformer blocks are compressed to NVFP4; the shared expert remains unquantized. This preserves accuracy while reducing memory and latency.

Benchmark Performance

Accuracy is evaluated across five benchmarks spanning graduate-level reasoning (GPQA Diamond), scientific coding (SciCode), instruction following (IFBench), long-context recall (AA-LCR), and agentic tool use (τ²-Bench Telecom). NVFP4 scores closely match the FP8 baseline:

Precision	GPQA Diamond	SciCode	IFBench	AA-LCR	τ²-Bench Telecom
Baseline (FP8)	89.52	49.85	74.95	69.38	97.90
NVFP4	89.39	49.04	75.81	70.13	98.25

Benchmarked with temperature 1.0, top_p 0.95; GPQA Diamond used max_new_tokens 100,000; others used 64,000. AA-LCR measured with SGLang; all others with vLLM.

Inference Compatibility

The model is designed for NVIDIA Blackwell GPUs (B200, B300) and runs on SGLang and vLLM. It uses a modelopt_fp4 quantization scheme and supports tool-call and reasoning parsers for agentic workflows.

Limitations

The base model may reproduce toxic or biased content from web-crawled training data. Responses can be inaccurate, incomplete, or socially undesirable even with neutral prompts. Developers should perform safety testing and mitigation before deployment.

best for

·AI agent systems and chatbots requiring long-context reasoning
·Scientific coding and instruction-following tasks
·Tool-use and agentic workflows in telecom or similar domains

FAQ

What is the context length supported by GLM-5.2 NVFP4?

It supports a context length of up to 1 million tokens.

How does the NVFP4 quantized version compare to the FP8 baseline in accuracy?

NVFP4 achieves nearly identical accuracy on benchmarks like GPQA Diamond (89.39 vs 89.52) and SciCode (49.04 vs 49.85), with slight improvements on IFBench and AA-LCR.

What hardware is required to run this model?

It is optimized for NVIDIA Blackwell GPUs (e.g., B200, B300) and requires Linux with SGLang or vLLM.

What is the license for using GLM-5.2 NVFP4?

It is governed by the MIT License, allowing both commercial and non-commercial use.

How can I call this model via the gigarouter API?

Use the gigarouter OpenAI-compatible endpoint with your API key, specifying the model name nvidia/GLM-5.2-NVFP4.

not yet live

We're benchmarking and onboarding GLM-5.2 NVFP4 as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related text generation models

tiny-Qwen2ForCausalLM-2.5

9.2M dl/mo

deepseek-v4-gguf

6.4M dl/mo

Qwen3.6-35B-A3B-NVFP4

6.2M dl/mo

gemma-3-270m

5.1M dl/mo