DeepSeek V4 Flash
antirez/deepseek-v4-gguf
published Apr 2026 · updated May 2026
DeepSeek V4 Flash is a text-generation model optimized for local inference via the DwarfStar engine, offering quasi-frontier performance with aggressive quantization.
specs
| Task | Text Generation |
| Architecture | Mixture of Experts (MoE) with routed and shared experts, MLA attention, and auxiliary blocks (compressor, indexer, HC) |
| Parameters | Hundreds of billions (exact count not specified) |
| License | MIT (GGUF redistribution under base model terms) |
about this model
antirez/deepseek-v4-gguf is a text-generation model that provides quantized GGUF variants of DeepSeek V4 Flash, optimized for the DwarfStar inference engine and designed for high-quality local inference on high-end personal machines.
Quantization Strategy
The model uses an asymmetric quantization recipe: routed experts (the majority of parameters) receive aggressive quantization (IQ2_XXS for gate/up, Q2_K for down in the q2 variant; Q4_K for all three in the q4 variant), while critical decision-making components such as attention projections, shared experts, and the output head are preserved at Q8_0 or higher. This approach minimizes average quality loss while reducing model size.
Available Variants
| File | Size | Routed experts (ffn_{gate,up,down}_exps) | Everything else |
|---|---|---|---|
DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2.gguf | 80.8 GiB | IQ2_XXS (gate, up) + Q2_K (down) | Q8_0 attn proj / shared experts / output, F16 router + embed + indexer + compressor + HC, F32 norms / sinks / bias |
DeepSeek-V4-Flash-Q4KExperts-F16HC-F16Compressor-F16Indexer-Q8Attn-Q8Shared-Q8Out-chat-v2.gguf | 153.3 GiB | Q4_K (all three) | same as above |
DeepSeek-V4-Flash-MTP-Q4K-Q8_0-F32.gguf | 3.6 GiB | MTP / speculative-decoding support (optional, not standalone) | |
The q2 variant is intended for machines with 128 GB RAM (or 96 GB with SSD streaming), while the q4 variant suits systems with ≥256 GB RAM. An optional MTP file enables speculative decoding.
Key Features
- Official-vector validation: Logits are validated against the official DeepSeek implementation at multiple context sizes.
- SSD streaming for KV cache: The DwarfStar engine treats KV cache as a disk citizen, enabling models to run on machines where the model does not fully fit in RAM.
- Multi-backend support: Metal (primary), NVIDIA CUDA/DGX Spark, and Strix Halo (ROCm) are supported.
- Distributed inference: The engine includes recently introduced distributed inference capabilities.
The model also supports DeepSeek V4 PRO on very high-memory machines. It is distributed under the MIT license, with the base model copyright held by DeepSeek.
best for
- ·Running a powerful open-weight model locally on high-end Mac or Linux machines with 128 GB+ RAM
- ·Long-context coding assistance and agent integration with tool calling support
- ·Speculative decoding with the optional MTP module for faster generation
FAQ
The q2 variant requires at least 128 GB of RAM; MacBooks with 96 GB can use SSD streaming for KV cache.
Metal (macOS, primary), NVIDIA CUDA / DGX Spark, and Strix Halo (ROCm).
Routed experts are aggressively quantized (IQ2_XXS/Q2_K for q2) while decision-making components like router and projections stay at Q8_0 or higher, preserving model behavior.
The GGUF files are specific to the DwarfStar engine; they may work with others, but the MTP module requires a compatible loader.
Use the OpenAI-compatible endpoint with your gigarouter API key, specifying the model name as DeepSeek V4 Flash.
We're benchmarking and onboarding DeepSeek V4 Flash as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.