skip to content
gigarouter gigarouter
models / text generation · coming soon

DeepSeek V4 Flash

antirez/deepseek-v4-gguf

published Apr 2026 · updated May 2026

DeepSeek V4 Flash is a text-generation model optimized for local inference via the DwarfStar engine, offering quasi-frontier performance with aggressive quantization.

status
coming soon
API providers
0
downloads / mo
6.4M
license
mit

specs

TaskText Generation
ArchitectureMixture of Experts (MoE) with routed and shared experts, MLA attention, and auxiliary blocks (compressor, indexer, HC)
ParametersHundreds of billions (exact count not specified)
LicenseMIT (GGUF redistribution under base model terms)

about this model

antirez/deepseek-v4-gguf is a text-generation model that provides quantized GGUF variants of DeepSeek V4 Flash, optimized for the DwarfStar inference engine and designed for high-quality local inference on high-end personal machines.

Quantization Strategy

The model uses an asymmetric quantization recipe: routed experts (the majority of parameters) receive aggressive quantization (IQ2_XXS for gate/up, Q2_K for down in the q2 variant; Q4_K for all three in the q4 variant), while critical decision-making components such as attention projections, shared experts, and the output head are preserved at Q8_0 or higher. This approach minimizes average quality loss while reducing model size.

Available Variants

FileSizeRouted experts (ffn_{gate,up,down}_exps)Everything else
DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2.gguf80.8 GiBIQ2_XXS (gate, up) + Q2_K (down)Q8_0 attn proj / shared experts / output, F16 router + embed + indexer + compressor + HC, F32 norms / sinks / bias
DeepSeek-V4-Flash-Q4KExperts-F16HC-F16Compressor-F16Indexer-Q8Attn-Q8Shared-Q8Out-chat-v2.gguf153.3 GiBQ4_K (all three)same as above
DeepSeek-V4-Flash-MTP-Q4K-Q8_0-F32.gguf3.6 GiBMTP / speculative-decoding support (optional, not standalone)

The q2 variant is intended for machines with 128 GB RAM (or 96 GB with SSD streaming), while the q4 variant suits systems with ≥256 GB RAM. An optional MTP file enables speculative decoding.

Key Features

  • Official-vector validation: Logits are validated against the official DeepSeek implementation at multiple context sizes.
  • SSD streaming for KV cache: The DwarfStar engine treats KV cache as a disk citizen, enabling models to run on machines where the model does not fully fit in RAM.
  • Multi-backend support: Metal (primary), NVIDIA CUDA/DGX Spark, and Strix Halo (ROCm) are supported.
  • Distributed inference: The engine includes recently introduced distributed inference capabilities.

The model also supports DeepSeek V4 PRO on very high-memory machines. It is distributed under the MIT license, with the base model copyright held by DeepSeek.

best for

FAQ

What is the minimum RAM required to run this model?

The q2 variant requires at least 128 GB of RAM; MacBooks with 96 GB can use SSD streaming for KV cache.

What backends does the DwarfStar engine support?

Metal (macOS, primary), NVIDIA CUDA / DGX Spark, and Strix Halo (ROCm).

How does the quantization recipe affect quality?

Routed experts are aggressively quantized (IQ2_XXS/Q2_K for q2) while decision-making components like router and projections stay at Q8_0 or higher, preserving model behavior.

Can I use this model with other inference engines?

The GGUF files are specific to the DwarfStar engine; they may work with others, but the MTP module requires a compatible loader.

How do I call this model via the gigarouter API?

Use the OpenAI-compatible endpoint with your gigarouter API key, specifying the model name as DeepSeek V4 Flash.

not yet live

We're benchmarking and onboarding DeepSeek V4 Flash as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related text generation models

compare all →