DeepSeek V4 Flash
bartowski/DeepSeek-V4-Flash-GGUF
published Jun 2026 · updated Jun 2026
DeepSeek V4 Flash is a text-generation model optimized for fast inference, with 284B total parameters and 13B activated, supporting a 1M token context.
specs
| Task | Text Generation |
| Architecture | Hybrid Attention (CSA + HCA), FP4 + FP8 mixed precision |
| Parameters | 284B total, 13B activated |
| Context Length | 1,000,000 tokens |
| License | MIT |
about this model
DeepSeek-V4-Flash is a text-generation model that combines a 284B total parameter Mixture-of-Experts (MoE) architecture with 13B activated parameters per token, supporting a 1 million token context window and using FP4 + FP8 mixed precision. It is hosted on gigarouter as an OpenAI-compatible API, eliminating the need for local installation or quantization.
Architecture and Key Strengths
The model employs hybrid attention combining Cross-Layer Attention (CSA) and Hybrid-Chunk Attention (HCA), Manifold-Constrained Hyper-Connections, and the Muon optimizer, as detailed in the technical report (arXiv:2606.19348). Its MoE design activates only 13B of its 284B total parameters per forward pass, enabling efficient inference while maintaining high capacity.
Benchmark Performance
DeepSeek-V4-Flash achieves the following scores on standard evaluations:
| Benchmark | Score | Rank |
|---|---|---|
| SWE-bench Verified | 79.0% resolved | 5 (among <500B models) |
| MMLU-Pro | 86.4% | 9 |
| GPQA Diamond | 88.1% | 10 |
| Terminal-Bench 2.0 | 56.9% | 8 |
| SkillsBench v1.1 | 44.7% | 5 (among <500B models) |
Inference and Availability
The model is provided in MXFP4 GGUF format (156 GB, split). It is released under the MIT license. No prompt format is specified in the original model card. Through gigarouter's hosted API, developers can access this model without managing local infrastructure or quantization.
best for
- ·Software engineering and code generation (79.0% on SWE-bench Verified)
- ·Complex reasoning tasks (MMLU-Pro 86.4%, GPQA Diamond 88.1%)
- ·Long-context document analysis and summarization
FAQ
No prompt format is specified; standard conversational or instruction-based formats may work, but you should experiment or refer to the original DeepSeek model documentation.
The single-file MXFP4 quant is 156.00 GB.
The original DeepSeek-V4-Flash model is released under the MIT license.
Use the OpenAI-compatible endpoint on gigarouter with your API key, specifying the model ID bartowski/DeepSeek-V4-Flash-GGUF.
Hosted providers have shown throughput ranging from 23.69 tok/s (DeepInfra) to 109.84 tok/s (Fireworks AI). Pricing varies from $0.18 to $0.28 per million output tokens.
We're benchmarking and onboarding DeepSeek V4 Flash as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.