DeepSeek V4 Flash

Q: What prompt format should I use for this model?

No prompt format is specified; standard conversational or instruction-based formats may work, but you should experiment or refer to the original DeepSeek model documentation.

Q: How large is the MXFP4 GGUF file?

The single-file MXFP4 quant is 156.00 GB.

Q: What is the license for this model?

The original DeepSeek-V4-Flash model is released under the MIT license.

Q: How can I call this model via API on gigarouter?

Use the OpenAI-compatible endpoint on gigarouter with your API key, specifying the model ID bartowski/DeepSeek-V4-Flash-GGUF.

Q: What inference performance does this model offer?

Hosted providers have shown throughput ranging from 23.69 tok/s (DeepInfra) to 109.84 tok/s (Fireworks AI). Pricing varies from $0.18 to $0.28 per million output tokens.

bartowski/DeepSeek-V4-Flash-GGUF

published Jun 2026 · updated Jun 2026

DeepSeek V4 Flash is a text-generation model optimized for fast inference, with 284B total parameters and 13B activated, supporting a 1M token context.

status

coming soon

API providers

downloads / mo

234.8K

license

mit

specs

Task	Text Generation
Architecture	Hybrid Attention (CSA + HCA), FP4 + FP8 mixed precision
Parameters	284B total, 13B activated
Context Length	1,000,000 tokens
License	MIT

about this model

DeepSeek-V4-Flash is a text-generation model that combines a 284B total parameter Mixture-of-Experts (MoE) architecture with 13B activated parameters per token, supporting a 1 million token context window and using FP4 + FP8 mixed precision. It is hosted on gigarouter as an OpenAI-compatible API, eliminating the need for local installation or quantization.

Architecture and Key Strengths

The model employs hybrid attention combining Cross-Layer Attention (CSA) and Hybrid-Chunk Attention (HCA), Manifold-Constrained Hyper-Connections, and the Muon optimizer, as detailed in the technical report (arXiv:2606.19348). Its MoE design activates only 13B of its 284B total parameters per forward pass, enabling efficient inference while maintaining high capacity.

Benchmark Performance

DeepSeek-V4-Flash achieves the following scores on standard evaluations:

Benchmark	Score	Rank
SWE-bench Verified	79.0% resolved	5 (among <500B models)
MMLU-Pro	86.4%	9
GPQA Diamond	88.1%	10
Terminal-Bench 2.0	56.9%	8
SkillsBench v1.1	44.7%	5 (among <500B models)

Inference and Availability

The model is provided in MXFP4 GGUF format (156 GB, split). It is released under the MIT license. No prompt format is specified in the original model card. Through gigarouter's hosted API, developers can access this model without managing local infrastructure or quantization.

best for

·Software engineering and code generation (79.0% on SWE-bench Verified)
·Complex reasoning tasks (MMLU-Pro 86.4%, GPQA Diamond 88.1%)
·Long-context document analysis and summarization

FAQ

What prompt format should I use for this model?

No prompt format is specified; standard conversational or instruction-based formats may work, but you should experiment or refer to the original DeepSeek model documentation.

How large is the MXFP4 GGUF file?

The single-file MXFP4 quant is 156.00 GB.

What is the license for this model?

The original DeepSeek-V4-Flash model is released under the MIT license.

How can I call this model via API on gigarouter?

Use the OpenAI-compatible endpoint on gigarouter with your API key, specifying the model ID bartowski/DeepSeek-V4-Flash-GGUF.

What inference performance does this model offer?

Hosted providers have shown throughput ranging from 23.69 tok/s (DeepInfra) to 109.84 tok/s (Fireworks AI). Pricing varies from $0.18 to $0.28 per million output tokens.

not yet live

We're benchmarking and onboarding DeepSeek V4 Flash as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related text generation models

tiny-Qwen2ForCausalLM-2.5

9.2M dl/mo

deepseek-v4-gguf

6.4M dl/mo

Qwen3.6-35B-A3B-NVFP4

6.2M dl/mo

gemma-3-270m

5.1M dl/mo