Gemma 4 E4B IT
google/gemma-4-E4B-it
published Mar 2026 · updated Jun 2026
Gemma 4 E4B IT is an any-to-any model that processes text, image, and audio input to generate text output, with a 128K context window and efficient on-device deployment.
specs
| Task | Multimodal text generation (text, image, audio input to text output) |
| Architecture | Dense with hybrid sliding window and global attention |
| Parameters | 4.5B effective / 8B total (with embeddings) |
| Context Length | 128K tokens |
| License | Apache 2.0 |
about this model
Gemma 4 E4B is an instruction-tuned multimodal model that processes text, image, and audio inputs to generate text output. Developed by Google DeepMind, it belongs to the Gemma 4 family of open-weight models and is optimized for reasoning, coding, and agentic workflows. The model uses a dense architecture with 4.5 billion effective parameters (8 billion total, including per-layer embeddings) and a hybrid attention mechanism that interleaves local sliding window (512 tokens) with global attention, supporting a context window of up to 128K tokens. It supports native system prompts, function calling, and a configurable thinking mode for step-by-step reasoning. Multilingual support covers over 140 languages during pretraining, with 35+ languages tested out of the box.
Key Benchmarks
The following results are for the instruction-tuned E4B variant. Gemma 3 27B (no thinking) is included for reference where available.
| Benchmark | Gemma 4 E4B | Gemma 3 27B |
|---|---|---|
| MMLU Pro | 69.4% | 67.6% |
| AIME 2026 (no tools) | 42.5% | 20.8% |
| LiveCodeBench v6 | 52.0% | 29.1% |
| GPQA Diamond | 58.6% | 42.4% |
| MMMU (text) | 76.6% | 70.7% |
| MMMU Pro (vision) | 52.6% | 49.7% |
| MATH-Vision | 59.5% | 46.0% |
| MRCR v2 8 needle 128k | 25.4% | 13.5% |
| CoVoST (audio ASR) | 35.54 | — |
| FLEURS (audio, lower is better) | 0.08 | — |
Core Capabilities
- Thinking: Built-in reasoning mode for step-by-step responses.
- Multimodal vision: Object detection, document/PDF parsing, OCR, handwriting recognition, chart and UI understanding, variable aspect ratios and resolutions.
- Video understanding: Analyze video via frame sequences.
- Audio processing: Automatic speech recognition and speech-to-text translation in multiple languages.
- Function calling: Native tool use for agentic workflows.
- Coding: Code generation, completion, and correction.
- Interleaved inputs: Freely mix text and images or audio in a single prompt.
best for
- ·On-device text and image generation
- ·Code generation and reasoning
- ·Audio transcription and speech-to-text translation
FAQ
It supports text, image, and audio input, with text output.
It has a 128K token context window.
It is released under Apache 2.0.
Access it through the gigarouter OpenAI-compatible endpoint with an API key.
It has 4.5B effective parameters and 8B total including embeddings.
We're benchmarking and onboarding Gemma 4 E4B IT as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.