Gemma 4 E4B IT

google/gemma-4-E4B-it

published Mar 2026 · updated Jun 2026

Gemma 4 E4B IT is an any-to-any model that processes text, image, and audio input to generate text output, with a 128K context window and efficient on-device deployment.

status

coming soon

API providers

downloads / mo

5.4M

license

apache-2.0

specs

Task	Multimodal text generation (text, image, audio input to text output)
Architecture	Dense with hybrid sliding window and global attention
Parameters	4.5B effective / 8B total (with embeddings)
Context Length	128K tokens
License	Apache 2.0

about this model

Gemma 4 E4B is an instruction-tuned multimodal model that processes text, image, and audio inputs to generate text output. Developed by Google DeepMind, it belongs to the Gemma 4 family of open-weight models and is optimized for reasoning, coding, and agentic workflows. The model uses a dense architecture with 4.5 billion effective parameters (8 billion total, including per-layer embeddings) and a hybrid attention mechanism that interleaves local sliding window (512 tokens) with global attention, supporting a context window of up to 128K tokens. It supports native system prompts, function calling, and a configurable thinking mode for step-by-step reasoning. Multilingual support covers over 140 languages during pretraining, with 35+ languages tested out of the box.

Key Benchmarks

The following results are for the instruction-tuned E4B variant. Gemma 3 27B (no thinking) is included for reference where available.

Benchmark	Gemma 4 E4B	Gemma 3 27B
MMLU Pro	69.4%	67.6%
AIME 2026 (no tools)	42.5%	20.8%
LiveCodeBench v6	52.0%	29.1%
GPQA Diamond	58.6%	42.4%
MMMU (text)	76.6%	70.7%
MMMU Pro (vision)	52.6%	49.7%
MATH-Vision	59.5%	46.0%
MRCR v2 8 needle 128k	25.4%	13.5%
CoVoST (audio ASR)	35.54	—
FLEURS (audio, lower is better)	0.08	—

Core Capabilities

Thinking: Built-in reasoning mode for step-by-step responses.
Multimodal vision: Object detection, document/PDF parsing, OCR, handwriting recognition, chart and UI understanding, variable aspect ratios and resolutions.
Video understanding: Analyze video via frame sequences.
Audio processing: Automatic speech recognition and speech-to-text translation in multiple languages.
Function calling: Native tool use for agentic workflows.
Coding: Code generation, completion, and correction.
Interleaved inputs: Freely mix text and images or audio in a single prompt.

best for

·On-device text and image generation
·Code generation and reasoning
·Audio transcription and speech-to-text translation

FAQ

What input modalities does Gemma 4 E4B IT support?

It supports text, image, and audio input, with text output.

What is the context window size?

It has a 128K token context window.

What is the license?

It is released under Apache 2.0.

How can I use this model via gigarouter?

Access it through the gigarouter OpenAI-compatible endpoint with an API key.

What is the effective parameter count?

It has 4.5B effective parameters and 8B total including embeddings.

not yet live

We're benchmarking and onboarding Gemma 4 E4B IT as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related multimodal models

compare all →

gemma-4-12B-it

3M dl/mo

gemma-4-E2B-it-qat-mobile-transformers

22.2K dl/mo