skip to content
gigarouter gigarouter
models / multimodal · coming soon

Gemma 4 E4B IT

google/gemma-4-E4B-it

published Mar 2026 · updated Jun 2026

Gemma 4 E4B IT is an any-to-any model that processes text, image, and audio input to generate text output, with a 128K context window and efficient on-device deployment.

status
coming soon
API providers
0
downloads / mo
5.4M
license
apache-2.0

specs

TaskMultimodal text generation (text, image, audio input to text output)
ArchitectureDense with hybrid sliding window and global attention
Parameters4.5B effective / 8B total (with embeddings)
Context Length128K tokens
LicenseApache 2.0

about this model

Gemma 4 E4B is an instruction-tuned multimodal model that processes text, image, and audio inputs to generate text output. Developed by Google DeepMind, it belongs to the Gemma 4 family of open-weight models and is optimized for reasoning, coding, and agentic workflows. The model uses a dense architecture with 4.5 billion effective parameters (8 billion total, including per-layer embeddings) and a hybrid attention mechanism that interleaves local sliding window (512 tokens) with global attention, supporting a context window of up to 128K tokens. It supports native system prompts, function calling, and a configurable thinking mode for step-by-step reasoning. Multilingual support covers over 140 languages during pretraining, with 35+ languages tested out of the box.

Key Benchmarks

The following results are for the instruction-tuned E4B variant. Gemma 3 27B (no thinking) is included for reference where available.

BenchmarkGemma 4 E4BGemma 3 27B
MMLU Pro69.4%67.6%
AIME 2026 (no tools)42.5%20.8%
LiveCodeBench v652.0%29.1%
GPQA Diamond58.6%42.4%
MMMU (text)76.6%70.7%
MMMU Pro (vision)52.6%49.7%
MATH-Vision59.5%46.0%
MRCR v2 8 needle 128k25.4%13.5%
CoVoST (audio ASR)35.54
FLEURS (audio, lower is better)0.08

Core Capabilities

  • Thinking: Built-in reasoning mode for step-by-step responses.
  • Multimodal vision: Object detection, document/PDF parsing, OCR, handwriting recognition, chart and UI understanding, variable aspect ratios and resolutions.
  • Video understanding: Analyze video via frame sequences.
  • Audio processing: Automatic speech recognition and speech-to-text translation in multiple languages.
  • Function calling: Native tool use for agentic workflows.
  • Coding: Code generation, completion, and correction.
  • Interleaved inputs: Freely mix text and images or audio in a single prompt.

best for

FAQ

What input modalities does Gemma 4 E4B IT support?

It supports text, image, and audio input, with text output.

What is the context window size?

It has a 128K token context window.

What is the license?

It is released under Apache 2.0.

How can I use this model via gigarouter?

Access it through the gigarouter OpenAI-compatible endpoint with an API key.

What is the effective parameter count?

It has 4.5B effective parameters and 8B total including embeddings.

not yet live

We're benchmarking and onboarding Gemma 4 E4B IT as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related multimodal models

compare all →