Hosted multimodal models
3 models · 0 live as APIs · benchmarked & compared
Multimodal models process and generate content across multiple data types—such as text, images, and audio—enabling systems to understand and respond to complex inputs. For example, a multimodal model can caption a photograph, answer a question about a diagram, or extract information from a scanned invoice. These capabilities solve real-world problems in accessibility (generating alt-text for images), document analysis, and visual search.
In production, multimodal models are typically integrated into pipelines that accept diverse inputs: an image is encoded, its features are combined with a text prompt, and the model outputs a relevant response. They are used in customer support bots that analyse screenshots, in content moderation to flag inappropriate visuals, and in retrieval-augmented generation (RAG) systems that search both images and text.
When choosing among models, the primary trade-off is output quality against inference speed and cost. Larger models (e.g., a 400B-parameter variant) offer higher fidelity but require more compute and respond with higher latency, while smaller models (e.g., a 12B-parameter variant) run faster and cheaper, which suits high-throughput or latency-sensitive applications. The third model listed, google/gemma-4-E2B-it-qat-mobile-transformers, is a quantized variant that prioritises efficiency, particularly for mobile or edge deployment.
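The request side of the pipeline described above (an image combined with a text prompt) might look like the following minimal sketch. None of the listed models is live as an API yet, so the payload schema, the field names, and the helper `build_multimodal_request` are all hypothetical; only the base64 encoding of binary data for transport inside JSON is standard practice.

```python
import base64
import json


def build_multimodal_request(model: str, prompt: str, image_bytes: bytes,
                             mime_type: str = "image/png") -> dict:
    """Pack one image and one text prompt into a JSON-serialisable request.

    The field names below are hypothetical, not a real provider schema.
    """
    # Binary image data is base64-encoded so it can travel inside JSON.
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "inputs": [
            {"type": "image", "mime_type": mime_type, "data": encoded},
            {"type": "text", "text": prompt},
        ],
    }


if __name__ == "__main__":
    # Placeholder bytes stand in for a real screenshot read from disk.
    request = build_multimodal_request(
        "google/gemma-4-12B-it",
        "What error message is shown in this screenshot?",
        b"\x89PNG placeholder",
    )
    print(json.dumps(request, indent=2))
```

Once an endpoint is available, the same dictionary would be adapted to the provider's documented schema and sent as the body of an HTTP POST.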
For most call volumes, a hosted API removes the operational overhead of provisioning GPUs, managing scaling, and maintaining infrastructure, which makes it more practical than self-hosting.
| model | params | downloads/mo | price | status |
|---|---|---|---|---|
| google/gemma-4-E4B-it | 7996.2M | 5.4M | at launch | coming soon |
| google/gemma-4-12B-it | 11959.7M | 3M | at launch | coming soon |
| google/gemma-4-E2B-it-qat-mobile-transformers | 2337.4M | 22.2K | at launch | coming soon |
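
The size-versus-cost trade-off can be illustrated with the parameter counts from the table. The sketch below picks the largest listed model that fits a parameter budget. It assumes parameter count is a rough proxy for both quality and serving cost, which is a simplification, and the helper `largest_within_budget` is hypothetical.

```python
from typing import Optional

# Parameter counts in millions, copied from the comparison table.
MODEL_PARAMS_M = {
    "google/gemma-4-E4B-it": 7996.2,
    "google/gemma-4-12B-it": 11959.7,
    "google/gemma-4-E2B-it-qat-mobile-transformers": 2337.4,
}


def largest_within_budget(max_params_m: float) -> Optional[str]:
    """Return the largest listed model whose parameter count fits the budget.

    Returns None when no listed model is small enough.
    """
    candidates = {name: params for name, params in MODEL_PARAMS_M.items()
                  if params <= max_params_m}
    if not candidates:
        return None
    # Assumption: among models that fit, more parameters means higher quality.
    return max(candidates, key=candidates.get)


if __name__ == "__main__":
    # Budgets in millions of parameters, chosen only for illustration.
    for budget in (3000, 9000, 20000):
        print(budget, largest_within_budget(budget))
```

A real selection would also weigh measured latency, price, and benchmark scores once those are published.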