models / vision-language · coming soon

llava-1.5-7b-hf

llava-hf/llava-1.5-7b-hf

A popular open vision-language model, with 3.2M downloads a month. gigarouter benchmarks and hosts it as an OpenAI-compatible API.

est. price

~$1.341

/ 1k images · estimated, set at launch

API providers

downloads / mo

3.2M

license

llama2

about this model

LLaVA-1.5-7B is a vision-language model that processes images and text to generate multimodal responses, fine-tuned from LLaMA/Vicuna on GPT-generated instruction-following data. It is an auto-regressive transformer-based chatbot trained in September 2023.

Capabilities

The model supports multi-image and multi-prompt generation, allowing multiple images to be included in a single conversation. It requires a specific prompt template (USER: xxx\nASSISTANT:) with <image> tokens placed where visual input is queried.

Ideal Use Cases

Visual question answering (VQA)
Image captioning and description
Multimodal dialogue that combines text and visual context

Performance

Benchmark numbers are not provided in the model card. For detailed evaluation results, refer to the original LLaVA publication at llava-vl.github.io.

LLaVA model architecture diagram

not yet live

We're benchmarking and onboarding llava-1.5-7b-hf as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.