skip to content
gigarouter gigarouter
models / vision-language · coming soon

llava-1.5-7b-hf

llava-hf/llava-1.5-7b-hf

A popular open vision-language model, with 3.2M downloads a month. gigarouter benchmarks and hosts it as an OpenAI-compatible API.

est. price
~$1.341
/ 1k images · estimated, set at launch
API providers
0
downloads / mo
3.2M
license
llama2

about this model

LLaVA-1.5-7B is a vision-language model that processes images and text to generate multimodal responses, fine-tuned from LLaMA/Vicuna on GPT-generated instruction-following data. It is an auto-regressive transformer-based chatbot trained in September 2023.

Capabilities

The model supports multi-image and multi-prompt generation, allowing multiple images to be included in a single conversation. It requires a specific prompt template (USER: xxx\nASSISTANT:) with <image> tokens placed where visual input is queried.

Ideal Use Cases

  • Visual question answering (VQA)
  • Image captioning and description
  • Multimodal dialogue that combines text and visual context

Performance

Benchmark numbers are not provided in the model card. For detailed evaluation results, refer to the original LLaVA publication at llava-vl.github.io.

LLaVA model architecture diagram

not yet live

We're benchmarking and onboarding llava-1.5-7b-hf as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.