Hosted grounding models

1 model · 0 live as APIs · benchmarked & compared

Grounding models align natural language descriptions with spatial locations in images: given a textual query, they identify and localize the matching objects. For example, an autonomous vehicle can use a grounding model to find a "stop sign obscured by foliage" in a camera frame, a manufacturing system can locate "scratched surface defects" on an assembly line, and a robotics application can instruct a gripper to "pick up the red cup on the left". Because the target is described in text at query time, these models reduce the need to annotate bounding boxes manually for each new scene or object class, enabling flexible, query-driven perception.

In production, grounding models are typically integrated as an intermediate stage in a larger pipeline. An image and a text prompt are sent to the model, which returns coordinates or masks. Downstream components then act on those locations, for example by triggering an alert, moving a robot arm, or cropping the region for further analysis. The model can be called inside a real-time loop or in batch jobs, depending on latency requirements.
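The request-and-response flow described above can be sketched as follows. This is a minimal illustration, not a documented API: the payload fields (`prompt`, `image_b64`), the response shape (`detections` with `label`, `score`, and a normalized `box`), and the helper names are all hypothetical, and no network call is made.

```python
import base64
import json
from dataclasses import dataclass


@dataclass
class Detection:
    label: str
    score: float
    box: tuple[float, float, float, float]  # assumed normalized (x_min, y_min, x_max, y_max)


def build_request(image_bytes: bytes, prompt: str) -> dict:
    """Package an image and a text query into a JSON-serializable payload (hypothetical schema)."""
    return {
        "model": "nvidia/LocateAnything-3B",
        "prompt": prompt,
        "image_b64": base64.b64encode(image_bytes).decode("ascii"),
    }


def parse_response(body: str, min_score: float = 0.5) -> list[Detection]:
    """Parse the assumed response shape and keep detections at or above min_score."""
    detections = [
        Detection(d["label"], d["score"], tuple(d["box"]))
        for d in json.loads(body)["detections"]
    ]
    return [d for d in detections if d.score >= min_score]


def to_pixel_box(det: Detection, width: int, height: int) -> tuple[int, int, int, int]:
    """Convert a normalized box to pixel coordinates for a downstream crop."""
    x0, y0, x1, y1 = det.box
    return (round(x0 * width), round(y0 * height), round(x1 * width), round(y1 * height))


# A made-up response standing in for what a grounding model might return.
sample = json.dumps({"detections": [
    {"label": "red cup", "score": 0.91, "box": [0.10, 0.40, 0.30, 0.80]},
    {"label": "red cup", "score": 0.22, "box": [0.60, 0.45, 0.75, 0.85]},
]})

payload = build_request(b"<jpeg bytes>", "the red cup on the left")
hits = parse_response(sample)
crop = to_pixel_box(hits[0], width=640, height=480)
```

Filtering by score before acting on a box is a common precaution in such pipelines. The 0.5 threshold here is arbitrary and would need tuning per application.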

Choosing between grounding models means balancing size, quality, and speed. Larger models, such as nvidia/LocateAnything-3B (3 billion parameters), generally offer higher precision and recall across diverse queries but require more compute and incur higher latency. Smaller models run inference faster and cost less per call but may miss fine-grained details or struggle with ambiguous prompts. The right choice depends on how much latency your application can tolerate and how much accuracy it needs.

At most call volumes, a hosted API is a better fit than self-hosting: it removes the burden of GPU provisioning, scaling, and ongoing maintenance, and per-request pricing keeps costs predictable.
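To make the hosted-versus-self-hosted trade-off concrete, the arithmetic can be sketched as below. The per-image price is the one listed for nvidia/LocateAnything-3B in the comparison table. The self-hosted monthly figure is a purely hypothetical placeholder, to be replaced with your own GPU, operations, and staffing estimate.

```python
import math

PRICE_PER_1K_IMAGES_USD = 0.939  # listed price for nvidia/LocateAnything-3B


def hosted_cost(images_per_month: int) -> float:
    """Monthly hosted-API cost in USD at the listed per-1k-image price."""
    return images_per_month / 1000 * PRICE_PER_1K_IMAGES_USD


def break_even_images(self_hosted_monthly_usd: float) -> int:
    """Monthly image volume at which hosted spend reaches a given self-hosting budget."""
    return math.ceil(self_hosted_monthly_usd / PRICE_PER_1K_IMAGES_USD * 1000)


# Hypothetical all-in monthly cost of running your own GPU; not a measured figure.
ASSUMED_SELF_HOSTED_USD = 1500.0

monthly = hosted_cost(100_000)                       # about 93.90 USD for 100k images
threshold = break_even_images(ASSUMED_SELF_HOSTED_USD)  # roughly 1.6 million images
```

Under that assumed budget, self-hosting would only start to pay off at well over a million images per month, before accounting for engineering time.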

compare

model	params	downloads/mo	price	status
nvidia/LocateAnything-3B	3B	-	$0.939 / 1k images	coming soon
