Vision models
Chat with images for understanding, captioning & detection via API
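Each model in this collection can be called with a single API request that pairs an image with a text prompt. The sketch below is a minimal example using the Replicate Python client; the model slug "yorickvp/llava-13b" and its "image"/"prompt" input names are assumptions based on that model's published schema, and any model listed below can be substituted.

import replicate

# Ask a vision-language model a question about an image.
# The model slug and input names are assumptions; check the model's page
# for its exact schema before running.
output = replicate.run(
    "yorickvp/llava-13b",
    input={
        "image": "https://example.com/photo.jpg",  # image URL or file handle
        "prompt": "What is happening in this picture?",
    },
)

# Many language models on Replicate stream tokens; join them into one string.
print("".join(output))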
Models in this collection, sorted by popularity (run count)
Visual instruction tuning towards large language and vision models with GPT-4 level capabilities
Low-latency, low-cost version of OpenAI's GPT-4o model
moondream2 is a small vision language model designed to run efficiently on edge devices
LLaVA v1.6: Large Language and Vision Assistant (Mistral-7B)
The most intelligent Claude model and the first hybrid reasoning model on the market (claude-3-7-sonnet-20250219)
LLaVA v1.6: Large Language and Vision Assistant (Vicuna-13B)
Claude Sonnet 4 is a significant upgrade to 3.7, delivering superior coding and reasoning while responding more precisely to your instructions
A model that generates text in response to an input image and prompt.
LLaVA v1.6: Large Language and Vision Assistant (Nous-Hermes-2-34B)
Google’s hybrid “thinking” AI model optimized for speed and cost-efficiency
Fast, affordable version of GPT-4.1
A powerful open-source visual language model
A multimodal LLM-based AI assistant trained with alignment techniques. Qwen-VL-Chat supports flexible interaction, such as multi-round question answering, as well as creative capabilities.
Anthropic's most intelligent language model to date, with a 200K token context window and image understanding (claude-3-5-sonnet-20241022)
OpenAI's high-intelligence chat model
Latest model in the Qwen family for chatting about video and image inputs
Advanced text-image comprehension and composition based on InternLM
An instruction-tuned multimodal large language model that generates text based on user-provided prompts and images
BakLLaVA-1 is a Mistral 7B base augmented with the LLaVA 1.5 architecture
Qwen2.5-Omni is an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner.
Zero-shot / open vocabulary object detection (see the detection sketch after this list)
Kosmos-G: Generating Images in Context with Multimodal Large Language Models
Ollama Llama 3.2 Vision 11B
Ollama Llama 3.2 Vision 90B
🖼️ Super fast 1.5B Image Captioning/VQA Multimodal LLM (Image-to-Text) 🖋️
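Open-vocabulary detection models like the one listed above take an image plus free-text labels and return bounding boxes, so no fixed class list is needed. The sketch below is a minimal example under the same Replicate client assumption; the slug "adirik/grounding-dino" and the "image"/"query" input names are assumptions and may differ from the deployed model's schema.

import replicate

# Open-vocabulary detection: describe the objects you want in plain text.
# The model slug and input names below are assumptions; check the model page.
detections = replicate.run(
    "adirik/grounding-dino",
    input={
        "image": "https://example.com/street.jpg",
        "query": "car, bicycle, traffic light",  # comma-separated labels
    },
)

# Output is typically a JSON-like structure with boxes, labels, and scores.
print(detections)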