Caption Images
Use AI To Caption Images with an API
Models in this collection
Sort: by popularity (run_count)
Generate image captions
Visual instruction tuning towards large language and vision models with GPT-4 level capabilities
Answers questions about images
moondream2 is a small vision language model designed to run efficiently on edge devices
The CLIP Interrogator is a prompt engineering tool that combines OpenAI's CLIP and Salesforce's BLIP to optimize text prompts to match a given image. Use the resulting prompts with text-to-image models like Stable Diffusion to create cool art!
Get an approximate text prompt, with style, matching an image. (Optimized for Stable Diffusion (CLIP ViT-L/14))
A model which generates text in response to an input image and prompt.
Simple image captioning model using CLIP and GPT-2
BLIP-3 / XGen-MM, answers questions about images ({blip3,xgen-mm}-phi3-mini-base-r-v1)
allenai/Molmo-7B-D-0924, answers questions about images and captions them
CLIP Interrogator for SDXL optimizes text prompts to match a given image
A multimodal LLM-based AI assistant, which is trained with alignment techniques. Qwen-VL-Chat supports more flexible interaction, such as multi-round question answering, and creative capabilities.
Latest model in the Qwen family for chatting about videos and images
Fine-grained Image Captioning with CLIP Reward
An instruction-tuned multi-modal model based on BLIP-2 and Vicuna-13B
Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks
An instruction-tuned multimodal large language model that generates text based on user-provided prompts and images
Fuyu-8B is a multi-modal text and image transformer trained by Adept AI
datasets: Flickr8k
SmolVLM-Instruct by HuggingFaceTB
Projection module trained to add vision capabilities to Llama 3 using SigLIP
Ollama Llama 3.2 Vision 11B
Ollama Llama 3.2 Vision 90B
Idefics3-8B-Llama3, answers questions about images and captions them
🖼️ Super fast 1.5B Image Captioning/VQA Multimodal LLM (Image-to-Text) 🖋️
A wrapper model for captioning multiple images using GPT, Claude, or Gemini, useful for LoRA training
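
The CLIP Interrogator entries above describe matching text prompts to a given image with CLIP. A minimal sketch of that ranking idea follows, assuming the image and the candidate phrases have already been embedded into a shared vector space. The toy vectors and the helper names `cosine_similarity` and `rank_phrases` are hypothetical and purely illustrative; this is not the CLIP Interrogator's actual implementation, which works on real CLIP embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_phrases(image_embedding, phrase_embeddings, top_k=2):
    """Return the top_k phrases whose embeddings are closest to the image embedding."""
    scored = sorted(
        phrase_embeddings.items(),
        key=lambda item: cosine_similarity(image_embedding, item[1]),
        reverse=True,
    )
    return [phrase for phrase, _ in scored[:top_k]]


# Toy 3-dimensional embeddings (assumed values; real CLIP vectors have hundreds of dimensions).
image_embedding = [0.9, 0.1, 0.3]
phrase_embeddings = {
    "a photo of a dog": [0.8, 0.2, 0.3],
    "oil painting": [0.1, 0.9, 0.2],
    "studio lighting": [0.7, 0.0, 0.6],
}

best = rank_phrases(image_embedding, phrase_embeddings)
print(", ".join(best))
```

Joining the best-scoring phrases gives an approximate prompt that could then be passed to a text-to-image model such as Stable Diffusion.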