LLaVA
multimodal-models | vision-language-models
Overview
LLaVA (Large Language and Vision Assistant) is a large multimodal model that connects a CLIP vision encoder to an LLM (Vicuna, a LLaMA derivative) through a projection layer, and was among the first open-source alternatives to GPT-4V.
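A minimal sketch of that connection is below. It is illustrative only: the dimensions assume a CLIP ViT-L/14 vision tower and a 7B LLaMA-family LLM, and the two-layer MLP follows the LLaVA-1.5 recipe rather than the original single linear projector.

```python
# Illustrative sketch of the LLaVA-style vision-language connector (not the reference code).
# Assumes CLIP patch features of width 1024 and an LLM embedding width of 4096.
import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # LLaVA-1.5 uses a two-layer MLP here; the original LLaVA used one linear layer.
        self.projector = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim) from the CLIP vision encoder.
        # The output lives in the LLM's embedding space; these visual "tokens" are
        # spliced into the text token embeddings at the image placeholder before
        # the LLM forward pass.
        return self.projector(patch_features)
```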
Key Variants
- LLaVA-1.5: Replaces the linear projection with a two-layer MLP and adds academic-task-oriented VQA data to the instruction-tuning mix (see the inference sketch after this list)
- LLaVA-1.6 (LLaVA-NeXT): Higher-resolution image input, improved OCR and reasoning, and support for additional LLM backbones
- LLaVA-OneVision: A single model that handles single-image, multi-image, and video inputs
- LLaVA++: Community extension adding Phi-3 and Llama-3 language backbones; domain-specific offshoots such as LLaVA-Med adapt the recipe to medical imaging
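A minimal inference sketch with Hugging Face transformers is shown below. It assumes the community "llava-hf/llava-1.5-7b-hf" checkpoint and a local image file named example.jpg; prompt templates differ across LLaVA variants, so the exact format should be taken from the model card.

```python
# Minimal LLaVA-1.5 inference sketch via Hugging Face transformers.
# Assumes the llava-hf/llava-1.5-7b-hf checkpoint and a local image "example.jpg".
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("example.jpg")
# LLaVA-1.5 prompt template: the <image> token marks where visual tokens are inserted.
prompt = "USER: <image>\nDescribe this picture. ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```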
Relationship to Other Projects
- Preceded later open VLMs such as CogVLM and Mini-Gemini
- Open alternative to proprietary models such as GPT-4V and Gemini