LLaVA
multimodal-models | vision-language-models
Overview
LLaVA (Large Language and Vision Assistant) is a large multimodal model that connects a CLIP vision encoder to an LLM (Vicuna, a LLaMA derivative) through a projection layer, and was among the first open-source alternatives to GPT-4V.
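A minimal sketch of that connection is below. It is illustrative only: the dimensions assume a CLIP ViT-L/14 vision tower and a 7B LLaMA-family LLM, and the two-layer MLP follows the LLaVA-1.5 recipe rather than the original single linear projector.

```python
# Illustrative sketch of the LLaVA-style vision-language connector (not the reference code).
# Assumes CLIP patch features of width 1024 and an LLM embedding width of 4096.
import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # LLaVA-1.5 uses a two-layer MLP here; the original LLaVA used one linear layer.
        self.projector = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim) from the CLIP vision encoder.
        # The output lives in the LLM's embedding space; these visual "tokens" are
        # spliced into the text token embeddings at the image placeholder before
        # the LLM forward pass.
        return self.projector(patch_features)
```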
Key Variants
- LLaVA-1.5: Replaces the linear projection with a two-layer MLP and adds academic-task-oriented VQA data to the instruction-tuning mix (see the inference sketch after this list)
- LLaVA-1.6 (LLaVA-NeXT): Higher-resolution image input, improved OCR and reasoning, and support for additional LLM backbones
- LLaVA-OneVision: A single model that handles single-image, multi-image, and video inputs
- LLaVA++: Community extension adding Phi-3 and Llama-3 language backbones; domain-specific offshoots such as LLaVA-Med adapt the recipe to medical imaging
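A minimal inference sketch with Hugging Face transformers is shown below. It assumes the community "llava-hf/llava-1.5-7b-hf" checkpoint and a local image file named example.jpg; prompt templates differ across LLaVA variants, so the exact format should be taken from the model card.

```python
# Minimal LLaVA-1.5 inference sketch via Hugging Face transformers.
# Assumes the llava-hf/llava-1.5-7b-hf checkpoint and a local image "example.jpg".
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("example.jpg")
# LLaVA-1.5 prompt template: the <image> token marks where visual tokens are inserted.
prompt = "USER: <image>\nDescribe this picture. ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```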
Relationship to Other Projects
- Preceded later open VLMs such as CogVLM and Mini-Gemini
- Open alternative to proprietary models such as GPT-4V and Gemini