tags: ai, llm, multimodal, vision, vlm | type: entity | Created: 2026-04-27 | Updated: 2026-04-27

LLaVA

multimodal-models | vision-language-models

Overview

LLaVA (Large Language and Vision Assistant) is a large multimodal model that connects a CLIP vision encoder to an LLM (Vicuna/LLaMA) through a lightweight projection layer and is trained with visual instruction tuning. It was one of the first open-source alternatives to GPT-4V-style assistants.
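A minimal PyTorch sketch of this connection, as a rough illustration rather than the actual implementation: CLIP patch features are projected into the LLM's embedding space and prepended to the text token embeddings. Dimensions assume CLIP ViT-L/14 at 336px (576 patches, 1024-dim) and a 4096-dim Vicuna-style LLM; both are illustrative.

```python
import torch
import torch.nn as nn

class Projector(nn.Module):
    """Two-layer MLP projector (LLaVA-1.5 style); the original LLaVA used a single linear layer."""
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, image_features):            # (batch, num_patches, vision_dim)
        return self.net(image_features)            # (batch, num_patches, llm_dim)

# Toy forward pass: CLIP patch features -> projected "visual tokens" -> concatenated with text embeddings.
clip_patches  = torch.randn(1, 576, 1024)          # CLIP ViT-L/14 @ 336px yields 24x24 = 576 patches
visual_tokens = Projector()(clip_patches)          # (1, 576, 4096)
text_embeds   = torch.randn(1, 32, 4096)           # embedded prompt tokens from the LLM's embedding table
llm_input = torch.cat([visual_tokens, text_embeds], dim=1)  # single sequence fed to the LLM
print(llm_input.shape)                             # torch.Size([1, 608, 4096])
```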

Key Variants

  • LLaVA-1.5: Replaces the linear projection with a two-layer MLP and fine-tunes on academic-task VQA datasets, with Vicuna-1.5 as the LLM (see the inference sketch after this list)
  • LLaVA-1.6 (LLaVA-NeXT): Adds higher-resolution "AnyRes" image input and improves OCR and reasoning; later releases swap in stronger LLM backbones such as Llama-3
  • LLaVA-OneVision: Single model handling single-image, multi-image, and video inputs
  • LLaVA++: Extends LLaVA with newer LLM backbones (Phi-3, Llama-3); separate domain variants such as LLaVA-Med target medical imaging
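LLaVA-1.5 checkpoints can be run through Hugging Face transformers' LlavaForConditionalGeneration. A minimal inference sketch, assuming the community llava-hf/llava-1.5-7b-hf checkpoint and a local image file (both are assumptions, not part of this note):

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("example.jpg")                      # replace with your own image
prompt = "USER: <image>\nDescribe this image. ASSISTANT:"

# The processor turns the image into pixel_values and the prompt into input_ids,
# inserting placeholder tokens where the projected visual tokens will go.
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)
out = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(out[0], skip_special_tokens=True))
```

The "USER: ... ASSISTANT:" template matches the Vicuna-style chat format that LLaVA-1.5 was instruction-tuned on; other variants (e.g., LLaVA-NeXT) use different prompt templates and model classes.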

Relationship to Other Projects

References