BLIP-2
vision-language-models | multimodal-models
Overview
BLIP-2 (Bootstrapping Language-Image Pre-training 2) is a multimodal vision-language model from Salesforce Research. It introduces a lightweight Querying Transformer (Q-Former) that bridges a frozen vision encoder with a frozen large language model, so only the Q-Former is trained, which dramatically reduces pre-training compute compared with end-to-end vision-language models.
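A minimal sketch of the bridging idea, not the actual BLIP-2 implementation: a small set of learned query tokens cross-attends to frozen image features, and the query outputs are projected into the LLM's embedding space. The module name, dimensions, and layer choices below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class QFormerBridge(nn.Module):
    """Illustrative sketch (not the official code): learned queries cross-attend
    to frozen image features, then get projected into the frozen LLM's space."""

    def __init__(self, num_queries=32, vision_dim=1408, qformer_dim=768, llm_dim=2560):
        super().__init__()
        # Learnable query tokens that summarize the image for the LLM.
        self.queries = nn.Parameter(torch.randn(num_queries, qformer_dim))
        # Cross-attention from queries to (frozen) vision-encoder features.
        self.cross_attn = nn.MultiheadAttention(
            qformer_dim, num_heads=12, kdim=vision_dim, vdim=vision_dim, batch_first=True
        )
        # Projection into the frozen LLM's embedding dimension.
        self.proj_to_llm = nn.Linear(qformer_dim, llm_dim)

    def forward(self, image_features):
        # image_features: (batch, num_patches, vision_dim), produced by a frozen encoder.
        batch = image_features.size(0)
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)
        visual_summary, _ = self.cross_attn(q, image_features, image_features)
        # These projected tokens would be prepended to the LLM's text embeddings.
        return self.proj_to_llm(visual_summary)
```

In the real model the Q-Former is a full BERT-style transformer trained in two stages (vision-language representation learning, then generative alignment with the LLM); this sketch only shows the data flow.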
Key Innovations
- Q-Former: a lightweight querying transformer whose learnable query tokens cross-attend to frozen image features and extract the visual information most relevant to the text
- Frozen modality alignment: both the vision encoder and the LLM stay frozen; only the Q-Former (plus a small projection layer) is trained
- Strong zero-shot transfer: competitive zero-shot performance on benchmarks such as VQAv2 and OK-VQA (see the usage sketch below)
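A short usage sketch with the Hugging Face Transformers integration, assuming the public Salesforce/blip2-opt-2.7b checkpoint, a CUDA device, and a local file example.jpg:

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Load the processor and model (half precision to fit on a single GPU).
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
).to("cuda")

# Zero-shot VQA-style prompting: image plus a "Question: ... Answer:" prompt.
image = Image.open("example.jpg")
prompt = "Question: what is shown in this image? Answer:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda", torch.float16)

generated_ids = model.generate(**inputs, max_new_tokens=30)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip())
```

Omitting the text prompt turns the same call into image captioning.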
Relationship to Other Projects
- Preceded InstructBLIP, which adds instruction tuning on top of the BLIP-2 architecture, and inspired MiniGPT-4, which reuses its Q-Former
- Related to LLaVA and CogVLM as approaches that connect pretrained vision encoders to LLMs with relatively little trainable bridging