vLLM
llm-inference | open-source-llm-projects
Overview
vLLM is a high-throughput, memory-efficient inference and serving engine for LLMs, originally developed at UC Berkeley. It uses PagedAttention to manage KV cache memory in fixed-size blocks allocated on demand, avoiding the fragmentation and over-reservation of contiguous pre-allocation and enabling much higher serving throughput than engines that reserve KV cache for the maximum sequence length up front.
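A purely conceptual sketch of the idea (not vLLM's actual implementation; the `BlockAllocator` name and block size of 16 are illustrative assumptions): each sequence keeps a block table that maps logical token positions to physical KV-cache blocks, and a new block is claimed only when the previous one fills up.

```python
# Conceptual sketch of paged KV-cache bookkeeping (illustrative only; not vLLM's code).

class BlockAllocator:
    """Hands out fixed-size physical KV-cache blocks on demand."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))

    def allocate(self) -> int:
        return self.free_blocks.pop()      # grab any free physical block

    def free(self, block_id: int) -> None:
        self.free_blocks.append(block_id)  # freed blocks are immediately reusable


class Sequence:
    """Tracks the block table mapping logical positions to physical blocks."""

    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []
        self.num_tokens = 0

    def append_token(self) -> None:
        # A new physical block is needed only when the last one is full, so at
        # most block_size - 1 slots per sequence are ever left unused.
        if self.num_tokens % self.allocator.block_size == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1


allocator = BlockAllocator(num_blocks=1024)
seq = Sequence(allocator)
for _ in range(40):
    seq.append_token()
print(seq.block_table)  # 3 blocks cover 40 tokens at block_size=16
```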
Key Features
- PagedAttention: Virtual memory management for KV cache; reduces memory waste significantly
- Continuous batching: Dynamic batching of requests without padding to maximum sequence length
- Tensor parallelism: Distributed inference across multiple GPUs
- OpenAI-compatible API: Drop-in replacement for OpenAI API servers (see the usage sketch after this list)
- FlashAttention integration: Fast exact attention via FlashAttention-2
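A minimal usage sketch, assuming vLLM is installed and using an illustrative Hugging Face model id: the offline `LLM` API batches a list of prompts in one call, and `tensor_parallel_size` controls how many GPUs the model is sharded across.

```python
# Offline batched inference (model name and sampling values are illustrative).
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # any Hugging Face model id
    tensor_parallel_size=1,                    # >1 shards the model across GPUs
)
params = SamplingParams(temperature=0.8, max_tokens=128)
outputs = llm.generate(
    ["What is PagedAttention?", "Explain continuous batching."],
    params,
)
for out in outputs:
    print(out.outputs[0].text)
```

The same engine can be exposed as an OpenAI-compatible server (for example with `vllm serve <model>`, which listens on port 8000 by default) and queried with the stock `openai` client by pointing `base_url` at it:

```python
# Query a locally running vLLM server with the standard OpenAI client.
# Assumes the server was started separately, e.g. `vllm serve <model>`.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # must match the served model
    messages=[{"role": "user", "content": "What is PagedAttention?"}],
)
print(resp.choices[0].message.content)
```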
Relationship to Other Projects
- Competes with SGLang and llama.cpp in the LLM serving space
- Frequently used as a performance baseline in benchmarks published by other serving projects, including mlc-llm