Welcome to vLLM!
vLLM is flexible and easy to use with:
Streaming outputs
OpenAI-compatible API server
Support for NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs, Gaudi® accelerators and GPUs, PowerPC CPUs, TPUs, and AWS Trainium and Inferentia accelerators
Prefix caching support
Multi-LoRA support
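To make the features above concrete, here is a minimal offline-inference sketch using vLLM's Python API (LLM and SamplingParams). The model name facebook/opt-125m and the sampling settings are only illustrative choices; any model supported by vLLM works.

```python
# Minimal offline batched inference with vLLM.
# The model name below is just an example; substitute any supported model.
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The capital of France is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

llm = LLM(model="facebook/opt-125m")  # downloads weights on first use
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```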
For more information, check out the following:
vLLM announcing blog post (intro to PagedAttention)
vLLM paper (SOSP 2023)
How continuous batching enables 23x throughput in LLM inference while reducing p50 latency by Cade Daniel et al.
vLLM Meetups.
Documentation
Getting Started
Installation
Installation with ROCm
Installation with OpenVINO
Installation with CPU
Installation with Intel® Gaudi® AI Accelerators
Installation for ARM CPUs
Installation with Neuron
Installation with TPU
Installation with XPU
Quickstart
Debugging Tips
Examples
Serving
OpenAI Compatible Server
Deploying with Docker
Deploying with Kubernetes
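The Serving pages above cover the OpenAI-compatible server in detail; as a quick orientation, the sketch below queries a locally running server with the official openai Python client. The launch command, port, and model name are illustrative assumptions, not fixed requirements.

```python
# Sketch: querying a locally running vLLM OpenAI-compatible server.
# Assumes the server was started with something like:
#   vllm serve meta-llama/Llama-3.1-8B-Instruct
# (model name and the default port 8000 are illustrative assumptions).
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # vLLM's OpenAI-compatible endpoint
    api_key="EMPTY",                      # no real key needed unless the server sets one
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)
```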
Introduction
Implementation
Performance
Benchmark Suites
Community
vLLM Meetups
Sponsors
API Documentation
Sampling Parameters
SamplingParams
Pooling Parameters
PoolingParams
Offline Inference
LLM Class
LLM Inputs
vLLM Engine
LLMEngine
AsyncLLMEngine
Design
Architecture Overview
Entrypoints
LLM Engine
Worker
Model Runner
Model
Class Hierarchy
Integration with HuggingFace
vLLM’s Plugin System
How Plugins Work in vLLM
How vLLM Discovers Plugins
What Can Plugins Do?
Compatibility Guarantee
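For orientation on the plugin pages above, the sketch below shows the general shape of a vLLM plugin: a separate package that exposes a registration function through a Python entry point, which vLLM discovers and calls at startup. The entry-point group name vllm.general_plugins and the ModelRegistry call follow the plugin-system docs, but verify them against the version you run; the package, module, and model names here are made-up placeholders.

```python
# --- setup.py of a hypothetical plugin package "vllm_add_dummy_model" ---
# vLLM discovers plugins through Python entry points; the group name
# "vllm.general_plugins" is taken from the plugin-system docs (assumption).
from setuptools import setup

setup(
    name="vllm_add_dummy_model",
    version="0.1",
    packages=["vllm_add_dummy_model"],
    entry_points={
        "vllm.general_plugins": [
            "register_dummy_model = vllm_add_dummy_model:register",
        ]
    },
)

# --- vllm_add_dummy_model/__init__.py ---
def register():
    # Called by vLLM in every process at startup; keep it limited to
    # registration. Registering an out-of-tree model is one example of
    # what a plugin can do.
    from vllm import ModelRegistry
    ModelRegistry.register_model(
        "MyLlava", "vllm_add_dummy_model.my_llava:MyLlavaForConditionalGeneration"
    )
```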
Input Processing
Guides
Module Contents
vLLM Paged Attention
Inputs
Concepts
Query
Key
QK
Softmax
Value
LV
Output
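The Query, Key, QK, Softmax, Value, LV, and Output entries above name the intermediate quantities walked through in the paged-attention kernel guide. As a plain (non-paged) reference, the NumPy sketch below computes those quantities for a single query token; the shapes and scaling factor follow standard scaled dot-product attention, not vLLM's kernel code.

```python
# Reference (non-paged) scaled dot-product attention for one query token,
# mirroring the quantities named in the kernel walkthrough:
# Query (q), Key (K), QK logits, Softmax weights, Value (V), LV partial products, Output.
import numpy as np

head_size, num_tokens = 128, 16              # illustrative sizes
q = np.random.randn(head_size)               # Query for the current token
K = np.random.randn(num_tokens, head_size)   # Keys of all context tokens
V = np.random.randn(num_tokens, head_size)   # Values of all context tokens

qk = K @ q / np.sqrt(head_size)              # QK: one logit per context token
weights = np.exp(qk - qk.max())              # Softmax (numerically stabilized)
weights /= weights.sum()

lv = weights[:, None] * V                    # LV: per-token weighted values
output = lv.sum(axis=0)                      # Output: vector of length head_size
```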
Multi-Modality
Guides
Module Contents
For Developers
Contributing to vLLM
License
Developing
Testing
Contribution Guidelines
Issues
Pull Requests & Code Reviews
Thank You
Profiling vLLM
Example commands and usage:
Dockerfile
Indices and tables
Index
Module Index