Welcome to vLLM!

Easy, fast, and cheap LLM serving for everyone


vLLM is a fast and easy-to-use library for LLM inference and serving.
vLLM is fast with:
State-of-the-art serving throughput
Efficient management of attention key and value memory with PagedAttention
Continuous batching of incoming requests
Fast model execution with CUDA/HIP graph
Quantization: GPTQ, AWQ, INT4, INT8, and FP8
Optimized CUDA kernels, including integration with FlashAttention and FlashInfer.
Speculative decoding
Chunked prefill
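
Several of the performance features above are plain engine arguments rather than separate APIs. A minimal configuration sketch, assuming an AWQ-quantized checkpoint; the model path below is a hypothetical placeholder and the flag values are illustrative, not tuned:

    from vllm import LLM

    llm = LLM(
        model="path/to/an-awq-quantized-model",  # hypothetical placeholder checkpoint
        quantization="awq",            # quantization scheme baked into that checkpoint
        enable_chunked_prefill=True,   # chunked prefill, as listed above
        gpu_memory_utilization=0.90,   # fraction of GPU memory given to weights + KV cache
    )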
vLLM is flexible and easy to use with:
Seamless integration with popular HuggingFace models
High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more (see the sketch after this list)

Tensor parallelism and pipeline parallelism support for distributed inference


Streaming outputs
OpenAI-compatible API server
Support for NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs, Gaudi® accelerators and GPUs, PowerPC CPUs, TPUs, and AWS Trainium and Inferentia accelerators.
Prefix caching support
Multi-LoRA support
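
To make "easy to use" concrete, here is a minimal offline-inference sketch built on the LLM class and SamplingParams documented below; facebook/opt-125m is just a small example model and the sampling values are illustrative:

    from vllm import LLM, SamplingParams

    # Prompts to complete, plus one sampling configuration shared by all of them.
    prompts = ["Hello, my name is", "The capital of France is"]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    # Any supported HuggingFace model ID works here.
    llm = LLM(model="facebook/opt-125m")

    # Generate completions for the whole batch in one call.
    outputs = llm.generate(prompts, sampling_params)
    for output in outputs:
        print(output.prompt, "->", output.outputs[0].text)

Parallel sampling from the list above is reached through the same object, for example via the n field of SamplingParams.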
For more information, check out the following:
vLLM announcing blog post (intro to PagedAttention)
vLLM paper (SOSP 2023)
How continuous batching enables 23x throughput in LLM inference while reducing p50 latency by Cade Daniel et al.
vLLM Meetups.
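
The OpenAI-compatible API server listed among the features above can be driven with the standard openai Python client. A minimal client-side sketch, assuming a server is already running locally on the default port 8000 (for example via vllm serve) and that the example model name below matches the one being served:

    from openai import OpenAI

    # vLLM's server speaks the OpenAI API, so only base_url changes;
    # no API key is enforced by default, so any placeholder string works.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    completion = client.chat.completions.create(
        model="Qwen/Qwen2.5-1.5B-Instruct",  # example only; use the model the server was started with
        messages=[{"role": "user", "content": "Summarize what vLLM does in one sentence."}],
    )
    print(completion.choices[0].message.content)

Streaming outputs, also listed above, use the same call with stream=True.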

Documentation
Getting Started
Installation
Installation with ROCm
Installation with OpenVINO
Installation with CPU
Installation with Intel® Gaudi® AI Accelerators
Installation for ARM CPUs
Installation with Neuron
Installation with TPU
Installation with XPU
Quickstart

Debugging Tips
Examples
Serving
OpenAI Compatible Server
Deploying with Docker
Deploying with Kubernetes
Deploying with Nginx Loadbalancer


Distributed Inference and Serving
Production Metrics
Environment Variables
Usage Stats Collection
Integrations
Loading Models with CoreWeave’s Tensorizer
Compatibility Matrix
Frequently Asked Questions
Models
Supported Models
Model Support Policy
Adding a New Model
Enabling Multimodal Inputs
Engine Arguments
Using LoRA adapters
Using VLMs
Structured Outputs
Speculative decoding in vLLM
Performance and Tuning
Quantization
Supported Hardware for Quantization Kernels
AutoAWQ
BitsAndBytes
GGUF
INT8 W8A8
FP8 W8A8
FP8 E5M2 KV Cache
FP8 E4M3 KV Cache
Automatic Prefix Caching

Introduction
Implementation
Performance
Benchmark Suites
Community
vLLM Meetups
Sponsors
API Documentation
Sampling Parameters
SamplingParams

Pooling Parameters
PoolingParams

Offline Inference
LLM Class
LLM Inputs
vLLM Engine
LLMEngine
AsyncLLMEngine
Design
Architecture Overview
Entrypoints
LLM Engine
Worker
Model Runner
Model
Class Hierarchy
Integration with HuggingFace
vLLM’s Plugin System
How Plugins Work in vLLM
How vLLM Discovers Plugins
What Can Plugins Do?

Guidelines for Writing Plugins


Compatibility Guarantee
Input Processing
Guides
Module Contents
vLLM Paged Attention
Inputs
Concepts
Query
Key
QK
Softmax
Value
LV
Output
Multi-Modality
Guides
Module Contents
For Developers
Contributing to vLLM
License
Developing
Testing
Contribution Guidelines
Issues
Pull Requests & Code Reviews
Thank You
Profiling vLLM
Example commands and usage:
Dockerfile
Index
Module Index
