Welcome to vLLM!

Easy, fast, and cheap LLM serving for everyone


vLLM is a fast and easy-to-use library for LLM inference and serving.
vLLM is fast with:
State-of-the-art serving throughput
Efficient management of attention key and value memory with PagedAttention
Continuous batching of incoming requests
Fast model execution with CUDA/HIP graph
Quantization: GPTQ, AWQ, INT4, INT8, and FP8
Optimized CUDA kernels, including integration with FlashAttention and FlashInfer.
Speculative decoding
Chunked prefill
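
Several of the performance features above are plain engine arguments rather than separate APIs. A minimal configuration sketch, assuming an AWQ-quantized checkpoint; the model path below is a hypothetical placeholder and the flag values are illustrative, not tuned:

    from vllm import LLM

    llm = LLM(
        model="path/to/an-awq-quantized-model",  # hypothetical placeholder checkpoint
        quantization="awq",            # quantization scheme baked into that checkpoint
        enable_chunked_prefill=True,   # chunked prefill, as listed above
        gpu_memory_utilization=0.90,   # fraction of GPU memory given to weights + KV cache
    )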
vLLM is flexible and easy to use with:
Seamless integration with popular HuggingFace models
High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more (see the sketch after this list)

Tensor parallelism and pipeline parallelism support for distributed inference


Streaming outputs
OpenAI-compatible API server
Support for NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs, Gaudi® accelerators and GPUs, PowerPC CPUs, TPUs, and AWS Trainium and Inferentia accelerators.
Prefix caching support
Multi-LoRA support
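
To make "easy to use" concrete, here is a minimal offline-inference sketch built on the LLM class and SamplingParams documented below; facebook/opt-125m is just a small example model and the sampling values are illustrative:

    from vllm import LLM, SamplingParams

    # Prompts to complete, plus one sampling configuration shared by all of them.
    prompts = ["Hello, my name is", "The capital of France is"]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    # Any supported HuggingFace model ID works here.
    llm = LLM(model="facebook/opt-125m")

    # Generate completions for the whole batch in one call.
    outputs = llm.generate(prompts, sampling_params)
    for output in outputs:
        print(output.prompt, "->", output.outputs[0].text)

Parallel sampling from the list above is reached through the same object, for example via the n field of SamplingParams.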
For more information, check out the following:
vLLM announcing blog post (intro to PagedAttention)
vLLM paper (SOSP 2023)
How continuous batching enables 23x throughput in LLM inference while reducing p50 latency by Cade Daniel et al.
vLLM Meetups.
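
The OpenAI-compatible API server listed among the features above can be driven with the standard openai Python client. A minimal client-side sketch, assuming a server is already running locally on the default port 8000 (for example via vllm serve) and that the example model name below matches the one being served:

    from openai import OpenAI

    # vLLM's server speaks the OpenAI API, so only base_url changes;
    # no API key is enforced by default, so any placeholder string works.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    completion = client.chat.completions.create(
        model="Qwen/Qwen2.5-1.5B-Instruct",  # example only; use the model the server was started with
        messages=[{"role": "user", "content": "Summarize what vLLM does in one sentence."}],
    )
    print(completion.choices[0].message.content)

Streaming outputs, also listed above, use the same call with stream=True.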

Documentation
Getting Started
Installation
Installation with ROCm
Installation with OpenVINO
Installation with CPU
Installation with Intel® Gaudi® AI Accelerators
Installation for ARM CPUs
Installation with Neuron
Installation with TPU
Installation with XPU
Quickstart

Debugging Tips
Examples
Serving
OpenAI Compatible Server
Deploying with Docker
Deploying with Kubernetes
Deploying with Nginx Loadbalancer


Distributed Inference and Serving
Production Metrics
Environment Variables
Usage Stats Collection
Integrations
Loading Models with CoreWeave’s Tensorizer
Compatibility Matrix
Frequently Asked Questions
Models
Supported Models
Model Support Policy
Adding a New Model
Enabling Multimodal Inputs
Engine Arguments
Using LoRA adapters
Using VLMs
Structured Outputs
Speculative decoding in vLLM
Performance and Tuning
Quantization
Supported Hardware for Quantization Kernels
AutoAWQ
BitsAndBytes
GGUF
INT8 W8A8
FP8 W8A8
FP8 E5M2 KV Cache
FP8 E4M3 KV Cache
Automatic Prefix Caching

Introduction
Implementation
Performance
Benchmark Suites
Community
vLLM Meetups
Sponsors
API Documentation
Sampling Parameters
SamplingParams

Pooling Parameters
PoolingParams

Offline Inference
LLM Class
LLM Inputs
vLLM Engine
LLMEngine
AsyncLLMEngine
Design
Architecture Overview
Entrypoints
LLM Engine
Worker
Model Runner
Model
Class Hierarchy
Integration with HuggingFace
vLLM’s Plugin System
How Plugins Work in vLLM
How vLLM Discovers Plugins
What Can Plugins Do?

Guidelines for Writing Plugins


Compatibility Guarantee
Input Processing
Guides
Module Contents
vLLM Paged Attention
Inputs
Concepts
Query
Key
QK
Softmax
Value
LV
Output
Multi-Modality
Guides
Module Contents
For Developers
Contributing to vLLM
License
Developing
Testing
Contribution Guidelines
Issues
Pull Requests & Code Reviews
Thank You
Profiling vLLM
Example commands and usage:
Dockerfile
Index
Module Index
