

Local LLM Inference and Fine-Tuning

Feraidoon Mehri

Sharif University of Technology

September 19, 2023


Outline

1 Models

2 Quantization

3 llama.cpp

4 Multi-User Inference

5 Fine-Tuning


Good Models Available


WizardCoder

WizardCoder: Empowering Code Large Language Models with Evol-Instruct
Code Large Language Models (Code LLMs), such as StarCoder,
have demonstrated exceptional performance in code-related tasks.
However, most existing models are solely pre-trained on extensive
raw code data without instruction fine-tuning. In this paper, we
introduce WizardCoder, which empowers Code LLMs with complex
instruction fine-tuning, by adapting the Evol-Instruct method to the
domain of code. Through comprehensive experiments on four
prominent code generation benchmarks, namely HumanEval,
HumanEval+, MBPP, and DS-1000, we unveil the exceptional
capabilities of our model.


Perplexity on Wikitext as a Function of Model Size

The various circles on the graph show the perplexity of different quantization mixes.
The different colors indicate the LLaMA variant (sizes) used. The solid squares in the
corresponding color represent (model size, perplexity) for the original fp16 model. The
dashed lines are added for convenience to allow for a better judgement of how closely
the quantized models approach the fp16 perplexity.
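
For reference, the perplexity reported here is standard token-level perplexity on the Wikitext test set: the exponential of the average negative log-likelihood the model assigns to each token (lower is better). As a formula (the standard definition, not something stated on the slide):

\mathrm{PPL}(x_{1:N}) = \exp\left( -\frac{1}{N} \sum_{i=1}^{N} \log p_\theta(x_i \mid x_{<i}) \right)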

Relative Quantization Error


Quantization Results

Model   Measure                   Q2_K     Q3_K_M   Q4_K_S   Q5_K_S   Q6_K     F16
7B      perplexity                6.7764   6.1503   6.0215   5.9419   5.9110   5.9066
7B      file size                 2.67G    3.06G    3.56G    4.33G    5.15G    13.0G
7B      ms/tok @ 4th, M2 Max      56       69       50       70       75       116
7B      ms/tok @ 8th, M2 Max      36       36       36       44       51       111
7B      ms/tok @ 4th, RTX-4080    15.5     17.0     15.5     16.7     18.3     60
7B      ms/tok @ 4th, Ryzen       57       61       68       81       93       214
13B     perplexity                5.8545   5.4498   5.3404   5.2785   5.2568   5.2543
13B     file size                 5.13G    5.88G    6.80G    8.36G    9.95G    25.0G
13B     ms/tok @ 4th, M2 Max      103      148      95       132      142      216
13B     ms/tok @ 8th, M2 Max      67       77       68       81       95       213
13B     ms/tok @ 4th, RTX-4080    25.3     29.3     26.2     28.6     30.0     -
13B     ms/tok @ 4th, Ryzen       109      118      130      156      180      414

("@ 4th" / "@ 8th" means generation with 4 or 8 CPU threads.)
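
A quick way to read this table is to convert the file sizes into bits per weight. A minimal sketch, assuming parameter counts of roughly 6.74B for LLaMA-7B and 13.0B for LLaMA-13B and treating the reported sizes as GiB (both assumptions, not stated on the slide):

# Approximate bits per weight implied by the file sizes in the table above.
params = {"7B": 6.74e9, "13B": 13.0e9}          # assumed parameter counts
sizes_gib = {
    ("7B", "Q4_K_S"): 3.56, ("7B", "F16"): 13.0,
    ("13B", "Q4_K_S"): 6.80, ("13B", "F16"): 25.0,
}
for (model, quant), gib in sizes_gib.items():
    bpw = gib * 2**30 * 8 / params[model]        # bytes -> bits per parameter
    print(f"{model} {quant}: ~{bpw:.1f} bits/weight")

Under these assumptions, Q4_K_S comes out to roughly 4.5 bits per weight versus about 16 for F16, which is why the quantized files are roughly a quarter of the original size.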


Time Per Token for WizardCoder-Python-34B


Model: wizardcoder-python-34b
Quant method: Q4_K_M
Size: 20.22 GB
Max RAM required: 22.72 GB
Description: medium, balanced quality - recommended
ms/token, CPU (32 cores): 316.71
ms/token, CPU (1 core): 3747.27
ms/token, GPU: 43.50

Note: the above RAM figures assume no GPU offloading. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead.

Time per token measured on:
Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz
Quadro RTX 6000 24GB, compute capability 7.5
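
These ms/token figures translate directly into tokens per second and into wall-clock time for a response of a given length; a small sketch using the numbers above:

# Convert the measured ms/token into tokens/s and the time for a 500-token answer.
measurements = {"CPU, 32 cores": 316.71, "CPU, 1 core": 3747.27, "GPU": 43.50}
for setup, ms in measurements.items():
    print(f"{setup}: {1000 / ms:.1f} tok/s, {500 * ms / 1000:.0f} s per 500 tokens")

So the GPU setup generates about 23 tokens/s, the 32-core CPU about 3 tokens/s, and a single core is far below interactive speed.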


llama.cpp
LLM inference in pure C/C++
Plain C/C++ implementation without dependencies
2-bit, 3-bit, 4-bit, 5-bit, 6-bit and 8-bit integer quantization
support
Mixed F16 / F32 precision
CUDA, Metal and OpenCL GPU backend support
Apple silicon first-class citizen - optimized via ARM NEON,
Accelerate and Metal frameworks
AVX, AVX2 and AVX512 support for x86 architectures


abetlen/llama-cpp-python: Python bindings for llama.cpp

Install
Auto-detect hardware acceleration
pip install "llama-cpp-python[server]"
CUDA
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install "llama-cpp-python[server]"

Start an OpenAI-compatible API server


python -m llama_cpp.server --model models/7B/llama-model.gguf --n_gpu_layers -1

Navigate to http://localhost:8000/docs to see the OpenAPI documentation.
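
Because the server speaks the OpenAI API, any OpenAI-compatible client can talk to it. A minimal sketch with plain requests against the server started above (port 8000 is the default; the prompt is illustrative):

# Query the OpenAI-compatible chat endpoint exposed by llama_cpp.server.
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "Name the planets in the solar system."}],
        "max_tokens": 256,
    },
    timeout=600,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])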


llama-cpp-python: Python API

def print_chat_streaming(output):
    """Print a streamed llama-cpp-python response as it arrives and return the full text."""
    text = ""
    for r in output:
        text_current = None
        choice = r["choices"][0]
        if "delta" in choice:
            # Chat completion chunks carry a "delta" with optional role/content.
            delta = choice["delta"]
            if "role" in delta:
                print(f"\n{delta['role']}: ", end="")
            if "content" in delta:
                text_current = f"{delta['content']}"
        elif "text" in choice:
            # Plain completion chunks carry the new text directly.
            text_current = f"{choice['text']}"
        if text_current:
            text += text_current
            print(f"{text_current}", end="")

    text = text.rstrip()
    print("\n")
    return text


llama-cpp-python: Python API

from llama_cpp import Llama, ChatCompletionMessage

# Load a quantized GGUF model; n_gpu_layers=0 keeps everything on the CPU
# (set it to -1 to offload all layers to the GPU).
llm = Llama(
    model_path="/models/wizardcoder-python-34b-v1.0.Q4_K_M.gguf",
    n_gpu_layers=0,
    n_threads=48,
)

# Stream a chat completion and print it with the helper above.
output = llm.create_chat_completion(
    messages=[
        ChatCompletionMessage(
            role="user", content="Name the planets in the solar system?"
        ),
    ],
    stream=True,
)
print_chat_streaming(output)
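
The same helper also covers plain (non-chat) completions, since it falls back to the chunk's "text" field. A minimal sketch reusing the llm object above (prompt and parameters are illustrative):

# Streamed plain completion; each chunk carries a "text" field that
# print_chat_streaming prints as it arrives.
output = llm.create_completion(
    "Q: Name the planets in the solar system. A:",
    max_tokens=128,
    stop=["Q:"],
    stream=True,
)
print_chat_streaming(output)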


vLLM
vLLM is a fast and easy-to-use library for LLM inference and serving; a minimal usage sketch follows the feature list below.
Continuous batching of incoming requests
Streaming outputs
OpenAI-compatible API server
State-of-the-art serving throughput
Efficient management of attention key and value memory with
PagedAttention
Optimized CUDA kernels
High-throughput serving with various decoding algorithms,
including parallel sampling, beam search, and more
Tensor parallelism support for distributed inference
Seamless integration with popular Hugging Face models
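
For comparison with the llama-cpp-python example above, a minimal sketch of vLLM's offline Python API (the model ID is illustrative, not from the slides):

# Offline batched inference with vLLM (model ID is illustrative).
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")
sampling = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

for out in llm.generate(["Name the planets in the solar system."], sampling):
    print(out.outputs[0].text)

vLLM also ships an OpenAI-compatible server (python -m vllm.entrypoints.openai.api_server --model ...), so the requests-based client shown earlier works against it unchanged.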

vLLM vs. HuggingFace TGI


According to vLLM authors, vLLM outperforms Hugging Face
Transformers (HF) by up to 24x and Text Generation Inference
(TGI) by up to 3.5x, in terms of throughput.


HuggingFace TGI
TGI: a Rust, Python and gRPC server for text generation inference, used in production at HuggingFace to power Hugging Chat, the Inference API and Inference Endpoints. A minimal client sketch follows the feature list below.
Serve the most popular Large Language Models with a simple
launcher
Tensor Parallelism for faster inference on multiple GPUs
Token streaming using Server-Sent Events (SSE)
Continuous batching of incoming requests for increased total
throughput
Optimized transformers code for inference using flash-attention
and Paged Attention on the most popular architectures
Quantization with bitsandbytes and GPT-Q
Logits warper (temperature scaling, top-p, top-k, repetition
penalty, more details see transformers.LogitsProcessor)
Stop sequences
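
A minimal client sketch against a running TGI instance using its REST /generate endpoint (host, port, and parameter values are assumptions; TGI is typically launched via its official Docker image):

# Query a running TGI server's /generate endpoint (host/port assumed).
import requests

resp = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "Explain what quantization means for LLM inference.",
        "parameters": {"max_new_tokens": 128, "temperature": 0.7},
    },
    timeout=600,
)
resp.raise_for_status()
print(resp.json()["generated_text"])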

HuggingFace TGI


Summary

A server with 32 CPU cores and roughly 25GB RAM can do inference in almost real-time for a single user.
A GPU server with 24GB VRAM is faster than real-time.
Using an efficient multi-user API server, it might be possible to
service multiple users using a single GPU server.
Ideally, HPC should sell token credits for a big LLM through an
authenticated API server.
This is useful for research and general usage.
Free online alternatives such as Claude V2 and ChatGPT 3.5
are still faster and better than the local models available.


Kaggle is Unviable

Kaggle only provides 20 GB of disk space, which is not enough to download the model weights.


Axolotl
Axolotl is a tool designed to streamline the fine-tuning of various AI
models, offering support for multiple configurations and
architectures.
Train various Huggingface models such as llama, pythia,
falcon, mpt
Supports fullfinetune, lora, qlora, relora, and gptq
Customize configurations using a simple yaml file or CLI
overwrite
Load different dataset formats, use custom formats, or bring
your own tokenized datasets
Integrated with xformers, flash attention, RoPE scaling, and
multipacking
Works with single GPU or multiple GPUs via FSDP or
Deepspeed
Easily run with Docker locally or on the cloud

Ludwig

Ludwig is a low-code framework for building custom AI models like LLMs and other deep neural networks; a minimal Python sketch follows the feature list below.
Build custom models with ease: a declarative YAML
configuration file is all you need to train a state-of-the-art
LLM on your data. Support for multi-task and multi-modality
learning.
Optimized for scale and efficiency: automatic batch size
selection, distributed training (DDP, DeepSpeed), parameter
efficient fine-tuning (PEFT), 4-bit quantization (QLoRA), and
larger-than-memory datasets.
Expert level control: retain full control of your models down to
the activation functions. Support for hyperparameter
optimization, explainability, and rich metric visualizations.
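
A sketch of what a declarative fine-tune looks like through Ludwig's Python API. The config keys follow Ludwig's LLM schema as I understand it and are assumptions to check against the Ludwig docs; the base model and the tiny inline dataset are purely illustrative:

# Declarative LLM fine-tuning via Ludwig's Python API (config keys are
# assumptions based on Ludwig's LLM schema; verify against the docs).
import pandas as pd
from ludwig.api import LudwigModel

config = {
    "model_type": "llm",
    "base_model": "meta-llama/Llama-2-7b-hf",        # illustrative model ID
    "adapter": {"type": "lora"},                      # PEFT adapter
    "quantization": {"bits": 4},                      # QLoRA-style 4-bit loading
    "input_features": [{"name": "instruction", "type": "text"}],
    "output_features": [{"name": "output", "type": "text"}],
    "trainer": {"type": "finetune", "epochs": 1},
}

df = pd.DataFrame({
    "instruction": ["Name the planets in the solar system."],
    "output": ["Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune."],
})

model = LudwigModel(config)
model.train(dataset=df)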


Ludwig: An Example Notebook

QLoRA fine-tune of Llama-2-7B on a combined math and physics question-answer dataset, using the open-source Ludwig library on Colab, which provides a single T4 GPU with 16 GB of VRAM
Google Colaboratory
Dataset links:
Maths: https://lnkd.in/gQKYGeST
Physics: https://lnkd.in/gTjmdESv


OpenPipe

Fine-tune your own Llama 2 to replace GPT-3.5/4 | Hacker News
OpenPipe/examples/classify-recipes at main ·
OpenPipe/OpenPipe


Replicate Blog Posts

How to use Alpaca-LoRA to fine-tune a model like ChatGPT - Replicate
Fine-tune Llama 2 on Replicate - Replicate
Make any large language model a better poet - Replicate
Fine-tune LLaMA to speak like Homer Simpson - Replicate


lxe/simple-llm-finetuner

lxe/simple-llm-finetuner: Simple UI for LLM Model Finetuning


Show HN: Finetune LLaMA-7B on commodity GPUs using
your own text | Hacker News


Lightning AI Blog

How To Finetune GPT Like Large Language Models on a Custom Dataset - Lightning AI
How to Finetune GPT-Like Large Language Models on a
Custom Dataset | Hacker News


Homework Proposals

Normal fine-tuning and even PEFT are (half-heartedly) taught in many courses in the department.
Methods and tools for efficient training on a single GPU
A focus on quantized models and models needing optimized C++ runtimes is recommended for the LLM course.
This expertise is lacking in the teaching team.
The best compromise might be to use the mentioned
fine-tuning libraries.
We can also ask the students to quantize a model like BERT and then fine-tune it using various PEFT methods (implemented from scratch); see the LoRA sketch after this list.
Not as interesting as fine-tuning GPT-2
But easier to evaluate
It might not be possible to use such Python PEFT
implementations for real LLMs.
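
As a concrete illustration of the "PEFT implemented from scratch" idea above, here is a minimal LoRA wrapper around a frozen linear layer in plain PyTorch; it sketches the general technique and is not tied to any particular assignment or library:

# Minimal from-scratch LoRA: y = W0 x + (alpha / r) * B(A(x)), with W0 frozen.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # freeze the pretrained weights
            p.requires_grad = False
        self.lora_A = nn.Linear(base.in_features, r, bias=False)
        self.lora_B = nn.Linear(r, base.out_features, bias=False)
        nn.init.normal_(self.lora_A.weight, std=0.01)
        nn.init.zeros_(self.lora_B.weight) # the low-rank update starts at zero
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * self.lora_B(self.lora_A(x))

# Wrap one projection (e.g. of a BERT-like encoder) and train only the adapters.
layer = LoRALinear(nn.Linear(768, 768))
x = torch.randn(2, 768)
print(layer(x).shape)  # torch.Size([2, 768])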

