Local LLM Inference and Fine-Tuning
Feraidoon Mehri
Outline
1 Models
2 Quantization
3 llama.cpp
4 Multi-User Inference
5 Fine-Tuning
WizardCoder
Figure: perplexity of the quantization mixes. The circles show the perplexity of the different quantization mixes; the colors indicate the LLaMA variant (size) used. The solid squares in the corresponding color mark (model size, perplexity) for the original fp16 model, and the dashed lines are added for convenience, to allow a better judgement of how closely the quantized models approach the fp16 perplexity.
Quantization Results
13B model, ms/tok @ 4 threads on Ryzen: 109 / 118 / 130 / 156 / 180 / 414 across the benchmarked quantization formats.
llama.cpp
LLM inference in pure C/C++
Plain C/C++ implementation without dependencies
2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantization support
Mixed F16 / F32 precision
CUDA, Metal and OpenCL GPU backend support
Apple silicon first-class citizen - optimized via ARM NEON,
Accelerate and Metal frameworks
AVX, AVX2 and AVX512 support for x86 architectures
Install
Auto-detect hardware acceleration
pip install "llama-cpp-python[server]"
CUDA
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install "llama-cpp-python[server]"
def print_chat_streaming(output):
    """Print a streamed response chunk by chunk and return the full text."""
    text = ""
    for r in output:
        text_current = None
        choice = r["choices"][0]
        if "delta" in choice:  # chat-completion stream
            delta = choice["delta"]
            if "role" in delta:
                print(f"\n{delta['role']}: ", end="")
            if "content" in delta:
                text_current = delta["content"]
        elif "text" in choice:  # plain completion stream
            text_current = choice["text"]
        if text_current:
            text += text_current
            print(text_current, end="")
    text = text.rstrip()
    print("\n")
    return text
# Assumed imports: in some llama-cpp-python versions ChatCompletionMessage
# lives in llama_cpp.llama_types; a plain dict with "role"/"content" also works.
from llama_cpp import Llama, ChatCompletionMessage

llm = Llama(
    model_path="/models/wizardcoder-python-34b-v1.0.Q4_K_M.gguf",
    n_gpu_layers=0,  # 0 = CPU only; raise to offload layers to the GPU
    n_threads=48,
)

output = llm.create_chat_completion(
    messages=[
        ChatCompletionMessage(
            role="user", content="Name the planets in the solar system?"
        ),
    ],
    stream=True,
)
print_chat_streaming(output)
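Since the [server] extra was installed above, the same GGUF model can also be exposed through llama-cpp-python's bundled OpenAI-compatible HTTP server (started with python -m llama_cpp.server --model <path>). A minimal client sketch, assuming the server is listening on its default port 8000; the endpoint and response shape follow the OpenAI chat-completions format:

import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "messages": [
            {"role": "user", "content": "Name the planets in the solar system?"}
        ]
    },
    timeout=600,
)
print(resp.json()["choices"][0]["message"]["content"])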
vLLM
vLLM is a fast and easy-to-use library for LLM inference and
serving; a minimal usage sketch follows the feature list.
Continuous batching of incoming requests
Streaming outputs
OpenAI-compatible API server
State-of-the-art serving throughput
Efficient management of attention key and value memory with
PagedAttention
Optimized CUDA kernels
High-throughput serving with various decoding algorithms,
including parallel sampling, beam search, and more
Tensor parallelism support for distributed inference
Seamless integration with popular Hugging Face models
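A minimal offline-inference sketch with vLLM, assuming it is installed with CUDA support; the model name and tensor_parallel_size below are illustrative choices, not taken from the slides:

from vllm import LLM, SamplingParams

# Load a Hugging Face model; tensor_parallel_size shards it across 2 GPUs.
llm = LLM(model="WizardLM/WizardCoder-Python-34B-V1.0", tensor_parallel_size=2)
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

# generate() batches the prompts and returns one RequestOutput per prompt.
outputs = llm.generate(["Name the planets in the solar system?"], params)
for out in outputs:
    print(out.outputs[0].text)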
HuggingFace TGI
TGI: a Rust, Python, and gRPC server for text-generation
inference, used in production at Hugging Face to power Hugging
Chat, the Inference API, and Inference Endpoints. A small client
sketch follows the feature list.
Serve the most popular Large Language Models with a simple
launcher
Tensor Parallelism for faster inference on multiple GPUs
Token streaming using Server-Sent Events (SSE)
Continuous batching of incoming requests for increased total
throughput
Optimized transformers code for inference using flash-attention
and Paged Attention on the most popular architectures
Quantization with bitsandbytes and GPTQ
Logits warper (temperature scaling, top-p, top-k, repetition
penalty, more details see transformers.LogitsProcessor)
Stop sequences
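A hedged client sketch against TGI's REST API, assuming a server is already running and reachable on localhost:8080 (the port mapping used in TGI's Docker examples); the generation parameters are illustrative:

import requests

resp = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "Name the planets in the solar system?",
        "parameters": {"max_new_tokens": 128, "temperature": 0.7},
    },
    timeout=600,
)
print(resp.json()["generated_text"])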
Summary
Kaggle is Unviable
Axolotl
Axolotl is a tool designed to streamline the fine-tuning of various AI
models, offering support for multiple configurations and
architectures. A minimal config sketch follows the feature list.
Train various Huggingface models such as llama, pythia,
falcon, mpt
Supports fullfinetune, lora, qlora, relora, and gptq
Customize configurations using a simple yaml file or CLI
overwrite
Load different dataset formats, use custom formats, or bring
your own tokenized datasets
Integrated with xformers, flash attention, rope scaling, and multipacking
Works with single GPU or multiple GPUs via FSDP or
Deepspeed
Easily run with Docker locally or on the cloud
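A hedged sketch of a minimal Axolotl QLoRA run driven from Python: write a config file, then launch training with accelerate. The YAML keys mirror the examples shipped with Axolotl but are illustrative rather than a complete, validated config, and the launch command has varied across Axolotl versions.

import pathlib
import subprocess

config = """\
base_model: NousResearch/Llama-2-7b-hf
load_in_4bit: true
adapter: qlora
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true
sequence_len: 2048
datasets:
  - path: mhenrichsen/alpaca_2k_test
    type: alpaca
micro_batch_size: 2
gradient_accumulation_steps: 4
num_epochs: 1
learning_rate: 0.0002
optimizer: paged_adamw_32bit
lr_scheduler: cosine
output_dir: ./qlora-out
"""
pathlib.Path("qlora.yml").write_text(config)

# Equivalent to: accelerate launch -m axolotl.cli.train qlora.yml
subprocess.run(
    ["accelerate", "launch", "-m", "axolotl.cli.train", "qlora.yml"],
    check=True,
)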
Ludwig
OpenPipe
lxe/simple-llm-finetuner
Lightning AI Blog
Homework Proposals