Local LLM Inference and Fine-Tuning
Feraidoon Mehri
Outline
1 Models
2 Quantization
3 llama.cpp
4 Multi-User Inference
5 Fine-Tuning
WizardCoder
Figure: perplexity of the quantization mixes. The circles show the perplexity of the different quantization mixes; the colors indicate the LLaMA variant (size) used. The solid squares in the corresponding color mark (model size, perplexity) for the original fp16 model, and the dashed lines are added for convenience, to allow a better judgement of how closely the quantized models approach the fp16 perplexity.
Quantization Results
13B model, ms/tok @ 4 threads on Ryzen: 109 / 118 / 130 / 156 / 180 / 414 across the benchmarked quantization formats.
llama.cpp
LLM inference in pure C/C++
Plain C/C++ implementation without dependencies
2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantization support
Mixed F16 / F32 precision
CUDA, Metal and OpenCL GPU backend support
Apple silicon first-class citizen - optimized via ARM NEON,
Accelerate and Metal frameworks
AVX, AVX2 and AVX512 support for x86 architectures
Install
Auto-detect hardware acceleration
pip install "llama-cpp-python[server]"
CUDA
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install "llama-cpp-python[server]"
def print_chat_streaming(output):
    """Print a streamed response chunk by chunk and return the full text."""
    text = ""
    for r in output:
        text_current = None
        choice = r["choices"][0]
        if "delta" in choice:  # chat-completion stream
            delta = choice["delta"]
            if "role" in delta:
                print(f"\n{delta['role']}: ", end="")
            if "content" in delta:
                text_current = delta["content"]
        elif "text" in choice:  # plain completion stream
            text_current = choice["text"]
        if text_current:
            text += text_current
            print(text_current, end="")
    text = text.rstrip()
    print("\n")
    return text
# Assumed imports: in some llama-cpp-python versions ChatCompletionMessage
# lives in llama_cpp.llama_types; a plain dict with "role"/"content" also works.
from llama_cpp import Llama, ChatCompletionMessage

llm = Llama(
    model_path="/models/wizardcoder-python-34b-v1.0.Q4_K_M.gguf",
    n_gpu_layers=0,  # 0 = CPU only; raise to offload layers to the GPU
    n_threads=48,
)

output = llm.create_chat_completion(
    messages=[
        ChatCompletionMessage(
            role="user", content="Name the planets in the solar system?"
        ),
    ],
    stream=True,
)
print_chat_streaming(output)
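Since the [server] extra was installed above, the same GGUF model can also be exposed through llama-cpp-python's bundled OpenAI-compatible HTTP server (started with python -m llama_cpp.server --model <path>). A minimal client sketch, assuming the server is listening on its default port 8000; the endpoint and response shape follow the OpenAI chat-completions format:

import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "messages": [
            {"role": "user", "content": "Name the planets in the solar system?"}
        ]
    },
    timeout=600,
)
print(resp.json()["choices"][0]["message"]["content"])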
vLLM
vLLM is a fast and easy-to-use library for LLM inference and
serving; a minimal usage sketch follows the feature list.
Continuous batching of incoming requests
Streaming outputs
OpenAI-compatible API server
State-of-the-art serving throughput
Efficient management of attention key and value memory with
PagedAttention
Optimized CUDA kernels
High-throughput serving with various decoding algorithms,
including parallel sampling, beam search, and more
Tensor parallelism support for distributed inference
Seamless integration with popular Hugging Face models
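A minimal offline-inference sketch with vLLM, assuming it is installed with CUDA support; the model name and tensor_parallel_size below are illustrative choices, not taken from the slides:

from vllm import LLM, SamplingParams

# Load a Hugging Face model; tensor_parallel_size shards it across 2 GPUs.
llm = LLM(model="WizardLM/WizardCoder-Python-34B-V1.0", tensor_parallel_size=2)
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

# generate() batches the prompts and returns one RequestOutput per prompt.
outputs = llm.generate(["Name the planets in the solar system?"], params)
for out in outputs:
    print(out.outputs[0].text)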
HuggingFace TGI
TGI: a Rust, Python, and gRPC server for text-generation
inference, used in production at Hugging Face to power Hugging
Chat, the Inference API, and Inference Endpoints. A small client
sketch follows the feature list.
Serve the most popular Large Language Models with a simple
launcher
Tensor Parallelism for faster inference on multiple GPUs
Token streaming using Server-Sent Events (SSE)
Continuous batching of incoming requests for increased total
throughput
Optimized transformers code for inference using flash-attention
and Paged Attention on the most popular architectures
Quantization with bitsandbytes and GPTQ
Logits warper (temperature scaling, top-p, top-k, repetition
penalty, more details see transformers.LogitsProcessor)
Stop sequences
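A hedged client sketch against TGI's REST API, assuming a server is already running and reachable on localhost:8080 (the port mapping used in TGI's Docker examples); the generation parameters are illustrative:

import requests

resp = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "Name the planets in the solar system?",
        "parameters": {"max_new_tokens": 128, "temperature": 0.7},
    },
    timeout=600,
)
print(resp.json()["generated_text"])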
Summary
Kaggle is Unviable
Axolotl
Axolotl is a tool designed to streamline the fine-tuning of various AI
models, offering support for multiple configurations and
architectures. A minimal config sketch follows the feature list.
Train various Huggingface models such as llama, pythia,
falcon, mpt
Supports fullfinetune, lora, qlora, relora, and gptq
Customize configurations using a simple yaml file or CLI
overwrite
Load different dataset formats, use custom formats, or bring
your own tokenized datasets
Integrated with xformers, flash attention, rope scaling, and multipacking
Works with single GPU or multiple GPUs via FSDP or
Deepspeed
Easily run with Docker locally or on the cloud
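A hedged sketch of a minimal Axolotl QLoRA run driven from Python: write a config file, then launch training with accelerate. The YAML keys mirror the examples shipped with Axolotl but are illustrative rather than a complete, validated config, and the launch command has varied across Axolotl versions.

import pathlib
import subprocess

config = """\
base_model: NousResearch/Llama-2-7b-hf
load_in_4bit: true
adapter: qlora
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true
sequence_len: 2048
datasets:
  - path: mhenrichsen/alpaca_2k_test
    type: alpaca
micro_batch_size: 2
gradient_accumulation_steps: 4
num_epochs: 1
learning_rate: 0.0002
optimizer: paged_adamw_32bit
lr_scheduler: cosine
output_dir: ./qlora-out
"""
pathlib.Path("qlora.yml").write_text(config)

# Equivalent to: accelerate launch -m axolotl.cli.train qlora.yml
subprocess.run(
    ["accelerate", "launch", "-m", "axolotl.cli.train", "qlora.yml"],
    check=True,
)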
Ludwig
OpenPipe
lxe/simple-llm-finetuner
Lightning AI Blog
Homework Proposals