
Own Your AI - Technical Deck

Friendly reminder:

The choice of Math, ML, and AI topics we could discuss is endless. We have one evening, so we will start with only the ultra-hot open-source topics.
Why all the hype?
Agenda

❑ Practical open-source AI resources - datasets, tools, models
❑ How to start on your PC today
❑ Open AI platform architectures - from on-device to hybrid local/remote
❑ From PoC to pilot to production - Edge-to-Cloud AI platforms
❑ End-to-end performance optimization
❑ Security for AI platforms
❑ Beyond the wrappers, RAG, and prompt engineering - advanced AI systems engineering
❑ Practical use cases
AI/ML did not happen overnight

Prehistoric • 1950s - Machine Translation

Stone Age • 1980s - Knowledge-Based Systems

Bronze Age • 1993-2012 - Statistical Era

Iron Age • 2013-2017 - Special-Purpose Deep Learning ML

Modern Age • 2018-Present - Generative AI, Foundation Models, LLMs


Transformers Era

2017 - Present
The "Attention Is All You Need" paper (2017)
❑ Type of deep neural network
❑ Leverages attention/self-attention, including multi-head attention (see the sketch below)
❑ Expressive: feed-forward
❑ Optimizable: backpropagation, gradient descent
❑ Efficient: highly parallel compute graph
❑ Examples: LLaMA-3, phi-3, GPT-4, Claude-3
❑ Learn it all from Karpathy: https://www.youtube.com/watch?v=zjkBMFhNj_g
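
Since attention is the core mechanism, here is a minimal single-head scaled dot-product attention sketch in NumPy. Names and shapes are illustrative only, not tied to any particular library; real implementations add masking, multi-head splitting, and fused kernels:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d_model); Wq/Wk/Wv: (d_model, d_head)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # pairwise token similarities
    weights = softmax(scores, axis=-1)       # each token attends to every token
    return weights @ V                       # weighted mix of value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                  # 4 tokens, d_model = 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)   # -> (4, 8)
```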
Brief New Age Tech Glossary

❑ Transformer: a general-purpose neural network architecture that models sequences without recurrent connections, prominently used in language processing tasks
❑ Foundation Model: a large-scale model trained on vast amounts of data that can be fine-tuned for a variety of downstream tasks, serving as a base for further specialized models
❑ Large Language Model (LLM): a substantial neural network model trained on extensive textual data to understand and generate human-like text across many languages and contexts
❑ Small Language Model (SLM): a more compact language model designed for efficiency and lower resource consumption while performing natural language processing tasks
❑ Visual Language Model (VLM): a model that combines language and vision processing to understand and generate content related to both text and images
❑ Multimodal Model: an AI model that can process and understand information from different types of data, such as text, images, and audio, simultaneously
❑ RWKV (pronounced "RwaKuv"): a recurrent-style architecture named for its Receptance, Weight, Key, and Value parameters, designed for efficiency and performance in sequence modeling tasks
❑ Mamba/Jamba, Hawk/Griffin, DPO, ORPO, Flash Attention…
Where does open-source AI live?

https://HuggingFace.co - models, data, research papers, AI social network, compute…

https://arXiv.org - research papers

https://Github.com - all the source code in one place

https://github.com/ggerganov/llama.cpp - local AI on your CPU/GPU ("Go, our guys!" - made in Sofia)

https://Discord.com - almost all projects have a channel

https://x.com - social network for emerging AI/ML devs, researchers, companies

https://colab.research.google.com/ - 'free' compute and managed Jupyter notebooks


WTF is Hugging Face?

(base slides by Thomas Wolf)

Hugging Face: The home of open AI/ML

Founded in 2016 • 170 employees
130K+ public datasets
600K+ open-source models on the Hub
30+ libraries
1M+ daily downloads
700K+ daily visitors
300K+ stars on GitHub

When the AI World Stopped

Used everywhere in the AI world

15,000+ startups and enterprises use Hugging Face, alongside open-source contributors, cloud partners, hardware partners, and on-prem partners.


The Global AI/ML Ecosystem of Hugging Face

❑ Cloud platforms - models in production on NVIDIA DGX Cloud, Hugging Face on Azure, Amazon SageMaker, and Google Cloud
❑ Managed inference on AWS, Azure, and GCP - deploy anywhere
❑ Open datasets - 130,000+ datasets on the Hub
❑ Open models - 600,000+ models on the Hub
❑ Libraries - Transformers, Accelerate, Diffusers, Optimum
❑ Hosted ML applications and no-code AutoML
❑ HW-accelerated training & inference

Back to OSS license!


Open vs Closed Models

Open and closed models have different benefits and should be weighed for each use case.

❑ Security: open models can be self-hosted, so data stays in your environment; closed models cannot be self-hosted, and data is sent outside your environment to vendor(s)
❑ Control: with open models the lifecycle is controlled by you; with closed models, updates and changes to performance happen without notice
❑ Customization: open weights (and sometimes open code) let you customize the model for your needs; closed models offer limited ability to customize
❑ Transparency: inspecting code and data provides better auditability and understandability; closed models give no ability to audit or understand performance
❑ Cost: typically lower long-term cost for open models due to smaller model sizes; closed models carry a larger model size and a proprietary premium, often balanced by decreased cost from server-side optimization
❑ Latency: lower latency for open models due to on-premise deployment and smaller model sizes; closed models often have greater latency due to larger model sizes plus API latency
❑ Quality: no single approach is best, and each use case will vary; proprietary models are typically closer to the frontier of performance
Energy/carbon footprint and LLMs

Start by testing existing models on your domain and task(s) of interest. Then ask: do you need to pretrain an LLM?

Most of the time the answer is "no" => focus on efficient fine-tuning and inference:
- Your energy budget will likely be dominated by inference costs
- Select a compute-efficient model: smallest size, quantized, classification models > generative
- Deploy it in an on-prem setup or with a cloud provider in a region with a good energy mix

If the answer is "yes" => focus on efficient pretraining while taking a holistic view of the model life-cycle:
- Train-compute-optimal models (Chinchilla law) are not efficient for inference; train smaller models for longer if you plan to deploy at large scale
- Train in a local cluster/provider with a good energy mix
- Share the model so people can re-use and leverage the compute spent - it's like recycling AI models

Resources:
- Power Hungry Processing: Watts Driving the Cost of AI Deployment? https://arxiv.org/abs/2311.16863
- Language models scale reliably with over-training and on downstream tasks https://arxiv.org/abs/2403.08540
- The regional energy mix (e.g. solar, nuclear, coal, gas) can have a 500x impact on a model's carbon footprint: https://app.electricitymaps.com/
How to start on your PC today

llama.cpp (Made in Sofia)

Started in 2023
55.1K+ stars on GitHub • 7.8K+ forks
664 contributors • 180+ active PRs
50+ other project integrations • 40+ examples
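
To try it immediately, a minimal sketch using the llama-cpp-python bindings (`pip install llama-cpp-python`); the GGUF model path is a placeholder for whatever quantized model you download:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-3-8b-instruct.Q4_K_M.gguf",  # placeholder path
    n_ctx=4096,       # context window size
    n_gpu_layers=-1,  # offload all layers to the GPU when one is available
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain attention in one sentence."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```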
MLX (Apple-owned open source)

Started in 2023
13.8K+ stars on GitHub • 770+ forks
95 contributors • 10+ active PRs
20+ other project integrations • 15+ examples
One-Click Tools
CoreNet: Train an SLM on your Mac

Started in 2024
2K+ stars on GitHub • 84 forks
6+ active PRs
3 other project integrations • 3 examples
Mergekit Evolve

Started in 2023
3.4K+ stars on GitHub • 272 forks
16 contributors • 10+ active PRs
10+ other project integrations • 5 examples
Open AI Platform Architecture

Open AI Platform

❑ Data Lakehouse: your own data on Delta Lake, Spark, Trino
❑ Embeddings model: E5 Mistral Quant; embeddings DB
❑ IaC: OpenTofu; DevSecOps
❑ Model Factory: synthetic data pipeline (DSPy, distilabel)
❑ Playground: DSPy
❑ Orchestration/Routing: DSPy
❑ Guardrails: DSPy
❑ APIs/Plugins: Open CodeInterpreter
❑ Vault: OpenBao
❑ Hybrid identity service: Keycloak
❑ Local platform on computer/edge GW/phone: local in-mem databases (memmap, pgvector), guardrails service (DSPy, AICI), orchestration/router, base model/function-calling agent (Llama-3 8B / Phi-3), app/API/inference on CPU and GPU
❑ Backend inferencing and APIs on GPU, CPU, IPU, NPU: base inferencing backend (Llama-3 70B), function-calling backend (Phi-3), CodeGen backend (Wavecoder Ultra), backend API services (FastAPI), cache (Valkey)
❑ Lifecycle/control plane: agent LLMOps (Rust, Phoenix)

Queries/API calls flow in through the local platform; outputs/API calls are served by the backends.
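
As a hypothetical sketch of the "Backend API Services: FastAPI" box, here is a thin HTTP facade over whatever inference engine the platform wires in; generate() is a stand-in stub, not a real backend call:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="inference-gateway")

class Query(BaseModel):
    prompt: str
    max_tokens: int = 256

def generate(prompt: str, max_tokens: int) -> str:
    # Placeholder: call your inference backend (llama.cpp, vLLM, NIM, ...) here.
    return f"[echo] {prompt[:max_tokens]}"

@app.post("/v1/generate")
def v1_generate(q: Query) -> dict:
    return {"completion": generate(q.prompt, q.max_tokens)}

# Run with: uvicorn main:app --host 0.0.0.0 --port 8000
```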
Open AI Project Team

The same platform architecture as above, with a ConfigGen backend (llama3-70b-orpo-industrial) in place of the CodeGen backend (Wavecoder Ultra).
Production System with NVIDIA/Microsoft

❑ Data Lakehouse: your own data on Microsoft Fabric
❑ Embeddings model: NIM embed-qa-4 Quant; embeddings DB
❑ IaC; DevSecOps
❑ Model Factory: prompt programming (DSPy)
❑ Playground: DSPy
❑ Orchestration/Routing: DSPy
❑ Guardrails: NeMo Guardrails
❑ APIs/Plugins: Zapier
❑ Vault: Azure Key Vault
❑ Hybrid identity service: Entra ID
❑ NIM Edge (local): in-mem databases (Quant client), guardrails service (NeMo Guardrails), orchestration/router (AICI/DSPy), base model/function-calling agent (NIM: Llama-3 8B)
❑ NIM Server backend and APIs on CUDA GPU: base inferencing backend (NIM: Llama-3 70B), documentation backend (NIM: Llama-3 8B), coding backend (NIM: Llama-3 70B), backend API services (FastAPI), cache (Redis)
❑ Lifecycle/control plane: Azure Arc, LLMOps (W&B)
From Proof of Concept to Pilot to Production

Lessons Learned

❑ Set expectations
❑ Minimize risks
❑ Always experiment and build with the North Star of taking it to production
❑ Work 3x faster, so the path from product start to launch happens in 6 months
Set Expectations

Building cool demos with GenAI is easy.

Building an industrial or enterprise product with GenAI is hard.

❑ If you want cool demos to show everyone externally that you're ahead of the curve, just do it!
❑ If you want your team to experiment and build AI muscle for production, just do it!
❑ If you want a product: build data, get compute, train talent to build it, and just do it!
There are a lot of things GenAI can do

Q: But can these things meaningfully transform your customers' business?

A: Unclear

There are a lot of things Generative AI can't do NOW

Q: But would GenAI still be unable to do those in the future?

A: Unclear

"When a distinguished but elderly scientist states that something is possible, he is almost certainly right. When he states that something is impossible, he is very probably wrong."
- Arthur C. Clarke
End-to-End Performance Optimization

Local AI Platform - Which Way?

PC route:
- Recommended enthusiast hardware: Ryzen CPU, 64GB RAM, RTX 3090, 1TB SSD
- Recommended pro hardware: Ryzen CPU, 256GB RAM, 6x RTX 4090 with the P2P kernel, 4TB SSD

Mac route:
- Recommended enthusiast hardware: MacBook M2/M3, 16GB RAM, 1TB SSD
- Recommended pro hardware: MacBook M3 Max, 128GB RAM, 4TB SSD
When a Local AI Platform is not Enough

Build Your Own Multi-user AI Platform

❑ Data Lakehouse: your own data; embeddings model; embeddings DB; IaC; DevSecOps
❑ Model Factory: synthetic data pipeline; Playground; Orchestration/Routing; Guardrails; APIs/Plugins; Vault
❑ Hybrid identity service
❑ Local platform on computer/edge GW/phone: local in-mem databases, guardrails service, orchestration/router, base model/function-calling agent, app/API/inference on CPU and GPU
❑ Backend inferencing and APIs on GPU, CPU, IPU, NPU: base inferencing backend, function-calling backend, coding backend, backend API services, cache
❑ Lifecycle/control plane: agent LLMOps
Lessons Learned

1. Choose the best models for your use cases
2. Balance fine-tuning and data pipelines
3. Activate only when you need to
4. Run inference on quantized models
5. Use all available hardware - edge to core
6. Make everything OSS plug-and-play
7. Shorten the PoC-Pilot-Production lifecycle
Performance vs Compute Energy

(Scatter plot comparing model performance against compute energy for DeepSeek-V2, LLaMA-3 70B, LLaMA-3 8B, phi-3-mini/small/med, Qwen-110B/72B/32B, Qwen-1.5-MoE-A2.7B, Yi-34B, Yi-9B, DBRX, Arctic-480B, Gemma-7B, and OLMo1.7-7B.)
Cloud Inferencing Race to the Bottom

(Chart of cloud inference prices per 1M tokens, as of 26 Apr 2024.)

Open AI Platform Security

All cybersecurity best practices, plus: meta prompts, grounding, ASCII, DSPy red teaming…

What did the WizardLM-2 see?
Beyond the Wrappers, RAG, and Prompt Engineering - Advanced AI Systems Engineering
Lifecycle of an AI Model

❑ Training:
- Data preparation
- Efficient training techniques
- Evaluation

❑ Fine-tuning:
- RLHF, RLAIF

❑ Inference:
- Quantization (see the sketch after this list)
- Deployment
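
For the quantization step, a hedged sketch of loading a model 4-bit-quantized with Hugging Face transformers and bitsandbytes (requires a CUDA GPU; the model id is an example placeholder):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # example model id
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 weight quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for speed
)

tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb, device_map="auto"
)

inputs = tok("Quantization reduces", return_tensors="pt").to(model.device)
print(tok.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```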
Training Your Own Model

- OpenELM https://arxiv.org/abs/2404.14619
- "The 'it' in AI models is the dataset" https://nonint.com/2023/06/10/the-it-in-ai-models-is-the-dataset/
Select Your Data

https://arxiv.org/abs/2402.16827
Data Preparation

❑ Model training requires multiple stages:
- Pretraining
- Instruction-tuning
- Alignment
- In-context learning
- Task-specific fine-tuning
❑ Each training stage has different goals
❑ Data selection methods will use different mechanisms
Pretraining

❑ Goal: train a general-purpose model with maximum coverage
❑ Requires: training on massive quantities of text, at least 1 trillion tokens
❑ Diversity and coverage
- Sourcing data from a wide array of domains, including less-represented languages and dialects
- Ensuring the inclusion of various writing styles
❑ Quality and robustness
- Filtering out low-quality, toxic, or biased data to prevent model contamination
- Implementing rigorous testing phases to evaluate the model's performance across different contexts
❑ Data quality evaluation: how to measure data quality at the billion-token scale
- Developing metrics to evaluate the relevance and representativeness of data
- Creating automated tools to efficiently identify and remove low-quality or duplicated content
Data Preparation Pipeline

Data collection (raw web/private data), followed by filtering and deduplication stages:
- Language filtering
- Safety filtering
- Topic filtering
- Semantic filtering
- Text metric filtering and perplexity filtering
- Document quality filtering and repetitive document removal
- Rules-based correction
- MinHash, exact, and paragraph deduplication
Data Sources

❑ Very large sources that require significant preparation
- Common Crawl
- GitHub and Software Heritage
- HuggingFace FineWeb
❑ Curated sources
- Wikipedia
- Public-domain books
❑ Synthetic data is the future
Synthetic Data is the Future

❑ Simple synthetic datasets
- DSPy Synthesizer v2 (example: https://github.com/stanfordnlp/dspy/tree/81c2f579d50057d51351c259796e07958efdd9d1/dspy/experimental/synthesizer)
❑ Complex synthetic datasets
- Distilabel (example: https://github.com/argilla-io/distilabel-workbench/tree/main/projects/farming)
- Cosmopedia (example: https://github.com/huggingface/cosmopedia)
15T-Token Real Web Dataset: FineWeb

15T tokens, clean and deduplicated
45 TB dataset size
Created in 2024
ODC (Open Data Commons) license

https://huggingface.co/datasets/HuggingFaceFW/fineweb
Generate Synthetic Datasets Locally

Distilabel is a framework for synthetic data and AI feedback, built for AI engineers who require high-quality outputs, full data ownership, and overall efficiency.

Started in 2022
693 stars on GitHub • 16 contributors • 8 active PRs

Create a synthetic dataset seed locally on your own AI platform for aligning models to a specific domain, for example:
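
A minimal sketch of such a seed run, assuming a local OpenAI-compatible endpoint (llama.cpp's server exposes one); the topics, model name, and output file are illustrative. Frameworks like distilabel and DSPy layer quality scoring and AI feedback on top of this basic loop:

```python
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")  # local server

topics = ["soil moisture sensors", "crop rotation", "drip irrigation"]
with open("seed_dataset.jsonl", "w") as f:
    for topic in topics:
        resp = client.chat.completions.create(
            model="local",  # name is often ignored by local single-model servers
            messages=[{
                "role": "user",
                "content": f"Write one expert Q&A pair about {topic} as JSON "
                           'with keys "question" and "answer".',
            }],
            temperature=0.7,
        )
        f.write(json.dumps({"topic": topic,
                            "raw": resp.choices[0].message.content}) + "\n")
```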
Synthetic Dataset Example: Cosmopedia

30M synthetic samples
8 domain splits
Generated in 2024

https://huggingface.co/datasets/HuggingFaceTB/cosmopedia
Coding Dataset Example: The Stack v2

67.5 TB full dataset
32.1 TB deduplicated dataset
658 programming languages
Created in 2024

https://huggingface.co/datasets/bigcode/the-stack-v2
Data Filtering

❑ Quality filtering heuristics
- Controlled
- Robust
- Clear priors
❑ Quality filtering by AI
- Classifier-based filtering: fastText classification with an n-gram size of 2 (see the sketch below)
- Perplexity-based filtering: a 5-gram Kneser-Ney model trained on Wikipedia
- Threshold-based filtering: quality-to-content filters
❑ Selective Language Modeling (SLM)
- Train a reference model on a high-quality corpus
- Use it to score each token in the corpus by its loss
- Train only on tokens with a high excess loss between the reference and the training model
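
A hedged sketch of the classifier-based route using the fasttext package; the training file follows fastText's `__label__X text` convention, and the file names, labels, and threshold are placeholders:

```python
import fasttext

# train.txt lines look like:
#   __label__hq A well-written, informative paragraph ...
#   __label__lq CLICK HERE buy cheap ...
model = fasttext.train_supervised(
    input="train.txt",  # placeholder path
    wordNgrams=2,       # n-gram size of 2, matching the recipe above
    epoch=5,
)

def keep(document: str, threshold: float = 0.9) -> bool:
    labels, probs = model.predict(document.replace("\n", " "))
    return labels[0] == "__label__hq" and probs[0] >= threshold

docs = ["A detailed explanation of gradient descent.", "WIN $$$ NOW!!!"]
high_quality = [d for d in docs if keep(d)]
```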
Data Deduplication

❑ Fuzzy
- BLOOM filters for hashing with a fixed-size vector
- MinHash for hashing and sorting (see the sketch below)
❑ Exact
- Exact substrings with a suffix array
- Sentence deduplication
❑ Over-deduplication may keep only the bad data
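
A sketch of fuzzy deduplication with MinHash plus locality-sensitive hashing, using the datasketch library (one common choice; shingle size and Jaccard threshold are illustrative):

```python
from datasketch import MinHash, MinHashLSH

def minhash(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    tokens = text.lower().split()
    for i in range(len(tokens) - 2):          # hash 3-word shingles
        m.update(" ".join(tokens[i:i + 3]).encode("utf-8"))
    return m

lsh = MinHashLSH(threshold=0.7, num_perm=128)  # estimated-Jaccard threshold
docs = {
    "a": "the cat sat on the mat today",
    "b": "the cat sat on the mat",             # near-duplicate of "a"
    "c": "an entirely different document about llamas",
}
kept = {}
for key, text in docs.items():
    m = minhash(text)
    if not lsh.query(m):       # keep only if no near-duplicate was kept before
        lsh.insert(key, m)
        kept[key] = text
print(sorted(kept))            # "b" is likely dropped as a near-duplicate of "a"
```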
Prepare the Data for Pre-Training

❑ Shuffle
❑ Tokenizers (see the sketch below)
❑ Tokenization scaling
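
For the tokenizer step, a sketch of training a small byte-level BPE tokenizer with the Hugging Face tokenizers library; the corpus file, vocab size, and special tokens are placeholders:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

trainer = trainers.BpeTrainer(
    vocab_size=32_000,                        # placeholder vocab size
    special_tokens=["<|begin|>", "<|end|>"],  # placeholder special tokens
)
tokenizer.train(files=["corpus.txt"], trainer=trainer)  # placeholder corpus
tokenizer.save("tokenizer.json")

print(tokenizer.encode("Tokenization scales with data quality.").tokens)
```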
Data Quality Evaluation

❑ Start with a small 1-2B model
❑ Manual data inspection
❑ Clustering
Model Training

❑ Size and efficiency
- Parallelism
- Asynchronicity
- Kernel merging
- Attention
❑ Training recipe stability
❑ Capacity scaling
- Mixture of Experts
- Mixture of Depths
- Creating Transformer/RNN and Transformer/SSM hybrids
4-D Parallelism

❑ Data
- Compute efficiency of gradient all-reduce; training efficiency via batch size
❑ Tensor
- Rewrite model code
- Reduce sync points with combined column/row slicing (see the sketch below)
❑ Pipeline
- Group sub-parts of the network
- Optimize GPU utilization
❑ Sequence

References:
- Breadth-First Pipeline Parallelism https://arxiv.org/abs/2211.05953
- Reducing Activation Recomputation in Large Transformer Models https://arxiv.org/abs/2205.05198
- Sequence Parallelism: Long Sequence Training from System Perspective https://arxiv.org/abs/2105.13120
- FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning https://arxiv.org/abs/2307.08691
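
To make the column/row-slicing idea concrete, a toy NumPy check of the Megatron-style trick: split the first linear layer by columns and the second by rows, so two "devices" compute independently and need only a single sum (the all-reduce) at the end. Shapes are toy-sized and the two halves stand in for two GPUs:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))    # 4 tokens, hidden size 8
W1 = rng.normal(size=(8, 16))  # up-projection of a feed-forward block
W2 = rng.normal(size=(16, 8))  # down-projection
relu = lambda a: np.maximum(a, 0)

# Single-device reference: Y = relu(X @ W1) @ W2
Y_ref = relu(X @ W1) @ W2

# "Device 0" holds W1[:, :8] and W2[:8, :]; "device 1" holds the rest.
# Splitting W1 by columns keeps relu local, since relu is elementwise.
Y0 = relu(X @ W1[:, :8]) @ W2[:8, :]
Y1 = relu(X @ W1[:, 8:]) @ W2[8:, :]
Y = Y0 + Y1  # the one sync point: a single all-reduce

assert np.allclose(Y, Y_ref)
```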
Training Recipes

❑ Initialization
❑ Stabilization
❑ Learning rate
❑ Scaling hyper-parameter results

- MiniCPM V2.0 https://huggingface.co/openbmb/MiniCPM-V-2
- Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer https://arxiv.org/abs/2203.03466
Synthetic AI Recipe as an Emerging Trend: The Mystic WizardLM-2

❑ Data pre-processing
- Weighted sampling
- Progressive learning
❑ Evol-Instruct
❑ AI Aligns AI (AAA)
- Co-teaching
- Self-teaching
❑ Supervised learning
- Staged DPO
- RLEIF with IRM and PRM

Research paper expected in Q2 CY24.
Alignment

❑ Reinforcement learning from human feedback (RLHF)
- Direct Preference Optimization (DPO; see the sketch below)
- Odds Ratio Preference Optimization (ORPO; a single loss for alignment and SFT)
❑ Reinforcement learning from AI feedback (RLAIF)
❑ Reinforcement learning from Evol-Instruct feedback (RLEIF)

References:
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model https://arxiv.org/abs/2305.18290
- ORPO: Monolithic Preference Optimization without Reference Model https://arxiv.org/abs/2403.07691
- RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback https://arxiv.org/abs/2309.00267
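
A minimal PyTorch sketch of the DPO objective from the paper cited above: push the policy/reference log-ratio of the chosen answer above that of the rejected one. The inputs here are fake summed token log-probs, purely for illustration:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """All inputs: summed per-sequence token log-probs, shape (batch,)."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    # -log sigmoid(beta * margin): minimized when chosen beats rejected
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

t = torch.tensor
loss = dpo_loss(t([-12.0, -9.0]), t([-15.0, -14.0]),
                t([-13.0, -10.0]), t([-14.0, -13.0]))
print(float(loss))
```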
Build System of Systems

The same multi-user platform architecture as above: a data lakehouse over your own data, a model factory with a synthetic data pipeline, playground, orchestration/routing, guardrails, APIs/plugins, vault, hybrid identity service, a local platform on computer/edge GW/phone, backend inferencing and APIs on GPU/CPU/IPU/NPU, and a lifecycle/control plane with agent LLMOps.
Practical Use Cases

1. Content creation
2. Automation of routine tasks
3. Human-computer interface personalization
4. Assisted software development
5. Design and prototyping
6. Synthetic data generation
Thank You!
We Will Meet Again!
Backup slides
