Create Purpose-Built AI Using Vision and Language
Foundation Model Training
[Diagram: foundation models (e.g., a speech FM, a Stable Diffusion FM) are trained on diverse data – structured data, synthetic data, and real data – and evaluated against reference tests.]
Benefits
• Highly generalizable
• Highly versatile and adaptable across many domains
Model Customization
• Full Fine-tuning – update the weights of the entire model, including the foundation backbone, turning the foundation model into your custom fine-tuned model.
• In-context Learning – use visual prompting and model chaining to improve contextual awareness.
Example applications: defect inspection, video agent.
Optimizing Foundation Models for Inference
[Diagram: three complementary techniques – quantizing the FP32 pre-trained model to FP16, INT8, or INT4; distilling a large teacher model into a smaller student by backpropagating a distillation loss; and pruning the original network into a smaller pruned network.]
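The distillation path in this diagram can be made concrete with a short sketch. The following is a minimal, hedged PyTorch example of teacher-student distillation with a temperature-scaled KL loss; the tiny networks, temperature, and loss weighting are illustrative placeholders, not the recipe used for these foundation models.

```python
# Minimal knowledge-distillation sketch in PyTorch (illustrative placeholders only).
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Sequential(nn.Flatten(), nn.Linear(784, 512), nn.ReLU(), nn.Linear(512, 10))  # "FP32 pre-trained" stand-in
student = nn.Sequential(nn.Flatten(), nn.Linear(784, 64), nn.ReLU(), nn.Linear(64, 10))    # smaller student to deploy

teacher.eval()                      # teacher is frozen
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
T, alpha = 4.0, 0.5                 # softmax temperature and loss mix (placeholders)

def distillation_step(images, labels):
    with torch.no_grad():
        t_logits = teacher(images)
    s_logits = student(images)
    # Soft-target loss: KL divergence between temperature-scaled distributions
    kd = F.kl_div(F.log_softmax(s_logits / T, dim=-1),
                  F.softmax(t_logits / T, dim=-1),
                  reduction="batchmean") * (T * T)
    # Hard-target loss on ground-truth labels
    ce = F.cross_entropy(s_logits, labels)
    loss = alpha * kd + (1 - alpha) * ce
    optimizer.zero_grad()
    loss.backward()                 # the "backprop" arrow in the diagram
    optimizer.step()
    return loss.item()

# Example usage with random stand-in data
imgs, lbls = torch.randn(8, 1, 28, 28), torch.randint(0, 10, (8,))
print(distillation_step(imgs, lbls))
```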
Model | Purpose
NV-DINOv2 | Vision-only backbone for downstream Vision AI tasks – image classification, detection, segmentation
NV-CLIP | Image-text matching model that aligns image features with text features; backbone for downstream Vision AI tasks – image classification, detection, segmentation
EfficientViT-SAM | Faster, more efficient version of SAM (Segment Anything Model), a visual FM for segmenting any object based on different types of visual prompts, such as a single coordinate or a bounding box
VILA | Family of visual language models for image and video understanding and Q&A
LITA | Visual language model for video understanding and context with spatial and temporal localization
FoundationPose | 6-DoF object pose estimation and tracking, providing the object pose and 3D bounding box
BEVFusion | Sensor fusion model that fuses multiple input sensors – cameras, LiDAR, radar, etc. – to create a bird's-eye view of the scene with 3D bounding box representations of the objects
NeVA | Multi-modal visual language model for image understanding and Q&A
LiDARSAM | Segment any object based on user-provided text prompts on 3D point-cloud LiDAR data
Multi-Modal Foundation Backbone - NV-CLIP
[Figure: CLIP-style contrastive architecture and benchmark charts comparing NV-CLIP ViT-B and ViT-H backbones on zero-shot accuracy and throughput. † Non-commercial use only. †† Trained on 700M image-text pairs vs. 2B for CLIP.]
• Commercially Viable – trained on ethically sourced data and compares favorably to other non-commercial public models
• Trained on Very Large Dataset – trained on 700M image-text pairs for text and image embeddings
• Foundation Backbone for Vision AI – used in many downstream vision tasks such as zero-shot detection/segmentation, VLMs, and more
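As a hedged illustration of how a CLIP-style image-text backbone is consumed downstream, the snippet below scores an image against a few text prompts using the public openai/clip-vit-base-patch32 checkpoint from Hugging Face transformers as a stand-in; NV-CLIP itself is distributed through NVIDIA's catalog and its packaging may differ. The image path and prompts are illustrative.

```python
# CLIP-style image-text matching sketch (public checkpoint as a stand-in, not NV-CLIP itself).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("warehouse.jpg")                  # placeholder image path
texts = ["a forklift", "a pallet of boxes", "an empty aisle"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image: similarity of the image to each text prompt
probs = outputs.logits_per_image.softmax(dim=-1)
for text, prob in zip(texts, probs[0].tolist()):
    print(f"{text}: {prob:.3f}")
```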
Foundation Backbones
[Diagram: data feeds the NV-DINOv2 / NV-CLIP foundation backbones, which support zero-shot outputs – class label, bounding box, mask, text – with a diffusion model shown for image generation.]
Fine-Tune with 100 or Fewer Samples for Image Classification
[Diagram: the NV-DINOv2 backbone is paired with a lightweight adapter – an MLP head for classification/detection or a decoder for segmentation – and fine-tuned per task on roughly 100 samples. Reported results: industrial defect classification (golden-image comparison) 100%, multi-class satellite imagery segmentation 98%, multi-class indoor warehouse scene segmentation 93.5%.]
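A minimal sketch of this adapter-style, few-shot fine-tuning, assuming a frozen vision backbone with a small trainable MLP head: the public DINOv2 checkpoint from torch.hub stands in for NV-DINOv2, and the class count, image size, and hyperparameters are illustrative placeholders.

```python
# Few-shot adapter fine-tuning sketch: frozen vision backbone + small MLP head.
# DINOv2 from torch.hub is a public stand-in for NV-DINOv2; values are placeholders.
import torch
import torch.nn as nn

backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")
backbone.eval()
for p in backbone.parameters():           # freeze the foundation backbone
    p.requires_grad = False

num_classes = 4                           # e.g. defect types (placeholder)
head = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, num_classes))
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def train_step(images, labels):
    """images: (B, 3, 224, 224) tensors; labels: (B,) class indices."""
    with torch.no_grad():
        feats = backbone(images)          # (B, 768) CLS embeddings
    logits = head(feats)
    loss = loss_fn(logits, labels)
    optimizer.zero_grad()
    loss.backward()                       # only the MLP head is updated
    optimizer.step()
    return loss.item()

# With ~100 labeled samples, a few passes over the small set are usually enough.
images = torch.randn(8, 3, 224, 224)      # stand-in batch
labels = torch.randint(0, num_classes, (8,))
print(train_step(images, labels))
```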
[Diagram: an image passes through a Swin-B image backbone and a text prompt (e.g., “Person wearing glasses”) through a BERT-B text backbone. Image and text features are combined via a feature enhancer, language-guided query selection, cross-modality feature fusion, and a cross-modality decoder to produce grounded detections.]
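This dual-backbone design matches the open Grounding DINO architecture. As a hedged usage sketch, the snippet below runs text-prompted detection with the community IDEA-Research/grounding-dino-tiny checkpoint in Hugging Face transformers; the checkpoint name, image path, and prompt are assumptions for illustration, not necessarily the model shipped in this stack.

```python
# Text-prompted (grounded) detection sketch using an open Grounding DINO checkpoint
# as a stand-in for the Swin-B + BERT-B architecture above.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

model_id = "IDEA-Research/grounding-dino-tiny"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id)

image = Image.open("scene.jpg")                  # placeholder image path
text = "person wearing glasses."                 # phrases are lowercase and '.'-terminated

inputs = processor(images=image, text=text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Post-process into boxes and scores (matched phrases are also returned)
results = processor.post_process_grounded_object_detection(
    outputs, inputs.input_ids, target_sizes=[image.size[::-1]]
)
print(results[0]["boxes"])
print(results[0]["scores"])
```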
Use Cases:
Grounding Annotation
{
  "grounding": {
    "caption": "a wire hanger with a paper cover ...",
    "regions": [
      {
        "bbox": [20, 215, 985, 665],
        "phrase": "a paper cover that reads we heart ..."
      },
      ...
    ]
  }
}
Augmenting LLMs With Visual Input | SOTA Accuracy for Visual Language | Optimizing Inference With 4-bit AWQ
[Chart: tokens/sec throughput of VILA-7B-AWQ and VILA-13B-AWQ on RTX 4090, A100, and AGX Orin.]
https://arxiv.org/pdf/2312.07533.pdf
Improving Spatial and Contextual Awareness Using SoM Prompting
Prompt: “There are 1-8 numeric IDs that are person class. Which numeric ID is in most danger in this image?”
[Diagram: persons in the frame are overlaid with numbered marks, and the VLM answers by referring to a specific ID; events are localized over time (Event @t1, Event @t2, …, Event @tN). A minimal sketch of constructing such a prompt follows after the list below.]
VLM Capabilities
• Spatial and temporal localization
• Contextual understanding
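A minimal sketch of Set-of-Mark (SoM) prompt construction, assuming person boxes already exist from a detector: numeric IDs are drawn onto the frame and referenced in the text prompt. The detector output and the ask_vlm() call are hypothetical placeholders; any detector and VLM endpoint could be substituted.

```python
# Set-of-Mark (SoM) prompting sketch: overlay numeric IDs on detected people,
# then ask the VLM about them by ID. Boxes and ask_vlm() are hypothetical.
from PIL import Image, ImageDraw

def add_marks(image_path, boxes):
    """Draw a numeric ID at the top-left corner of each person box."""
    img = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    for i, (x1, y1, x2, y2) in enumerate(boxes, start=1):
        draw.rectangle([x1, y1, x2, y2], outline="red", width=3)
        draw.text((x1 + 4, y1 + 4), str(i), fill="red")
    return img

# Hypothetical detector output: person bounding boxes in pixel coordinates
person_boxes = [(120, 80, 260, 400), (300, 90, 430, 410), (500, 100, 620, 420)]

marked = add_marks("frame.jpg", person_boxes)      # placeholder image path
marked.save("frame_som.png")

prompt = (
    f"There are 1-{len(person_boxes)} numeric IDs that are person class. "
    "Which numeric ID is in most danger in this image?"
)
# answer = ask_vlm(image="frame_som.png", prompt=prompt)   # hypothetical VLM call
```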
Fine-tuning data workflow: create dense captioning → use GPT-4 to generate Q&A pairs → register the dataset for training → train.

Source event annotation (excerpt):
  ...
    [ 11.0, 14.0 ]
  ],
  "sentences": [
    "Player in black passes to another player in black",
    "Player in black shoots and makes a 2-point shot"
  ],
  "events": [
    "basketball pass",
    "basketball shot"
  ]
}

Generated conversation-format training sample:
{
  "video_id": "820627698-640",
  "conversations": [
    {
      "role": "user",
      "content": "Write a concise and clear dense caption for the provided sports video, focusing on key basketball events such as 3-point shots, 2-point shots, free throws, and passes."
    },
    {
      "role": "assistant",
      "content": "<5.0s> <6.0s> A player in black passes the ball to another player in black. <11.0s> <14.0s> A player in black shoots the ball and completes a 2-point shot."
    },
    {
      "role": "user",
      "content": "At what time does the player in black pass the ball?"
    },
    {
      "role": "assistant",
      "content": "The player in black passes the ball to another player in black at <5.0s>."
    },
    ...
  ]
}
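The step from the event annotation to the conversation-format sample can be scripted. Below is a hedged sketch that mirrors the two excerpts above; the "timestamps" key name is an assumption (that part of the excerpt is truncated), and the GPT-4 rewriting and Q&A-generation steps are omitted, so the caption is built directly from the raw sentences.

```python
# Sketch: turn a timestamped event annotation into a LITA-style conversation sample.
# Field names mirror the excerpts above; GPT-4 caption rewriting / Q&A generation is omitted.
import json

def to_conversation_sample(video_id, annotation):
    # Dense caption: each sentence is prefixed with its <start>/<end> time tokens
    caption = " ".join(
        f"<{start}s> <{end}s> {sentence}."
        for (start, end), sentence in zip(annotation["timestamps"], annotation["sentences"])
    )
    return {
        "video_id": video_id,
        "conversations": [
            {
                "role": "user",
                "content": "Write a concise and clear dense caption for the provided sports video, "
                           "focusing on key basketball events such as 3-point shots, 2-point shots, "
                           "free throws, and passes.",
            },
            {"role": "assistant", "content": caption},
        ],
    }

annotation = {
    "timestamps": [[5.0, 6.0], [11.0, 14.0]],        # key name assumed
    "sentences": [
        "Player in black passes to another player in black",
        "Player in black shoots and makes a 2-point shot",
    ],
    "events": ["basketball pass", "basketball shot"],
}

print(json.dumps(to_conversation_sample("820627698-640", annotation), indent=2))
```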
Fine-tuning LITA for a Custom Use Case
[Diagram: the text prompt is tokenized into text tokens and fed to the model; some components are kept frozen and others are trainable during fine-tuning.]
Training KPIs:
• Sports (basketball) action dataset
• Trained ~1K images in 16 hours on 8xA100
TensorRT-LLM is an open-source library for optimizing inference performance of the latest large language models on NVIDIA GPUs.
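A minimal usage sketch, assuming the high-level Python LLM API available in recent TensorRT-LLM releases; the model name and sampling settings are placeholders, and a visual language model such as VILA is typically served through TensorRT-LLM's multimodal workflow rather than this text-only path.

```python
# Minimal TensorRT-LLM sketch using the high-level Python LLM API (text-only).
# Model name, prompt, and sampling settings are illustrative placeholders.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")   # builds/loads a TensorRT engine
sampling = SamplingParams(temperature=0.8, top_p=0.95)

prompts = ["Summarize the warehouse safety incident in one sentence:"]
for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text)
```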
[Charts: performance, TCO, average latency, and cost comparisons with TensorRT-LLM.]
[Diagram: TAO workflow – Omniverse Replicator generates synthetic data; foundation models are pulled from NVIDIA NGC; TAO fine-tunes them on your data and exports to ONNX for deployment on any inference platform (GPU, CPU, MCU, DLA). TAO services include a workflow manager, TAO CLI client, AutoML, datasets, and models. Future support (Q3 2024): Data API, VILA, image generation.]