
Create Purpose-Built AI Using Vision and Language With Multi-modal Foundation Models

Mireille Fares, PhD
Senior AI Solution Architect | NVIDIA
Multi-modal Foundation Models

[Figure: text, images, speech, structured data, and 3D signals feed into training of a single foundation model]

Foundation Models
• Trained on a massive corpus of text, image, video, and other data
• Unsupervised learning
• Multi-modal: vision, text, speech, sensor

Benefits
• Highly generalizable
• Highly versatile and adaptable across many domains
• Natural language interaction
• Not rigid: zero-shot and few-shot adaptation
Applying Multi-modal Foundation Models

• Video Search and Summarization
• Real-time Asset Tracking
• Customer Assistance
• Content Generation
• Detecting Hazardous Conditions
• Human-robot Interaction


Foundation Models Are Great, but Not Perfect
Zero-shot inference

Video Summarization example
Question: What type of shot is this?
Answer: This is a basketball shot (correct, but too generic for the domain)

Change Detection example
Prompt: Person With Basketball
[Figure: reference and test images, with the zero-shot change prediction shown next to the ground truth]
Model Customization Workflow

[Figure: Data Prep – synthetic data from a Stable Diffusion FM and real data are annotated with a human in the loop; Model Fine-tuning – the annotated dataset is used to customize a foundation model into a fine-tuned model]

Higher Accuracy | Better Predictability | Faster Inference
Customization Techniques for Visual Foundation Models
Overcome the challenges of using Foundation Models

Model Customization
• Full Fine-tuning – update the weights of the entire model, including the foundation backbone
• Last Layer or Head Fine-tuning – freeze the foundation backbone and fine-tune the last few layers (see the sketch after this slide)
• In-context Learning – use visual prompting and model chaining to improve contextual awareness

[Figure: start with a pre-trained foundation model and customize it into your custom model for tasks such as defect inspection, video agents, autonomous checkout, and image-to-text]
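As a rough illustration of last-layer fine-tuning, the sketch below freezes a pretrained backbone and trains only a small classification head. The torchvision backbone and the 5-class task are placeholders for illustration, not part of the original slides.

import torch
import torch.nn as nn
from torchvision import models

# Placeholder backbone: any pretrained vision model works the same way.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = nn.Identity()          # expose the 2048-d feature vector

# Freeze the foundation backbone.
for p in backbone.parameters():
    p.requires_grad = False

# Trainable head for a hypothetical 5-class defect-inspection task.
head = nn.Linear(2048, 5)

model = nn.Sequential(backbone, head)
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)   # only the head updates
criterion = nn.CrossEntropyLoss()

def train_step(images, labels):
    """One fine-tuning step: frozen backbone features -> trainable head."""
    logits = model(images)
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()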
Optimizing Foundation Models for Inference

Knowledge Distillation
• A pre-trained (FP32) teacher model guides a smaller student model; the distillation loss is backpropagated through the student

Quantization (FP32 → FP16 → INT8 → INT4)
• Activation-aware Weight Quantization (AWQ)
• Post-training Quantization (PTQ)
• Quantization-Aware Training (QAT)

Model Pruning (original network → pruned network)
• L0 Regularization
• Variational Dropout
• Magnitude Pruning

NVIDIA Optimized Visual Foundation Models

Model              Purpose
NV-DINOv2          Vision-only backbone for downstream Vision AI tasks – image classification, detection, segmentation
NV-CLIP            Image-text matching model that aligns image features with text features; backbone for downstream Vision AI tasks – image classification, detection, segmentation
Grounding-DINO     Open-vocabulary object detection with text prompts as input
EfficientViT-SAM   Faster, more efficient version of SAM (Segment Anything Model), a visual FM that segments any object from visual prompts such as a single coordinate or a bounding box
VILA               Family of visual language models for image and video understanding and Q&A
LITA               Visual language model for video understanding and context with spatial and temporal localization
FoundationPose     6-DoF object pose estimation and tracking, providing the object pose and 3D bounding box
BEVFusion          Sensor-fusion model that fuses multiple input sensors – cameras, LiDAR, radar, etc. – to create a bird's-eye view of the scene with 3D bounding-box representations of the objects
NeVa               Multi-modal visual language model for image understanding and Q&A
LiDARSAM           Segments any object on 3D point-cloud LiDAR data based on user-provided text prompts
Multi-Modal Foundation Backbone – NV-CLIP

[Figure: CLIP-style contrastive training – a text encoder embeds captions such as "colorful cat" into T1…TN, an image encoder embeds images into I1…IN, and the model learns over the pairwise similarity matrix Ii·Tj]

Zero-shot Accuracy (ImageNet-1K)
Model     NV-CLIP   OpenAI CLIP†
ViT-B     70.4      68.6
ViT-H     77.4††    78.0
†  Non-commercial use only
†† Trained on 700M image-text pairs vs. 2B for CLIP

[Chart: model inference performance (FPS) for ViT-B and ViT-H across H100, A100, L40, A30, L4, and A2 GPUs]

Commercially Viable – trained on ethically sourced data and compares favorably to other non-commercial public models
Trained on a Very Large Dataset – 700M image-text pairs for text and image embeddings
Foundation Backbone for Vision AI – used in many downstream vision tasks such as zero-shot detection/segmentation, VLMs, and more

Available in April 2024
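As a rough sketch of how a CLIP-style backbone performs zero-shot classification, the code below scores an image embedding against text embeddings of candidate labels; the encode_image/encode_text callables stand in for NV-CLIP's encoders and are assumptions, not its actual API.

import torch
import torch.nn.functional as F

def zero_shot_classify(image, class_names, encode_image, encode_text):
    """Return the most likely class name by image-text embedding similarity.

    encode_image(image) -> (1, D) tensor; encode_text(list_of_str) -> (N, D) tensor.
    Both callables are placeholders for a CLIP-style model's encoders.
    """
    prompts = [f"a photo of a {name}" for name in class_names]
    img_emb = F.normalize(encode_image(image), dim=-1)       # (1, D)
    txt_emb = F.normalize(encode_text(prompts), dim=-1)      # (N, D)
    similarity = img_emb @ txt_emb.T                         # (1, N) cosine similarities
    probs = similarity.softmax(dim=-1)
    best = int(probs.argmax())
    return class_names[best], probs[0, best].item()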


Using Foundation Backbones for Downstream CV Tasks

Data flows through a foundation backbone (NV-DINOv2 / NV-CLIP) to produce a feature vector, which feeds task-specific heads:

Task              Output
Classification    Class label
Detection         Bounding box & labels
Segmentation      Class label per pixel (mask)
Zero-shot         Class label, b-box, mask, text
Image Retrieval   Image
VLMs              Text, image
Diffusion         Image
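One of the simpler downstream tasks, image retrieval, can be sketched directly on top of backbone features: index every gallery image's feature vector and rank by cosine similarity to the query. The encode callable below is a placeholder for a backbone such as NV-DINOv2, not its real interface.

import torch
import torch.nn.functional as F

class FeatureIndex:
    """Minimal in-memory index of backbone feature vectors for image retrieval."""

    def __init__(self, encode):
        self.encode = encode          # placeholder: image -> (1, D) feature tensor
        self.features = []            # list of (1, D) tensors
        self.ids = []                 # parallel list of image identifiers

    def add(self, image_id, image):
        self.features.append(F.normalize(self.encode(image), dim=-1))
        self.ids.append(image_id)

    def search(self, query_image, top_k=5):
        query = F.normalize(self.encode(query_image), dim=-1)   # (1, D)
        gallery = torch.cat(self.features, dim=0)               # (N, D)
        scores = (query @ gallery.T).squeeze(0)                 # (N,)
        values, order = scores.topk(min(top_k, len(self.ids)))
        return [(self.ids[i], v.item()) for i, v in zip(order.tolist(), values)]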
Fine-Tune in 100 or Fewer Samples for Image Classification
Train with as few as 10 samples

[Figure: the frozen NV-DINOv2 foundation model (trained on >100M image/text pairs) produces a feature vector for each image; the feature vectors and ground-truth labels are used to fine-tune a lightweight classifier with TAO, which then runs inference to produce predictions]

[Chart: few-shot learning accuracy of NV-DINOv2 vs. GC-ViT on PCB defect classification, plotted against the number of training samples (10, 100, 1000)]

Demo: Foundation Model Fine-tuning
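A minimal way to reproduce this few-shot setup outside TAO is a linear probe: extract frozen backbone features for the handful of labeled images and fit a simple classifier on them. The sketch below uses scikit-learn's logistic regression; the encode function is a stand-in for the NV-DINOv2 backbone.

import numpy as np
from sklearn.linear_model import LogisticRegression

def few_shot_linear_probe(train_images, train_labels, test_images, encode):
    """Fit a linear classifier on frozen backbone features from a few samples.

    encode(image) -> 1-D numpy feature vector (placeholder for the backbone).
    """
    X_train = np.stack([encode(img) for img in train_images])   # (N, D), N can be ~10
    X_test = np.stack([encode(img) for img in test_images])

    probe = LogisticRegression(max_iter=1000)
    probe.fit(X_train, np.asarray(train_labels))
    return probe.predict(X_test)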


Fine-tune Visual Change Detection

Dataset                       Task                        Accuracy
Satellite Imagery             Multi-class Segmentation    98%
Industrial Defect Detection   Classification              100%
Indoor Warehouse Scene        Multi-class Segmentation    93.5%
Outdoor Scene                 Binary Segmentation         98%

[Figure: a golden image and a test image each pass through a frozen NV-DINOv2 backbone with a trainable adapter; the features feed a trainable MLP decoder that outputs the change-detection mask]

[Chart: inference performance (FPS) across Jetson Orin Nano 8GB, Orin NX 16GB, AGX Orin 64GB, A2, T4, L4, L40, A100, and H100 – from roughly 15 FPS on Orin Nano to 841 FPS on H100]

Fine-tuning Configuration
model:
  # pretrained: null
  backbone:
    type: "vit_large_nvdinov2"
    pretrained_backbone_path: /path/to-nvdino.ckpt
    freeze_backbone: True
  decode_head:
    feature_strides: [4, 8, 16, 32]
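To show the shape of the golden-image/test-image comparison in code, here is a minimal sketch of a Siamese change-detection forward pass: both images go through the same frozen backbone, and a small trainable decoder consumes the concatenated features. The module sizes and the backbone output shape are illustrative assumptions, not the TAO architecture.

import torch
import torch.nn as nn

class ChangeDetector(nn.Module):
    """Siamese change detection: shared frozen backbone + trainable MLP decoder."""

    def __init__(self, backbone, feat_dim=1024):
        super().__init__()
        self.backbone = backbone                  # frozen feature extractor, e.g. a ViT
        for p in self.backbone.parameters():
            p.requires_grad = False
        # Trainable decoder over concatenated golden/test features (sizes are illustrative).
        self.decoder = nn.Sequential(
            nn.Linear(2 * feat_dim, 512), nn.ReLU(),
            nn.Linear(512, 1),                    # per-patch change logit
        )

    def forward(self, golden, test):
        with torch.no_grad():
            f_golden = self.backbone(golden)      # assumed shape: (B, N_patches, feat_dim)
            f_test = self.backbone(test)
        fused = torch.cat([f_golden, f_test], dim=-1)
        return self.decoder(fused).squeeze(-1)    # (B, N_patches) change logits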
Zero-shot Detection With Context Using Grounding-DINO

[Figure: the image passes through a Swin-B image backbone and the text prompt (e.g., "person wearing glasses") through a BERT-B text backbone; a feature enhancer performs cross-modality feature fusion, language-guided query selection picks cross-modality queries, and a cross-modality decoder predicts the boxes]

Use Cases: Image Search | Detecting Anomalies | Unseen Environments | Auto-labeling
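For a quick way to try open-vocabulary detection with a text prompt, the sketch below assumes the Hugging Face transformers integration of Grounding DINO; the model id, thresholds, and post-processing call follow that library's documented usage (around transformers 4.40) and are not part of the original slides.

import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

# Assumptions: public checkpoint on the HF Hub; argument names may differ in other library versions.
model_id = "IDEA-Research/grounding-dino-base"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id)

image = Image.open("example.jpg")          # placeholder image path
text = "person wearing glasses."           # lowercase phrases separated by periods

inputs = processor(images=image, text=text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

results = processor.post_process_grounded_object_detection(
    outputs,
    inputs.input_ids,
    box_threshold=0.35,
    text_threshold=0.25,
    target_sizes=[image.size[::-1]],       # (height, width)
)
# Each result holds detected boxes, confidence scores, and the matched text phrases.
for score, box in zip(results[0]["scores"], results[0]["boxes"]):
    print(round(score.item(), 3), [round(v) for v in box.tolist()])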


Fine-Tune With Domain-Specific Data

Annotation
• Regular detection annotation
• Grounding annotation: caption and bounding boxes (a conversion sketch follows this slide)

Grounding Annotation
{
  "grounding": {
    "caption": "a wire hanger with a paper cover ...",
    "regions": [
      {
        "bbox": [20, 215, 985, 665],
        "phrase": "a paper cover that reads we heart ..."
      },
      ...
    ]
  }
}

[Figure: the same Grounding-DINO architecture as above – Swin-B image backbone, BERT-B text backbone, feature enhancer, language-guided query selection, cross-modality decoder]

Sports Use Case
• Fine-tuned on 125K images of basketball activity with grounding annotation
• GPUs: 8 x A100
• Total training time: 6 hours

Fine-tuning Configuration
train:
  num_gpus: 1
  optim:
    lr_backbone: 1e-05
    lr: 0.0001
    lr_steps: [4]
  num_epochs: 6
  freeze: ["backbone.0", "bert"]   # freeze visual / text backbone
  pretrained_model_path: /path/to/swinb.pth
model:
  backbone: swin_base_384_22k
  num_feature_levels: 4
  dec_layers: 6
  enc_layers: 6
...

Available for fine-tuning in April 2024
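As a rough illustration of turning regular detection annotations into the grounding format above, the sketch below builds a caption from per-box phrases. The input schema, function name, and phrase wording are assumptions for illustration.

import json

def detection_to_grounding(objects):
    """Convert simple detection annotations into a grounding-style record.

    objects: list of {"bbox": [x1, y1, x2, y2], "phrase": "..."} dicts (assumed schema).
    The caption is just the concatenation of the region phrases.
    """
    regions = [{"bbox": obj["bbox"], "phrase": obj["phrase"]} for obj in objects]
    caption = ", ".join(obj["phrase"] for obj in objects)
    return {"grounding": {"caption": caption, "regions": regions}}

example = detection_to_grounding([
    {"bbox": [20, 215, 985, 665], "phrase": "a paper cover that reads we heart ..."},
])
print(json.dumps(example, indent=2))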
Visual Language Model Using VILA

• Augmenting LLMs with visual input
• SOTA accuracy for visual language tasks
• Optimizing inference with 4-bit AWQ

[Chart: tokens/sec for VILA-7B-AWQ and VILA-13B-AWQ on RTX 4090, A100, and AGX Orin]

https://arxiv.org/pdf/2312.07533.pdf
Improving Spatial and Contextual Awareness Using SoM Prompting

Pipeline: Grounding-DINO (prompt: "Person") → EfficientViT-SAM (masks with numeric IDs) → VILA

Prompt
There are 1-8 numeric IDs that are person class. Which numeric ID is in most danger in this image?

Answer
The firefighter wearing the number 4 is in the most danger in this image. He is standing closest to the fire, which is actively burning and producing thick smoke. This puts him at a higher risk of being affected by the heat, smoke, and potential hazards associated with the fire.
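A minimal sketch of the set-of-mark step, drawing numeric IDs onto detected person regions before handing the image to the VLM; the box format and the downstream ask_vlm call are hypothetical placeholders, not the pipeline's actual interfaces.

from PIL import Image, ImageDraw

def draw_set_of_marks(image: Image.Image, boxes):
    """Overlay numeric IDs (1, 2, ...) on detected regions for SoM prompting.

    boxes: list of [x1, y1, x2, y2] in pixel coordinates (assumed format).
    """
    marked = image.copy()
    draw = ImageDraw.Draw(marked)
    for idx, (x1, y1, x2, y2) in enumerate(boxes, start=1):
        draw.rectangle([x1, y1, x2, y2], outline="red", width=3)
        draw.text((x1 + 4, y1 + 4), str(idx), fill="red")
    return marked

# marked = draw_set_of_marks(image, person_boxes)
# answer = ask_vlm(marked, "There are 1-8 numeric IDs that are person class. "
#                          "Which numeric ID is in most danger in this image?")
# ask_vlm is a hypothetical helper wrapping the VLM call.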
Adding Temporal Understanding to VLMs With LITA
Language Assisted Temporal Understanding (LITA)

[Figure: video frames from events at t1…tN pass through a ViT vision encoder, linear projection, and temporal pooling to produce visual tokens plus fast/slow timestamp tokens; the prompt ("When did free throws happen in the video?") is tokenized into text tokens; the LLM consumes all tokens and returns an answer or caption, e.g., "Free throws happened at 0.43-0.50 and 1.20-2.30"]

VLM Capabilities
• Spatial and temporal localization
• Contextual understanding
• Optical character recognition (text, logos)
• Inference metadata output for applications
• Video and image summarization
• Text responses for interactive Q&A: timestamps, events, sequence
"820627698-640": {

Data Preparation for LiTA "duration": 20.0,


"timestamps": [
[ 5.0, 6.0 ],

Fine-tuning ],
[ 11.0, 14.0 ]

"sentences": [
"Player in black passes to another player in black",
"Player in black shoots and makes a 2-point shot"
],
"events": [
"basketball pass",
"basketball shot"
]
}

Create Dense {
"video_id": "820627698-640",
Captioning "conversations": [
{
"role": "user",
"content": "Write a concise and clear dense caption for the provided sports video, focusing on key basketball
events such as 3-point shots, 2-point shots, free throws, and passes."
GPT4 },
{
"role": "assistant",
Generate Q&A "content": "<5.0s> <6.0s> A player in black passes the ball to another player in black. <11.0s> <14.0s> A
Pairs player in black shoots the ball and completes a 2-point shot."
},
{
"role": "user",
"content": "At what time does the player in black passes the ball?"
Register Dataset },
{
for Training "role": "assistant",
"content": "The player in black passes the ball to another player in black at <5.0s>."
},

...

]
}
Train
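A minimal sketch of the conversion between the two formats above: it turns per-video timestamps and sentences into the dense-caption conversation record used for training. In the real workflow GPT-4 rephrases captions and generates Q&A pairs; this sketch only shows the mechanical template, and the function name and schema handling are assumptions.

def build_dense_caption_sample(video_id, annotation):
    """Turn timestamp/sentence annotations into a dense-caption conversation record.

    annotation: {"duration": float, "timestamps": [[start, end], ...], "sentences": [...], "events": [...]}
    """
    caption_parts = [
        f"<{start}s> <{end}s> {sentence}."
        for (start, end), sentence in zip(annotation["timestamps"], annotation["sentences"])
    ]
    return {
        "video_id": video_id,
        "conversations": [
            {
                "role": "user",
                "content": ("Write a concise and clear dense caption for the provided sports video, "
                            "focusing on key basketball events such as 3-point shots, 2-point shots, "
                            "free throws, and passes."),
            },
            {"role": "assistant", "content": " ".join(caption_parts)},
        ],
    }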
Fine-tuning LITA for a Custom Use Case
Supervised Fine-tuning with LoRA

[Figure: the LITA architecture with frozen and trainable components marked; with LoRA only a small set of adapter weights is trained]

Dataset Recommendation
• 1,000 videos for good fine-tuning (100 minimum)
• 1–2 min. video clips
• Events between 1% and 10% of the clip duration

Training KPIs
• Sports (basketball) action dataset
• Trained on ~1K images in 16 hours on 8x A100

Model Size: 13.3B parameters
Trainable Parameters: 50M (with LoRA)

LoRA: Low-Rank Adaptation of LLMs
https://arxiv.org/pdf/2106.09685.pdf
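For reference, supervised fine-tuning with LoRA typically wraps the frozen base model with low-rank adapters. The sketch below uses the Hugging Face peft library on a placeholder causal LM; the rank, alpha, and target modules are illustrative assumptions rather than LITA's actual settings.

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Placeholder base LLM; LITA pairs an LLM like this with its vision tower.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-13b-hf")

lora_cfg = LoraConfig(
    r=16,                                  # low-rank dimension (illustrative)
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt (assumption)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)     # base weights stay frozen; adapters are trainable
model.print_trainable_parameters()         # e.g. tens of millions trainable vs. 13B total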
LITA Fine-tuning: Before vs. After

Before fine-tuning
Question: What type of shot is this?
Answer: This is a basketball shot

After fine-tuning
Question: What type of shot is this?
Answer: This is a 2-pointer shot from the corner
TensorRT-LLM: Optimizing LLM and VLM Inference
SoTA performance for Large Language Models in production deployments

TensorRT-LLM is an open-source library to optimize inference performance of the latest Large Language Models on NVIDIA GPUs.

[Charts: A100 vs. H100 with TRT-LLM (performance and TCO) and static vs. in-flight batching (average latency and cost), with gains in the 2x–5x range; numbers are preliminary, based on internal evaluation of Llama 7B on H100]

# define a new activation
def silu(input: Tensor) -> Tensor:
    return input * sigmoid(input)

# implement models like in DL frameworks
class LlamaModel(Module):
    def __init__(self, …):
        self.layers = ModuleList([…])

    def forward(self, …):
        hidden = self.embedding(…)
        for layer in self.layers:
            hidden = layer(hidden)
        return hidden

SoTA Performance – leverages TensorRT compilation & kernels from FasterTransformer, CUTLASS, OpenAI Triton, and more
Ease of Extension – add new operators or models in Python to quickly support new LLMs with optimized performance
LLM Batching with Triton – maximize throughput and GPU utilization through new scheduling techniques for LLMs


Build Models for Any Platform With NVIDIA TAO
Fast-track data generation, AI model creation, and deployment

Data – synthetic data from Omniverse Replicator plus real data
Train – TAO with foundation models (ViT | multi-modal) on edge, workstation, or cloud
Deploy – export to ONNX and run on any inference platform (GPU, CPU, MCU, DLA) at the far edge, edge, or cloud
Announcing TAO Hosted APIs for Fine-tuning and Customization

[Figure: the TAO interface (your front-end UI, the NVIDIA TAO UI, or the TAO CLI client) talks to the TAO API server – an NGC-hosted controller service with API handlers, an AutoML manager, datasets, models, and a data API – which runs the TAO TF1 / TF2 / PyTorch, TAO Deploy, and TAO Data Services containers on NVIDIA DGX Cloud]

Supported FMs (April)
• NV-CLIP
• NV-DINOv2
• Grounding-DINO

Future Support (Q3 2024)
• VILA
• LITA
• EfficientViT-SAM
Summary and CTA

• Unlock new potential with multi-modal foundation models – NV-CLIP, Grounding-DINO, VILA, LITA
• Fine-tuning is essential for last-mile delivery
• NVIDIA AI tools and APIs for building and optimizing generative AI models – TensorRT, NeMo, TAO
• TAO hosted APIs for fine-tuning and customization

[Figure: data modalities (text, images, speech, structured data, 3D signals) train a foundation model, which is adapted to tasks such as visual change detection, video/image summarization, video Q&A, image captioning, object recognition, and image generation]

Sign up for the TAO API EA: NVIDIA TAO Hosted API – Early Access Program