Create Purpose-Built AI Using Vision and Language
Foundation Model Training
[Diagram: foundation models (e.g., a speech FM, a Stable Diffusion FM) are trained on diverse data – structured data, synthetic data, and real data – and evaluated against reference tests.]
Benefits
• Highly generalizable
• Highly versatile and adaptable across many domains
Model Customization
• Full Fine-tuning – update the weights of the entire model, including the foundation backbone, turning the foundation model into your custom fine-tuned model.
• In-context Learning – use visual prompting and model chaining to improve contextual awareness.
Example applications: defect inspection, video agent.
Optimizing Foundation Models for Inference
[Diagram: three complementary techniques – quantizing the FP32 pre-trained model to FP16, INT8, or INT4; distilling a large teacher model into a smaller student by backpropagating a distillation loss; and pruning the original network into a smaller pruned network.]
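The distillation path in this diagram can be made concrete with a short sketch. The following is a minimal, hedged PyTorch example of teacher-student distillation with a temperature-scaled KL loss; the tiny networks, temperature, and loss weighting are illustrative placeholders, not the recipe used for these foundation models.

```python
# Minimal knowledge-distillation sketch in PyTorch (illustrative placeholders only).
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Sequential(nn.Flatten(), nn.Linear(784, 512), nn.ReLU(), nn.Linear(512, 10))  # "FP32 pre-trained" stand-in
student = nn.Sequential(nn.Flatten(), nn.Linear(784, 64), nn.ReLU(), nn.Linear(64, 10))    # smaller student to deploy

teacher.eval()                      # teacher is frozen
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
T, alpha = 4.0, 0.5                 # softmax temperature and loss mix (placeholders)

def distillation_step(images, labels):
    with torch.no_grad():
        t_logits = teacher(images)
    s_logits = student(images)
    # Soft-target loss: KL divergence between temperature-scaled distributions
    kd = F.kl_div(F.log_softmax(s_logits / T, dim=-1),
                  F.softmax(t_logits / T, dim=-1),
                  reduction="batchmean") * (T * T)
    # Hard-target loss on ground-truth labels
    ce = F.cross_entropy(s_logits, labels)
    loss = alpha * kd + (1 - alpha) * ce
    optimizer.zero_grad()
    loss.backward()                 # the "backprop" arrow in the diagram
    optimizer.step()
    return loss.item()

# Example usage with random stand-in data
imgs, lbls = torch.randn(8, 1, 28, 28), torch.randint(0, 10, (8,))
print(distillation_step(imgs, lbls))
```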
Model | Purpose
NV-DINOv2 | Vision-only backbone for downstream Vision AI tasks – image classification, detection, segmentation
NV-CLIP | Image-text matching model that aligns image features with text features; backbone for downstream Vision AI tasks – image classification, detection, segmentation
EfficientViT-SAM | Faster, more efficient version of SAM (Segment Anything Model), a visual FM for segmenting any object based on different types of visual prompts, such as a single coordinate or a bounding box
VILA | Family of visual language models for image and video understanding and Q&A
LITA | Visual language model for video understanding and context with spatial and temporal localization
FoundationPose | 6-DoF object pose estimation and tracking, providing the object pose and 3D bounding box
BEVFusion | Sensor fusion model that fuses multiple input sensors – cameras, LiDAR, radar, etc. – to create a bird's-eye view of the scene with 3D bounding box representations of the objects
NeVA | Multi-modal visual language model for image understanding and Q&A
LiDARSAM | Segment any object based on user-provided text prompts on 3D point-cloud LiDAR data
Multi-Modal Foundation Backbone - NV-CLIP
[Figure: CLIP-style contrastive architecture and benchmark charts comparing NV-CLIP ViT-B and ViT-H backbones on zero-shot accuracy and throughput. † Non-commercial use only. †† Trained on 700M image-text pairs vs. 2B for CLIP.]
• Commercially Viable – trained on ethically sourced data and compares favorably to other non-commercial public models
• Trained on Very Large Dataset – trained on 700M image-text pairs for text and image embeddings
• Foundation Backbone for Vision AI – used in many downstream vision tasks such as zero-shot detection/segmentation, VLMs, and more
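As a hedged illustration of how a CLIP-style image-text backbone is consumed downstream, the snippet below scores an image against a few text prompts using the public openai/clip-vit-base-patch32 checkpoint from Hugging Face transformers as a stand-in; NV-CLIP itself is distributed through NVIDIA's catalog and its packaging may differ. The image path and prompts are illustrative.

```python
# CLIP-style image-text matching sketch (public checkpoint as a stand-in, not NV-CLIP itself).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("warehouse.jpg")                  # placeholder image path
texts = ["a forklift", "a pallet of boxes", "an empty aisle"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image: similarity of the image to each text prompt
probs = outputs.logits_per_image.softmax(dim=-1)
for text, prob in zip(texts, probs[0].tolist()):
    print(f"{text}: {prob:.3f}")
```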
Foundation Backbones
[Diagram: data feeds the NV-DINOv2 / NV-CLIP foundation backbones, which support zero-shot outputs – class label, bounding box, mask, text – with a diffusion model shown for image generation.]
Fine-Tune with 100 or Fewer Samples for Image Classification
[Diagram: the NV-DINOv2 backbone is paired with a lightweight adapter – an MLP head for classification/detection or a decoder for segmentation – and fine-tuned per task on roughly 100 samples. Reported results: industrial defect classification (golden-image comparison) 100%, multi-class satellite imagery segmentation 98%, multi-class indoor warehouse scene segmentation 93.5%.]
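A minimal sketch of this adapter-style, few-shot fine-tuning, assuming a frozen vision backbone with a small trainable MLP head: the public DINOv2 checkpoint from torch.hub stands in for NV-DINOv2, and the class count, image size, and hyperparameters are illustrative placeholders.

```python
# Few-shot adapter fine-tuning sketch: frozen vision backbone + small MLP head.
# DINOv2 from torch.hub is a public stand-in for NV-DINOv2; values are placeholders.
import torch
import torch.nn as nn

backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")
backbone.eval()
for p in backbone.parameters():           # freeze the foundation backbone
    p.requires_grad = False

num_classes = 4                           # e.g. defect types (placeholder)
head = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, num_classes))
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def train_step(images, labels):
    """images: (B, 3, 224, 224) tensors; labels: (B,) class indices."""
    with torch.no_grad():
        feats = backbone(images)          # (B, 768) CLS embeddings
    logits = head(feats)
    loss = loss_fn(logits, labels)
    optimizer.zero_grad()
    loss.backward()                       # only the MLP head is updated
    optimizer.step()
    return loss.item()

# With ~100 labeled samples, a few passes over the small set are usually enough.
images = torch.randn(8, 3, 224, 224)      # stand-in batch
labels = torch.randint(0, num_classes, (8,))
print(train_step(images, labels))
```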
[Diagram: an image passes through a Swin-B image backbone and a text prompt (e.g., “Person wearing glasses”) through a BERT-B text backbone. Image and text features are combined via a feature enhancer, language-guided query selection, cross-modality feature fusion, and a cross-modality decoder to produce grounded detections.]
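This dual-backbone design matches the open Grounding DINO architecture. As a hedged usage sketch, the snippet below runs text-prompted detection with the community IDEA-Research/grounding-dino-tiny checkpoint in Hugging Face transformers; the checkpoint name, image path, and prompt are assumptions for illustration, not necessarily the model shipped in this stack.

```python
# Text-prompted (grounded) detection sketch using an open Grounding DINO checkpoint
# as a stand-in for the Swin-B + BERT-B architecture above.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

model_id = "IDEA-Research/grounding-dino-tiny"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id)

image = Image.open("scene.jpg")                  # placeholder image path
text = "person wearing glasses."                 # phrases are lowercase and '.'-terminated

inputs = processor(images=image, text=text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Post-process into boxes and scores (matched phrases are also returned)
results = processor.post_process_grounded_object_detection(
    outputs, inputs.input_ids, target_sizes=[image.size[::-1]]
)
print(results[0]["boxes"])
print(results[0]["scores"])
```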
Use Cases:
Grounding Annotation
{
  "grounding": {
    "caption": "a wire hanger with a paper cover ...",
    "regions": [
      {
        "bbox": [20, 215, 985, 665],
        "phrase": "a paper cover that reads we heart ..."
      },
      ...
    ]
  }
}
Augmenting LLMs With Visual Input | SOTA Accuracy for Visual Language | Optimizing Inference With 4-bit AWQ
[Chart: tokens/sec throughput of VILA-7B-AWQ and VILA-13B-AWQ on RTX 4090, A100, and AGX Orin.]
https://arxiv.org/pdf/2312.07533.pdf
Improving Spatial and Contextual Awareness Using SoM Prompting
Prompt: “There are 1-8 numeric IDs that are person class. Which numeric ID is in most danger in this image?”
[Diagram: persons in the frame are overlaid with numbered marks, and the VLM answers by referring to a specific ID; events are localized over time (Event @t1, Event @t2, …, Event @tN). A minimal sketch of constructing such a prompt follows after the list below.]
VLM Capabilities
• Spatial and temporal localization
• Contextual understanding
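A minimal sketch of Set-of-Mark (SoM) prompt construction, assuming person boxes already exist from a detector: numeric IDs are drawn onto the frame and referenced in the text prompt. The detector output and the ask_vlm() call are hypothetical placeholders; any detector and VLM endpoint could be substituted.

```python
# Set-of-Mark (SoM) prompting sketch: overlay numeric IDs on detected people,
# then ask the VLM about them by ID. Boxes and ask_vlm() are hypothetical.
from PIL import Image, ImageDraw

def add_marks(image_path, boxes):
    """Draw a numeric ID at the top-left corner of each person box."""
    img = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    for i, (x1, y1, x2, y2) in enumerate(boxes, start=1):
        draw.rectangle([x1, y1, x2, y2], outline="red", width=3)
        draw.text((x1 + 4, y1 + 4), str(i), fill="red")
    return img

# Hypothetical detector output: person bounding boxes in pixel coordinates
person_boxes = [(120, 80, 260, 400), (300, 90, 430, 410), (500, 100, 620, 420)]

marked = add_marks("frame.jpg", person_boxes)      # placeholder image path
marked.save("frame_som.png")

prompt = (
    f"There are 1-{len(person_boxes)} numeric IDs that are person class. "
    "Which numeric ID is in most danger in this image?"
)
# answer = ask_vlm(image="frame_som.png", prompt=prompt)   # hypothetical VLM call
```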
Fine-tuning data workflow: create dense captioning → use GPT-4 to generate Q&A pairs → register the dataset for training → train.

Source event annotation (excerpt):
  ...
    [ 11.0, 14.0 ]
  ],
  "sentences": [
    "Player in black passes to another player in black",
    "Player in black shoots and makes a 2-point shot"
  ],
  "events": [
    "basketball pass",
    "basketball shot"
  ]
}

Generated conversation-format training sample:
{
  "video_id": "820627698-640",
  "conversations": [
    {
      "role": "user",
      "content": "Write a concise and clear dense caption for the provided sports video, focusing on key basketball events such as 3-point shots, 2-point shots, free throws, and passes."
    },
    {
      "role": "assistant",
      "content": "<5.0s> <6.0s> A player in black passes the ball to another player in black. <11.0s> <14.0s> A player in black shoots the ball and completes a 2-point shot."
    },
    {
      "role": "user",
      "content": "At what time does the player in black pass the ball?"
    },
    {
      "role": "assistant",
      "content": "The player in black passes the ball to another player in black at <5.0s>."
    },
    ...
  ]
}
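The step from the event annotation to the conversation-format sample can be scripted. Below is a hedged sketch that mirrors the two excerpts above; the "timestamps" key name is an assumption (that part of the excerpt is truncated), and the GPT-4 rewriting and Q&A-generation steps are omitted, so the caption is built directly from the raw sentences.

```python
# Sketch: turn a timestamped event annotation into a LITA-style conversation sample.
# Field names mirror the excerpts above; GPT-4 caption rewriting / Q&A generation is omitted.
import json

def to_conversation_sample(video_id, annotation):
    # Dense caption: each sentence is prefixed with its <start>/<end> time tokens
    caption = " ".join(
        f"<{start}s> <{end}s> {sentence}."
        for (start, end), sentence in zip(annotation["timestamps"], annotation["sentences"])
    )
    return {
        "video_id": video_id,
        "conversations": [
            {
                "role": "user",
                "content": "Write a concise and clear dense caption for the provided sports video, "
                           "focusing on key basketball events such as 3-point shots, 2-point shots, "
                           "free throws, and passes.",
            },
            {"role": "assistant", "content": caption},
        ],
    }

annotation = {
    "timestamps": [[5.0, 6.0], [11.0, 14.0]],        # key name assumed
    "sentences": [
        "Player in black passes to another player in black",
        "Player in black shoots and makes a 2-point shot",
    ],
    "events": ["basketball pass", "basketball shot"],
}

print(json.dumps(to_conversation_sample("820627698-640", annotation), indent=2))
```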
Fine-tuning LITA for a Custom Use Case
[Diagram: the text prompt is tokenized into text tokens and fed to the model; some components are kept frozen and others are trainable during fine-tuning.]
Training KPIs:
• Sports (basketball) action dataset
• Trained ~1K images in 16 hours on 8xA100
TensorRT-LLM is an open-source library for optimizing inference performance of the latest large language models on NVIDIA GPUs.
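A minimal usage sketch, assuming the high-level Python LLM API available in recent TensorRT-LLM releases; the model name and sampling settings are placeholders, and a visual language model such as VILA is typically served through TensorRT-LLM's multimodal workflow rather than this text-only path.

```python
# Minimal TensorRT-LLM sketch using the high-level Python LLM API (text-only).
# Model name, prompt, and sampling settings are illustrative placeholders.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")   # builds/loads a TensorRT engine
sampling = SamplingParams(temperature=0.8, top_p=0.95)

prompts = ["Summarize the warehouse safety incident in one sentence:"]
for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text)
```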
[Charts: performance, TCO, average latency, and cost comparisons with TensorRT-LLM.]
[Diagram: TAO workflow – Omniverse Replicator generates synthetic data; foundation models are pulled from NVIDIA NGC; TAO fine-tunes them on your data and exports to ONNX for deployment on any inference platform (GPU, CPU, MCU, DLA). TAO services include a workflow manager, TAO CLI client, AutoML, datasets, and models. Future support (Q3 2024): Data API, VILA, image generation.]