Day1-NVIDIA Data Center GPU-Leon-V1
March 2023
NVIDIA Data Center Product Portfolio
• Highest Compute Perf (AI, HPC, Data Processing): DL Training, Scientific Research, Data Analytics
• AI Inference & Mainstream Compute: Language Processing, Conversational AI, Recommender Systems
• Small Footprint Datacenter & Edge AI: Edge AI & Small Inference, Edge Video, Mobile Cloud Gaming
• Highest Graphics Perf (Visual Computing): Cloud Rendering, Cloud XR and vWS, Omniverse
• High-Performance Graphics with AI: Virtual Desktop, Virtual Workstation, Cloud Gaming
• Highest Density Virtual Desktop: Virtual Desktop, Virtual Workstation, Transcoding
NVIDIA A30
Built for AI inference and flexible enterprise compute
• 20X the AI performance of T4 (A30 TF32 FLOPS vs. T4 FP32)
• Multi-Instance GPU (MIG): up to 4 parallel instances per GPU, each with QoS (see the sketch after this list)
• Compute: 3rd-generation Tensor Cores, fast FP64
• High-bandwidth memory: ultra-low latency
• Energy efficient: outstanding performance per watt
• Sparsity acceleration: up to 2X speedup
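The MIG bullet above can be made concrete with a small query sketch, assuming the pynvml bindings (nvidia-ml-py) and a MIG-capable GPU at index 0; the index is a placeholder, and creating the up-to-4 A30 instances is normally done once by an administrator with nvidia-smi.

```python
# Sketch: check MIG mode and list exposed MIG instances via pynvml.
# Assumes nvidia-ml-py is installed and GPU 0 is MIG-capable (placeholder index).
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
current, pending = pynvml.nvmlDeviceGetMigMode(handle)
print("MIG enabled:", current == pynvml.NVML_DEVICE_MIG_ENABLE)

# Walk the MIG devices currently exposed (up to 4 on A30).
for i in range(pynvml.nvmlDeviceGetMaxMigDeviceCount(handle)):
    try:
        mig = pynvml.nvmlDeviceGetMigDeviceHandleByIndex(handle, i)
    except pynvml.NVMLError:
        break  # no more instances
    print(i, pynvml.nvmlDeviceGetName(mig))
pynvml.nvmlShutdown()
```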
3 Reasons to Move from the Previous Generation to A30
Outstanding value and performance of the NVIDIA Ampere generation
[Chart: FP64 TFLOPS comparison, 19.5 vs. 10.3]
NVIDIA A2
Compact, entry-level inference
• Single-slot, low-profile, low power: fits in any server
• Best choice for thermally constrained systems
Higher intelligent video analytics (IVA) performance
• 1.3X the performance of T4
20X the performance of CPU
• Accelerates AI inference and cloud gaming
3 Reasons to Move from the Previous Generation to A2
Outstanding value and lower power with the NVIDIA Ampere architecture
Comparisons of one NVIDIA A2 Tensor Core GPU versus a dual-socket Xeon Gold 6330N CPU
System Config: [CPU: HPE DL380 Gen10 Plus, 2S Xeon Gold 6330N @2.2GHz, 512GB DDR4]
Computer Vision: EfficientDet-D0 (COCO, 512x512) | TensorRT 8.2, Precision: INT8, BS:8 (GPU) | OpenVINO 2021.4, Precision: INT8, BS:8 (CPU)
NLP: BERT-Large (Sequence length: 384, SQuAD: v1.1) | TensorRT 8.2, Precision: INT8, BS:1 (GPU) | OpenVINO 2021.4, Precision: INT8, BS:1 (CPU)
Text-to-Speech: Tacotron2 + Waveglow E2E pipeline (input length: 128) | PyTorch 1.9, Precision: FP16, BS:1 (GPU) | PyTorch 1.9, Precision: FP32, BS:1 (CPU)
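To make the GPU-side settings above concrete, here is a minimal TensorRT Python sketch of an INT8 engine build; the ONNX filename is a placeholder, and a production INT8 build would also supply a calibrator or pre-quantized scales, omitted here.

```python
# Sketch: build an INT8 TensorRT engine as in the benchmark configs above.
# "model.onnx" is a placeholder; calibration is omitted for brevity.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT8)            # INT8 precision, as benchmarked
config.set_flag(trt.BuilderFlag.SPARSE_WEIGHTS)  # optional 2:4 sparsity speedup on Ampere
engine = builder.build_serialized_network(network, config)
with open("model.engine", "wb") as f:
    f.write(engine)
```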
Higher Video Analytics Performance
A2 delivers 1.3X the performance of T4
System Config: [Supermicro SYS-1029GQ-TRT, 2S Xeon Gold 6240 @2.6GHz, 512GB DDR4, 1x NVIDIA A2 OR 1x NVIDIA T4]
Measured performance with DeepStream 5.1. Networks: ShuffleNet-v2 (224x224), MobileNet-v2 (224x224).
This IVA pipeline represents e2e performance with video capture and decode, pre-processing, batching, inference, and post-processing.
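As an illustration only, here is a minimal GStreamer sketch of such a capture-decode-batch-infer pipeline using the DeepStream plugins; the file name, resolution, and nvinfer config path are placeholders, and the measured runs batched multiple streams under DeepStream 5.1.

```python
# Sketch: single-stream IVA pipeline built from DeepStream GStreamer plugins.
# Paths and the nvinfer config file are placeholders.
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst

Gst.init(None)
pipeline = Gst.parse_launch(
    "filesrc location=sample.h264 ! h264parse ! nvv4l2decoder ! "       # capture + decode
    "m.sink_0 nvstreammux name=m batch-size=8 width=224 height=224 ! "  # batching
    "nvinfer config-file-path=shufflenet_config.txt ! "                 # inference
    "fakesink"                                                          # post-processing stub
)
pipeline.set_state(Gst.State.PLAYING)
pipeline.get_bus().timed_pop_filtered(
    Gst.CLOCK_TIME_NONE, Gst.MessageType.EOS | Gst.MessageType.ERROR)
pipeline.set_state(Gst.State.NULL)
```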
NVIDIA A40
The data center GPU for visual computing
2nd-Generation RT Cores: up to 2X the throughput of the previous generation*
48 GB GDDR6 Memory: largest frame buffer for professional graphics
PCIe Gen 4: 2X the bandwidth of PCIe Gen 3 (see the check below)
• 3x DisplayPort 1.4 outputs**
• 2-way NVLink
• Quadro Sync support
• vGPU software support
• Hardware secure boot
Form Factor: 4.4" H x 10.5" L, dual slot
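The 2X PCIe claim follows from link arithmetic; a quick check for an x16 link, per direction, counting only the 128b/130b line coding:

```python
# Back-of-envelope check of PCIe Gen 3 vs Gen 4 bandwidth for an x16 link.
# Gen 3 signals 8 GT/s, Gen 4 doubles that to 16 GT/s, same 128b/130b coding.
def pcie_gb_per_s(gigatransfers: float, lanes: int = 16) -> float:
    encoding = 128 / 130                       # 128b/130b line coding overhead
    return gigatransfers * encoding * lanes / 8  # bits -> bytes

gen3 = pcie_gb_per_s(8.0)    # ~15.8 GB/s per direction
gen4 = pcie_gb_per_s(16.0)   # ~31.5 GB/s per direction
print(f"Gen3: {gen3:.1f} GB/s, Gen4: {gen4:.1f} GB/s, ratio: {gen4 / gen3:.1f}x")
```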
[Chart: performance factor of A40, L40, and T4 across Omniverse (4K), model size (GB), gaming (4K), AI training (FP16), AI inference (INT8), video streaming (streams), and intelligent content understanding (streams). Preliminary estimates, subject to change; L40 Omniverse and gaming results enabled with DLSS 3.]
NVIDIA A10
High-performance graphics & video with AI
Improved Performance
Up to 2.5X faster graphics and inferencing*
Media Acceleration
AV1 Decode, multiple 4K streams, 8K HDR
Flexibly accelerate multiple data center workloads
Deploy virtual workstations & desktops or AI inference
[Charts: A10 relative performance in virtual workstation graphics (note 1) and AI inference (note 2), up to 2.5X.]
1 Test run on a server with 2x Xeon Gold 6154 3.0GHz (3.7GHz Turbo), NVIDIA RTX vWS software, VMware ESXi 7 U2, host/guest driver 461.33. | SPECviewperf 2020 subtest and HD 3dsmax-07 composite.
2 BERT-Large inference: NVIDIA TensorRT 7.2, sequence length = 128, batch size = 128; NGC container 21.02-py3 | ResNet-50 v1.5: NVIDIA TensorRT 7.2, INT8 precision, batch size = 128; NGC container 20.12-py3 | NVIDIA A10 with NVIDIA AI Enterprise software, VMware ESXi 7 U2, host/guest driver 461.33.
NVIDIA A16
Unprecedented user experience and density for graphics-rich VDI
T4 (6 per server): PCIe 3.0 single slot; 70W per GPU (420W total); >1TB system memory support. Use case: entry virtual workstations, virtual desktops for knowledge workers, AI inferencing.
M10 (3 per server): PCIe 3.0 dual slot; 225W per GPU (675W total); <1TB system memory support. Use case: virtual desktops for knowledge workers.
A16 (3 per server): PCIe 4.0 dual slot; 250W per GPU (750W total); >1TB system memory support. Use case: lowest TCO for knowledge workers.
[Charts: per-server comparison of T4, M10, and A16 (relative user density and performance, up to 2.0X and 2.5X); see note 1 below.]
1. Comparison of 6x NVIDIA T4 GPUs versus 3x NVIDIA M10 GPUs versus 3x NVIDIA A16 GPUs per server, assuming 1GB profile per user.
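Combining note 1's board counts with the public per-board memory sizes (T4 16GB, M10 4x8GB, A16 4x16GB) reproduces the A16 density advantage; a quick check under the stated 1GB-per-user assumption:

```python
# Users per server at a 1 GB vGPU profile per user, using note 1's configs.
configs = {"T4": (6, 16), "M10": (3, 32), "A16": (3, 64)}  # (boards, GB per board)
for gpu, (boards, gb) in configs.items():
    print(f"{gpu}: {boards} x {gb} GB -> {boards * gb} users/server")
# T4: 96, M10: 96, A16: 192, i.e. 2X the user density per server.
```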
• H800 PCIe: 1-8 GPUs per server, optional NVLink Bridge for up to 2 GPUs; 80GB; NVIDIA AI Enterprise included
• DGX H800: 8 H800s, full NVLink bandwidth between all GPUs; 640GB; NVIDIA Base Command with NVIDIA AI Enterprise included
• HGX H800: 8 H800s SXM, full NVLink bandwidth between all GPUs; 640GB
NVIDIA H800 PCIe
Unprecedented performance, scalability, and security for mainstream servers
Inference: 1x A800 | 1x H800; workloads use MLPerf settings for the server scenario with latency target | BERT-Large 99.9%: gain from H800 FP8 vs. A800 FP16
Training: HGX A800 vs. HGX H800: Mask R-CNN, BERT-Large, FP16, max batch size | 1K A800 vs. 1K H800: GPT3-175B, Transformer Engine with FP8, BS=2048, Tensor Parallel / Pipeline Parallel / Data Parallel = 1/32/32
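For reference, a minimal sketch of the Transformer Engine FP8 usage named in the training note; layer sizes and the scaling recipe are illustrative, and real GPT3-175B runs add the tensor/pipeline/data parallelism listed above.

```python
# Sketch: one Transformer layer trained with FP8 via Transformer Engine.
# Sizes are illustrative placeholders, not the GPT3-175B configuration.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.HYBRID)
layer = te.TransformerLayer(
    hidden_size=1024, ffn_hidden_size=4096, num_attention_heads=16
).cuda()
x = torch.randn(2048, 1, 1024, device="cuda")  # (seq, batch, hidden)

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)
y.sum().backward()  # FP8 forward/backward through the layer
```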
NVIDIA-Certified Systems
Simplifying large-scale deployment of accelerated computing
Components: NVIDIA server GPUs, NVIDIA workstation GPUs, NVIDIA SmartNICs and DPUs, leading partner servers, leading partner laptops and desktops
Validated for: performance, manageability, security, scalability