Day1-NVIDIA Data Center GPU-Leon-V1

The document outlines NVIDIA's Data Center GPU product lineup as of March 2023, highlighting various models such as A800, A30, A2, A40, A10, A16, and L40, each designed for specific workloads including AI training, inference, and high-performance graphics. Key features include improved performance metrics, energy efficiency, and support for multiple instances, catering to diverse applications from cloud gaming to scientific research. The GPUs leverage the latest Ampere and Ada Lovelace architectures, offering significant advancements in processing power and memory capabilities.


NVIDIA Data Center GPU

March 2023
NVIDIA Data Center Product Portfolio

A800 — Highest Compute Perf AI, HPC, Data Processing (Compute)
• Workloads: DL Training, Scientific Research, Data Analytics
• Fastest Compute, FP64, 7 MIG instances
• 300W | 80GB | 2-Slot FHFL | Liquid | NVLink

A30 — AI Inference & Mainstream Compute (Compute)
• Workloads: Language Processing, Conversational AI, Recommender Systems, Versatile Mainstream Compute
• FP64, up to 4 MIG instances
• 165W | 24GB | 2-Slot FHFL | NVLink

A2 — Small Footprint Datacenter & Edge AI (Compute & Graphics)
• Workloads: Edge AI & Small Inference, Edge Video, Mobile Cloud Gaming, Entry-level Inference
• Compact & Versatile
• 40-60W | 16GB | 1-Slot Low Profile

A40 — Highest Graphics Perf Visual Computing (Compute & Graphics)
• Workloads: Cloud Rendering, Cloud XR and vWS, Omniverse
• Fastest RT Graphics, Largest Render Models
• 300W | 48GB | 2-Slot FHFL | NVLink | 3x DP

A10 — High-Performance Graphics with AI (Graphics & Compute)
• Workloads: Virtual Desktop, Virtual Workstation, Cloud Gaming
• 4K Cloud Gaming, Graphics & Video with AI
• 150W | 24GB | 1-Slot FHFL

A16 — Highest Density Virtual Desktop (Graphics & Compute)
• Workloads: Virtual Desktop, Virtual Workstation, Transcoding, 4K Resolution
• Max # of Encode/Decode Streams
• 250W | 4x 16GB | 2-Slot FHFL


NVIDIA A800 Tensor Core GPU
A800 NVLink & PCIe
A800 Tensor Core GPU liquid-cooled version
Up to 4X faster training of large AI models
NVIDIA A30
Versatile compute acceleration for mainstream enterprise servers

• Purpose-built for inference and flexible enterprise compute: 20X the AI performance of T4 (A30 TF32 FLOPS vs. T4 FP32)
• Multi-Instance GPU: up to 4 parallel instances per GPU with guaranteed QoS
• Compute: 3rd-generation Tensor Cores, fast FP64
• High-bandwidth memory with ultra-low latency
• Energy efficient: outstanding performance per watt
• Sparsity acceleration: up to 2X speedup
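The 2X sparsity speedup refers to fine-grained 2:4 structured sparsity: in every group of four weights, two are zeroed, and the third-generation Tensor Cores skip the zeros. A minimal NumPy sketch of the pruning pattern (illustrative only; `prune_2_4` is a hypothetical helper, not NVIDIA's pruning tooling):

```python
import numpy as np

def prune_2_4(weights: np.ndarray) -> np.ndarray:
    """Zero out the 2 smallest-magnitude weights in every group of 4
    (the 2:4 structured-sparsity pattern Ampere Tensor Cores accelerate)."""
    w = weights.reshape(-1, 4).copy()
    # indices of the two smallest |w| in each group of four
    drop = np.argsort(np.abs(w), axis=1)[:, :2]
    np.put_along_axis(w, drop, 0.0, axis=1)
    return w.reshape(weights.shape)

w = np.array([0.9, -0.1, 0.05, -0.8, 0.3, 0.2, -0.7, 0.01])
sparse = prune_2_4(w)
# exactly half of the weights are now zero, two in every 4-wide group
assert (sparse.reshape(-1, 4) == 0).sum(axis=1).tolist() == [2, 2]
```

Because the zero positions follow a fixed 2-of-4 layout, the hardware can store only the nonzeros plus small metadata and double effective math throughput.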
3 reasons to move from the previous generation to A30
Outstanding value and performance from the NVIDIA Ampere generation

• Superior ROI: higher performance per dollar
• Higher performance & utilization: MIG partitioning with Ampere MIG, 4 instances for QoS
• Easy portability: no changes to the application SW stack
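The QoS guarantee comes from hard partitioning: each MIG instance owns a dedicated slice of memory and compute. A toy accounting model of that idea (illustrative only; real MIG is managed via nvidia-smi and exposes fixed instance profiles, not arbitrary splits):

```python
from dataclasses import dataclass

@dataclass
class MigInstance:
    name: str
    memory_gb: float
    compute_fraction: float

def partition(total_mem_gb: float, n_instances: int, max_instances: int = 4):
    """Split a GPU into n equal, fully isolated slices (toy model of
    A30-style MIG: each instance gets dedicated memory and compute)."""
    if not 1 <= n_instances <= max_instances:
        raise ValueError(f"A30 MIG supports 1-{max_instances} instances")
    return [
        MigInstance(f"slice{i}", total_mem_gb / n_instances, 1.0 / n_instances)
        for i in range(n_instances)
    ]

slices = partition(24, 4)          # A30: 24 GB split four ways
assert all(s.memory_gb == 6.0 for s in slices)
assert sum(s.compute_fraction for s in slices) == 1.0
```

The point of the model: because resources are carved up rather than time-shared, one tenant's workload cannot starve another's, which is what makes consolidation safe.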
A30 FP64 Tensor Cores for HPC
30% faster than Volta

[Chart: peak FP64 TFLOPS — V100 peak, A30 peak 10.3, A800 peak 19.5]


NVIDIA A2
Entry-level GPU that brings NVIDIA AI to any server

Compact entry-level inference
• Single-slot, low-profile, low power – fits any server
• Best choice for thermally constrained systems

Latest Ampere architecture features
• 3rd-generation Tensor Cores, 2nd-generation RT Cores, secure root of trust (RoT)

Higher intelligent video analytics (IVA) performance
• 1.3X the performance of T4

Up to 20X the performance of CPUs
• Acceleration for AI inference and cloud gaming
3 reasons to move from the previous generation to A2
Outstanding value and lower power from the NVIDIA Ampere architecture

• Superior ROI: best perf/$ for compact edge deployments
• Up to 40% lower power: A2 TDP (40-60W) vs. T4 (70W)
• 30% higher IVA performance: superior video decode on A2
Boosting inference performance
Up to 20X higher performance than CPU-only servers

Inference speedup vs. a CPU-only server (CPU = 1X):
• Computer Vision (EfficientDet-D0): NVIDIA A2 8X
• NLP (BERT-Large): NVIDIA A2 7X
• Text-to-Speech (Tacotron2 + Waveglow): NVIDIA A2 20X

Comparisons of one NVIDIA A2 Tensor Core GPU versus a dual-socket Xeon Gold 6330N CPU.
System Config: [CPU: HPE DL380 Gen10 Plus, 2S Xeon Gold 6330N @2.2GHz, 512GB DDR4]
Computer Vision: EfficientDet-D0 (COCO, 512x512) | TensorRT 8.2, Precision: INT8, BS:8 (GPU) | OpenVINO 2021.4, Precision: INT8, BS:8 (CPU)
NLP: BERT-Large (Sequence length: 384, SQuAD: v1.1) | TensorRT 8.2, Precision: INT8, BS:1 (GPU) | OpenVINO 2021.4, Precision: INT8, BS:1 (CPU)
Text-to-Speech: Tacotron2 + Waveglow E2E pipeline (input length: 128) | PyTorch 1.9, Precision: FP16, BS:1 (GPU) | PyTorch 1.9, Precision: FP32, BS:1 (CPU)
NVIDIA CONFIDENTIAL – DO NOT DISTRIBUTE
Higher video analytics performance
A2 delivers 1.3X the performance of T4

System Config: [Supermicro SYS-1029GQ-TRT, 2S Xeon Gold 6240 @2.6GHz, 512GB DDR4, 1x NVIDIA A2 OR 1x NVIDIA T4]
Measured performance with DeepStream 5.1. Networks: ShuffleNet-v2 (224x224), MobileNet-v2 (224x224).
This IVA pipeline represents e2e performance with video capture and decode, pre-processing, batching, inference, and post-processing.
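The e2e stages listed above (capture/decode, pre-processing, batching, inference, post-processing) chain together as a streaming pipeline. A toy Python sketch of that data flow, with the GPU-accelerated DeepStream stages replaced by stand-ins:

```python
def decode(frames):
    # stand-in for video capture + hardware decode
    for f in frames:
        yield f

def preprocess(frames, size=224):
    # stand-in for resize/normalize to the network input size
    for f in frames:
        yield {"frame": f, "size": size}

def batch(items, batch_size=4):
    # group items so the inference stage runs on full batches
    buf = []
    for item in items:
        buf.append(item)
        if len(buf) == batch_size:
            yield buf
            buf = []
    if buf:
        yield buf  # flush the final partial batch

def infer(batches):
    # stand-in for the GPU inference step (e.g. ShuffleNet/MobileNet)
    for b in batches:
        yield [{"frame": item["frame"], "label": "object"} for item in b]

def postprocess(results):
    # flatten batched results back into a per-frame detection stream
    for b in results:
        yield from b

frames = range(10)
detections = list(postprocess(infer(batch(preprocess(decode(frames))))))
assert len(detections) == 10  # one detection record per input frame
```

Measuring e2e throughput across the whole chain, as the benchmark above does, captures decode and batching costs that a pure inference number would hide.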
NVIDIA A40
Visual computing data center GPU

• NVIDIA Ampere architecture CUDA Cores: up to 2X the FP32 throughput of the previous generation*
• 2nd-generation RT Cores: up to 2X the throughput of the previous generation*
• 3rd-generation Tensor Cores: up to 5X throughput with TF32*
• 48 GB GDDR6 memory: largest frame buffer for professional graphics
• PCIe Gen 4: 2X the bandwidth of PCIe Gen 3
• 3x DisplayPort 1.4 outputs**
• 2-way NVLink
• Quadro Sync support
• vGPU software support
• Hardware secure boot

* Performance measures gen-to-gen comparison of RTX 6000 to NVIDIA A40
** A40 is configured for virtualization by default with physical display connectors disabled. The display outputs can be enabled via management software tools.
NVIDIA L40
Unprecedented visual computing performance for the data center

• Next-generation CUDA Cores
• 4th-generation Tensor Cores
• 3rd-generation RT Cores
• 48 GB GDDR6 GPU memory with ECC
• 300W
• Secure root of trust
NVIDIA L40 generational comparison
(Tensor Core figures are dense / with sparsity)

Spec | NVIDIA L40 | NVIDIA A40
GPU Architecture | NVIDIA Ada Lovelace Architecture | NVIDIA Ampere Architecture
FP32 | 90.5 TFLOPS | 37.4 TFLOPS
RT Core | 209 TFLOPS | 73.1 TFLOPS
Tensor Float 32 (TF32) | 90.5 / 181** TFLOPS | 74.8 / 149.6* TFLOPS
BFLOAT16 Tensor Core | 181 / 362** TFLOPS | 149.7 / 299.4* TFLOPS
FP16 Tensor Core | 181 / 362** TFLOPS | 149.7 / 299.4* TFLOPS
FP8 Tensor Core | 362 / 724** TFLOPS | NA
INT8 Tensor Core | 362 / 724** TOPS | 299.3 / 598.6* TOPS
INT4 Tensor Core | 724 / 1448** TOPS | 598.7 / 1197.4* TOPS
GPU Memory | 48 GB GDDR6 w/ ECC | 48 GB GDDR6 w/ ECC
GPU Memory Bandwidth | 864 GB/s | 696 GB/s
Max Thermal Design Power (TDP) | 300 W | 300 W
Form Factor | 4.4" H x 10.5" L, dual slot | 4.4" H x 10.5" L, dual slot
Interconnect | PCIe Gen4 x16: 64 GB/s | PCIe Gen4 x16: 64 GB/s; NVIDIA® NVLink® bridge for 2 GPUs: 112.5 GB/s
Server Options | Partner and NVIDIA-Certified Systems™, NVIDIA® OVX™ | Partner and NVIDIA-Certified Systems™, NVIDIA® OVX™

* Preliminary specifications, subject to change.
** Structural sparsity enabled
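In each Tensor Core pair above, the second figure is the structured-sparsity number, exactly 2X the dense figure; and on L40 each halving of operand width (TF32 → FP16 → FP8, INT8 → INT4) doubles peak throughput. A quick consistency check of the L40 column, assuming the table values above:

```python
# L40 peak Tensor Core throughput from the table (dense, TFLOPS/TOPS)
l40_dense = {"tf32": 90.5, "fp16": 181, "fp8": 362, "int8": 362, "int4": 724}

def with_sparsity(dense):
    # structured sparsity doubles peak math throughput
    return 2 * dense

assert with_sparsity(l40_dense["tf32"]) == 181
assert with_sparsity(l40_dense["fp16"]) == 362
# each halving of operand width doubles the peak rate
assert l40_dense["fp16"] == 2 * l40_dense["tf32"]
assert l40_dense["fp8"] == 2 * l40_dense["fp16"]
assert l40_dense["int4"] == 2 * l40_dense["int8"]
```

This ladder is why dropping precision (where accuracy allows) is the cheapest way to buy inference throughput on these parts.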
Multiplied performance across other diverse workloads
NVIDIA L40 is performance-optimized for Omniverse

[Chart: relative performance of T4, A40, and L40, normalized to T4, across Omniverse (4K), Model Size (GB), Gaming (4K), AI Training (FP16), AI Inference (INT8), Video Streaming (streams), and Intelligent Content Understanding (streams); gains range from 1.1x to 12.0x]

Preliminary estimates, subject to change. L40 OV and Gaming enabled with DLSS3.
NVIDIA A10
High-performance graphics & video with AI

• NVIDIA Ampere architecture: 2nd-gen RT Cores, 3rd-gen Tensor Cores
• 24GB GDDR6 memory: 1.5X the memory of the previous generation*
• Improved performance: up to 2.5X faster graphics and inferencing*
• High-density, power-efficient: single-slot form factor, 150W
• Media acceleration: AV1 decode, multiple 4K streams, 8K HDR
• Flexibly accelerate multiple data center workloads: deploy virtual workstations & desktops or AI inference

* Gen-to-gen comparison of NVIDIA T4 to NVIDIA A10
NVIDIA A10 for mixed workloads
NVIDIA A10 delivers up to 2.5X higher performance at the best price-performance

[Charts, normalized to T4 = 1X: up to 2.5X faster graphics performance¹ (T4 vs. A10 vs. A40); up to 2X better graphics performance per dollar¹ (T4 vs. A40 vs. A10); up to 2.5X better inference performance² for A10 + NVIDIA AI Enterprise vs. T4 on ResNet-50 v1.5 and BERT-Large inference]

1 Test run on a server with 2x Xeon Gold 6154 3.0GHz (3.7GHz Turbo), NVIDIA RTX vWS software, VMware ESXi 7 U2, host/guest driver 461.33. | SPECviewperf 2020 subtest, and HD 3dsmax-07 composite.
2 BERT-Large inference: NVIDIA TensorRT 7.2, Seq Length = 128, batch size = 128; NGC Container: 21.02-py3 | ResNet-50 v1.5: NVIDIA TensorRT 7.2, INT8 precision, batch size = 128, NGC Container: 20.12-py3 | NVIDIA A10 with NVIDIA AI Enterprise software, VMware ESXi 7 U2, host/guest driver 461.33
NVIDIA A16
Unprecedented user experience and density for graphics-rich VDI

• Purpose-built for high user density: 2X density versus previous generation¹
• Lowest cost per virtual workstation user: affordable entry virtual workstations²
• 4x 16GB GDDR6 memory: up to 64 multimedia-rich virtual desktops per board; larger framebuffer per user for entry CAD virtual workstations³
• Flexibility for heterogeneous users: simultaneously host different user profiles on one board
• Highest-quality video: supports H.265 encode/decode, VP9 and AV1 decode; multiuser performance for streaming video & multimedia; more than 2X encoder throughput¹
• Latest NVIDIA Ampere architecture: 2nd-gen RT Cores, 3rd-gen Tensor Cores

1. Gen-to-gen comparison of NVIDIA M10 to NVIDIA A16
2. Comparison of NVIDIA A16 vs. T4, RTX 6000, RTX 8000, and A40
3. Gen-to-gen comparison of NVIDIA T4 to NVIDIA A16
Accelerating graphics-rich applications
Boost productivity with NVIDIA vPC

• Multiple high-resolution monitors: multi-monitor setups are becoming more common
• Productivity apps: productivity apps are becoming more graphics-intensive
• Video conferencing tools: virtual meetings and classrooms enable users to collaborate effectively
• Multimedia streaming: YouTube and video training are standard for day-to-day business needs
• Interactive web: WebGL is prevalent and taxing on CPU utilization
• Windows 10: increased graphics usage
NVIDIA DATA CENTER GPUs

6x NVIDIA T4 — Density: 96 users | Form factor: PCIe 3.0 single slot | Power: 70W per GPU (420W) | CODECs: VP9, H.265, H.264 | System memory support: > 1TB | Use case: entry virtual workstations, virtual desktops for knowledge workers, AI inferencing
3x NVIDIA M10 — Density: 96 users | Form factor: PCIe 3.0 dual slot | Power: 225W per GPU (675W) | CODECs: H.264 | System memory support: < 1TB | Use case: virtual desktops for knowledge workers
3x NVIDIA A16 — Density: 192 users | Form factor: PCIe 4.0 dual slot | Power: 250W per GPU (750W) | CODECs: VP9, H.265, H.264 | System memory support: > 1TB | Use case: lowest TCO for knowledge workers


Increase user density and lower total cost of ownership

[Charts: up to 2X more users per server¹ (T4 1.0X, M10 1.0X, A16 2.0X) and up to 30% lower cost per user² (relative cost per user: T4 1.3X, M10 1.1X, A16 1.0X)]

1. Comparison of 6x NVIDIA T4 GPUs versus 3x NVIDIA M10 GPUs versus 3x NVIDIA A16 GPUs per server, assuming 1GB profile per user.
2. Comparison of a configured server with 6x T4 versus 3x M10 versus 3x A16 GPUs.
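The density figures follow from framebuffer per board under footnote 1's 1GB-per-user profile: users per server = boards × board memory ÷ profile size. An illustrative calculation (the M10's 32GB (4x 8GB) board memory is an assumption; it is not stated on this slide):

```python
def users_per_server(boards, mem_per_board_gb, profile_gb=1.0):
    """vPC user density, assuming density is framebuffer-limited."""
    return int(boards * mem_per_board_gb // profile_gb)

# per-server configurations from the comparison table
t4  = users_per_server(boards=6, mem_per_board_gb=16)  # 6x T4 (16GB each)
m10 = users_per_server(boards=3, mem_per_board_gb=32)  # 3x M10 (assumed 4x 8GB)
a16 = users_per_server(boards=3, mem_per_board_gb=64)  # 3x A16 (4x 16GB)

assert (t4, m10, a16) == (96, 96, 192)
assert a16 == 2 * t4  # the "2X more users per server" claim
```

Larger per-user profiles (e.g. 2GB for entry virtual workstations) halve these densities, which is why the framebuffer-per-board figure matters.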


H800 delivered via NVIDIA DGX H800 and H800 PCIe

• H800 PCIe: 1-8 GPUs per server, optional NVLink Bridge for up to 2 GPUs; 80GB; NVIDIA AI Enterprise included
• HGX H800 8-GPU: 8 H800s, full NVLink bandwidth between all GPUs; 640GB
• DGX H800: 8 H800s SXM, full NVLink bandwidth between all GPUs; 640GB; NVIDIA Base Command software with NVIDIA AI Enterprise included
NVIDIA H800 PCIe
Unprecedented performance, scalability, and security for mainstream servers

Highest AI and HPC mainstream performance
• 3PF FP8 (5X) | 1.5PF FP16 (2.4X) | 756TF TF32 (2.4X) | 51TF FP64 (2.6X)
• 6X faster dynamic programming with DPX instructions
• 2TB/s bandwidth, 80GB HBM2e memory

Highest compute energy efficiency
• Configurable TDP: 200W to 350W
• 2-slot FHFL mainstream form factor

Highest utilization efficiency and security
• 7 fully isolated and secured instances with guaranteed QoS
• 2nd-generation MIG | Confidential Computing

Highest-performing server connectivity
• 128GB/s PCIe Gen5
• 600 GB/s GPU-to-GPU connectivity (5X PCIe Gen5), up to 2 GPUs with NVLink Bridge

FP8, FP16, and TF32 performance include sparsity. X-factors compared to A800.
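To put the connectivity bullets in perspective, a back-of-the-envelope transfer-time estimate using the slide's peak figures (128GB/s PCIe Gen5, 600GB/s NVLink GPU-to-GPU); real transfers add protocol overhead and latency:

```python
def transfer_seconds(gigabytes, bandwidth_gb_s):
    """Idealized time to move a payload at peak link bandwidth
    (ignores protocol overhead and latency)."""
    return gigabytes / bandwidth_gb_s

pcie_gen5 = 128  # GB/s, from the slide
nvlink = 600     # GB/s GPU-to-GPU, from the slide

# moving the full 80 GB HBM2e contents between two H800s
t_pcie = transfer_seconds(80, pcie_gen5)      # 0.625 s at peak PCIe
t_nvlink = transfer_seconds(80, nvlink)       # ~0.133 s over NVLink

assert t_pcie == 0.625
assert round(nvlink / pcie_gen5, 2) == 4.69   # the slide's "5X PCIe Gen5"
```

For multi-GPU training, where gradient exchange happens every step, this roughly 5X gap in GPU-to-GPU bandwidth translates directly into scaling efficiency.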


NVIDIA H800 AI performance

Inference Performance | DL Training Performance

Inference: x1 A800 vs. x1 H800; workloads use MLPerf settings for server with latency target | BERT-Large 99.9%: gain from H800 FP8 vs. A800 FP16
Training: HGX A800 vs. HGX H800 on Mask R-CNN and BERT-Large, FP16, max batch size | 1K A800 vs. 1K H800 on GPT3-175B, Transformer Engine with FP8, BS=2048 with Tensor Parallel / Pipeline Parallel / Data Parallel = 1/32/32
NVIDIA-Certified Systems
Simplifying large-scale deployment of accelerated computing

System design options:
• NVIDIA server GPUs, NVIDIA SmartNICs and DPUs, leading partner servers
• NVIDIA workstation GPUs, leading partner laptops and desktops

Validates the best baseline configuration for performance, manageability, security, and scalability.
