Internal and Architecture

The document discusses Azure Synapse architecture and best practices for data distribution, types, and table design in an Azure Synapse data warehouse. It covers topics like MPP, billing, data distribution techniques like hash, round-robin and replicate. It also discusses table types, partitioning, and provides best practices for designing fact and dimension tables. It demonstrates analyzing data distribution in an on-premises data warehouse before migrating to Azure Synapse.

Uploaded by

Renganathan Umanath


Eshant Garg

Data Engineer, Architect, Advisor

eshant.garg@gmail.com
Introduction

MPP or Massively Parallel Processing

Storage & Data Distribution (Hash, Round-robin, Replicate)
Data types and Table types (Columnstore, Heap, Clustered B-tree index)
Partitioning and Distribution key
Applications in Dimensional modeling
Demo – Table Analysis before Migration to Cloud
Azure Synapse MPP Architecture

DWU    Loading 3 tables    Ran report
100    15                  20
500    3                   4

Source: Microsoft
Azure Storage and Distribution

SQL DW charges separately for storage consumption.

A distribution is the basic unit of storage and processing for parallel queries.

Rows are stored across 60 distributions, which are run in parallel.

Each compute node manages one or more of the 60 distributions.
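
As a rough illustration of the last point, the sketch below spreads the 60 distributions over a given number of compute nodes. The function name and the node counts tried are hypothetical choices for this example; the only fact taken from the slide is the fixed count of 60 distributions.

```python
# Sketch: how 60 distributions could be shared among compute nodes.
# NUM_DISTRIBUTIONS comes from the slide; everything else is illustrative.
NUM_DISTRIBUTIONS = 60


def distributions_per_node(compute_nodes):
    """Return how many distributions each node manages, spread as evenly as possible."""
    if not 1 <= compute_nodes <= NUM_DISTRIBUTIONS:
        raise ValueError("compute_nodes must be between 1 and 60")
    base, extra = divmod(NUM_DISTRIBUTIONS, compute_nodes)
    # The first `extra` nodes take one additional distribution.
    return [base + 1 if i < extra else base for i in range(compute_nodes)]


for nodes in (1, 2, 6, 60):
    print(nodes, "node(s):", distributions_per_node(nodes)[0], "distributions on the first node")
```

With a single node, that node owns all 60 distributions; adding nodes reduces the share each one has to scan, which is where the parallel speed-up comes from.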

Sharding Patterns
Replicated Tables

• Caches a full copy on each compute node.

• Used for small tables

CREATE TABLE [dbo].[BusinessHierarchies](

[BookId] [nvarchar](250) ,
[Division] [nvarchar](100) ,
[Cluster] [nvarchar](100) ,
[Desk] [nvarchar](100) ,
[Book] [nvarchar](100) ,
[Volcker] [nvarchar](100) ,
[Region] [nvarchar](100)
)
WITH
(
CLUSTERED COLUMNSTORE INDEX
, DISTRIBUTION = REPLICATE
)
;

Source: Microsoft
Round Robin tables

• Generally used to load staging tables
• Distributes data evenly across the table without additional optimization
• Joins are slow, because they require reshuffling data
• Default distribution type

CREATE TABLE [dbo].[Dates](
[Date] [datetime2](3) ,
[DateKey] [decimal](38, 0) ,
..
..
[WeekDay] [nvarchar](100) ,
[Day Of Month] [decimal](38, 0)
)
WITH
(
CLUSTERED COLUMNSTORE INDEX
, DISTRIBUTION = ROUND_ROBIN
)
;
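
The round-robin trade-off can be modelled in a few lines. This is only a sketch: the rows, the `key` column and the function name are made up for illustration, and the real engine's assignment order is not documented on the slide. It shows that rows land evenly, but rows sharing a join key end up on many different distributions, which is why joins need data movement.

```python
from collections import Counter
from itertools import cycle

NUM_DISTRIBUTIONS = 60  # fixed distribution count from the slides


def round_robin_assign(rows):
    """Pair each row with the next distribution in turn, ignoring row content."""
    return list(zip(cycle(range(NUM_DISTRIBUTIONS)), rows))


# 600 hypothetical staging rows that share only three join-key values.
rows = [{"id": i, "key": "ABC"[i % 3]} for i in range(600)]
assigned = round_robin_assign(rows)

rows_per_distribution = Counter(dist for dist, _ in assigned)
distributions_holding_a = {dist for dist, row in assigned if row["key"] == "A"}

print(set(rows_per_distribution.values()))  # one value only: every distribution is equally full
print(len(distributions_holding_a))         # key "A" is scattered over many distributions
```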

Source: Microsoft
Hash Distribution Tables

• Highest performance for large tables
• Each row belongs to exactly one distribution
• Used mostly for larger tables

Source: Microsoft
Record    Product     Store
1         Soccer      New York
2         Soccer      Los Angeles
3         Football    Phoenix
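
To make the idea concrete, the sketch below assigns the three example records to distributions by hashing one column. Two things here are assumptions: the slide does not say which column is the distribution key (Product is used), and SHA-256 is only a stand-in, since the engine's own hash function is internal. The point it demonstrates is that equal key values always map to the same distribution.

```python
import hashlib

NUM_DISTRIBUTIONS = 60  # fixed distribution count from the slides


def distribution_for(key):
    """Map a key value to a distribution number (SHA-256 is a stand-in hash)."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_DISTRIBUTIONS


records = [
    (1, "Soccer", "New York"),
    (2, "Soccer", "Los Angeles"),
    (3, "Football", "Phoenix"),
]

for record_id, product, store in records:
    print(record_id, product, store, "-> distribution", distribution_for(product))
```

Records 1 and 2 share the Product value, so they are stored together; a join or aggregation on that column can then be answered without moving rows between distributions.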

CREATE TABLE [dbo].[EquityTimeSeriesData](

[Date] [varchar](30) ,
[BookId] [decimal](38, 0) ,
[P&L] [decimal](31, 7) ,
[VaRLower] [decimal](31, 7)
)
WITH
(
CLUSTERED COLUMNSTORE INDEX
, DISTRIBUTION = HASH([P&L])
)
;

Source: Microsoft
Avoid Data Skew
Even Distribution
The distribution method determines how Azure SQL Data Warehouse spreads the data across multiple nodes.

Azure SQL Data Warehouse uses up to 60 distributions when loading data into the system.
Good Hash Key

• Distributes evenly
• Has more than 60 distinct values
• Is not updated
• Used for grouping
• Used as join condition
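
Two of these criteria, the distinct-value count and even distribution, can be checked on sample data before choosing a key. The sketch below is one possible way to do that, not the service's own method: SHA-256 stands in for the internal hash, the skew ratio is a definition chosen for this example, and both sample columns are invented.

```python
import hashlib
from collections import Counter

NUM_DISTRIBUTIONS = 60  # fixed distribution count from the slides


def distribution_for(value):
    """Map a value to a distribution number (SHA-256 is a stand-in hash)."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_DISTRIBUTIONS


def profile_hash_key(values):
    """Return (distinct_count, skew) for a candidate distribution column.

    skew = rows in the fullest distribution / ideal rows per distribution,
    so 1.0 is perfectly even and larger values mean more skew.
    """
    values = list(values)
    if not values:
        raise ValueError("values must not be empty")
    per_distribution = Counter(distribution_for(v) for v in values)
    ideal = len(values) / NUM_DISTRIBUTIONS
    return len(set(values)), max(per_distribution.values()) / ideal


# Hypothetical columns: a high-cardinality id and a three-value status flag.
order_ids = range(60_000)
statuses = ["open", "closed", "void"] * 20_000

print("order_id:", profile_hash_key(order_ids))
print("status:  ", profile_hash_key(statuses))
```

A column with only three distinct values can fill at most three distributions, so its skew ratio is very high no matter how the hash behaves; the id column stays close to 1.0.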
What Data Distribution to Use?
Replicated
Great fit for:
• Small dimension tables in a star schema with less than 2 GB of storage after compression
Watch out if:
• Many write transactions are on the table (insert/update/delete)
• You change DWU provisioning frequently
• You use only 2-3 columns, but your table has many columns
• You index a replicated table

Round-robin (default)
Great fit for:
• Temporary/staging tables
• No obvious joining key or good candidate column
Watch out if:
• Performance is slow due to data movement

Hash
Great fit for:
• Fact tables
• Large dimension tables
Watch out if:
• The distribution key can’t be updated
Data types

Use the smallest data type that will support your data.

Avoid defining all character columns with a large default length.

Define columns as VARCHAR rather than NVARCHAR if you don’t need Unicode.

The goal is not only to save space but also to move data as efficiently as possible.
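
The VARCHAR versus NVARCHAR advice can be quantified with a small calculation. NVARCHAR stores UTF-16 code units (two bytes for each of the characters used here), while VARCHAR under a single-byte code page stores one byte per character. In this sketch, Python's `latin-1` codec is used as a stand-in for such a code page, per-value length overhead is ignored, and the function name is hypothetical.

```python
def storage_bytes(text):
    """Return (single_byte_size, utf16_size) for a string value, in bytes."""
    single_byte = len(text.encode("latin-1"))  # stand-in for a VARCHAR code page
    utf16 = len(text.encode("utf-16-le"))      # NVARCHAR stores UTF-16 code units
    return single_byte, utf16


values = ["New York", "Los Angeles", "Phoenix"]  # store names used earlier in the deck
total_varchar = sum(storage_bytes(v)[0] for v in values)
total_nvarchar = sum(storage_bytes(v)[1] for v in values)

print("VARCHAR bytes: ", total_varchar)
print("NVARCHAR bytes:", total_nvarchar)
```

For plain ASCII text the NVARCHAR column is twice the size, and that extra volume is also what has to be moved between distributions during joins.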
Some complex data types (XML, geography, etc.) are not supported on Azure SQL Data Warehouse yet.
Table types
Clustered columnstore
• Updateable primary storage method
• Great for read-only workloads
• High compression ratio
• Default table type
• Ideally segments of 1M rows
• No secondary indexes

Heap
• Data is not in any particular order
• Use when data has no natural order
• No index on the data
• Fast load
• Allows secondary indexes
• No compression

Clustered index (B-tree)
• An index that is physically stored in the same order as the data being indexed
• Sorted index on the data
• Fast singleton lookup
• Allows secondary indexes
• No compression
Table Partitioning

Why Partitioning?
• Table partitions enable you to divide your data into smaller groups of data
• Improve the efficiency and performance of loading data by use of partition deletion, switching and merging
• Usually data is partitioned on a date column tied to when the data is loaded into the database
• Can also be used to improve query performance
Partitions best practices

Creating a table with too many partitions can hurt performance under some circumstances.

Usually a successful partitioning scheme has from ten to a few hundred partitions.

For clustered columnstore tables, it is important to consider how many rows belong to each partition.

Before partitions are created, SQL Data Warehouse already divides each table into 60 distributed databases.

A highly granular partitioning scheme can work in SQL Server but hurt performance in Azure SQL Data Warehouse.
Example

60 distributions × 365 partitions = 21,900 data buckets

21,900 data buckets × ideal segment size (1M rows) = 21,900,000,000 rows

Lower granularity (week, month) can perform better, depending on how much data you have.
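
The arithmetic in this example generalises to other partition granularities. The sketch below repeats the calculation for daily, weekly and monthly partitions; the weekly and monthly counts (52 and 12 per year) and the function name are assumptions added for comparison, while the 60 distributions and the 1M-row segment target come from the slides.

```python
NUM_DISTRIBUTIONS = 60              # fixed by the service, per the slides
IDEAL_ROWS_PER_SEGMENT = 1_000_000  # ideal columnstore segment size, per the slides


def rows_needed(partitions):
    """Rows required so that every distribution/partition bucket holds a full segment."""
    return NUM_DISTRIBUTIONS * partitions * IDEAL_ROWS_PER_SEGMENT


for label, partitions in (("daily", 365), ("weekly", 52), ("monthly", 12)):
    buckets = NUM_DISTRIBUTIONS * partitions
    print(f"{label:8s} {buckets:>6,} buckets  {rows_needed(partitions):>15,} rows")
```

A table with far fewer rows than the figure for its granularity leaves segments underfilled, which is the case where a coarser partition scheme tends to be the better fit.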
Fact Tables

Large ones are better as columnstores

Distributed through a hash key as much as possible, as long as the distribution is even

Partitioned only if the table is large enough to fill up each segment
Dimension Tables

Can be Hash distributed or Round-Robin if there is no clear candidate join key

Columnstore for large dimensions

Heap or Clustered Index for small dimensions

Add secondary indexes for alternate join columns

Partitioning not recommended

DEMO
Analyse data distribution in an on-premises data warehouse before migrating to an Azure Synapse SQL pool.

• We will use Microsoft’s AdventureWorksDW database as the on-premises data warehouse.

• We will analyse one dimension table and one fact table.
• The same process can be repeated for the other tables of the on-premises database.
Summary
MPP or Massively Parallel Processing
Billing = Compute + Storage
Data Distribution (Hash, Round-robin, Replicate)
Data types and Table types
Partitioning Data
Best practice – Fact and Dimension table design
Demo – Analyse Data Distribution

