0% found this document useful (0 votes)
30 views34 pages

Internal and Architecture

The document discusses Azure Synapse architecture and best practices for data distribution, types, and table design in an Azure Synapse data warehouse. It covers topics like MPP, billing, data distribution techniques like hash, round-robin and replicate. It also discusses table types, partitioning, and provides best practices for designing fact and dimension tables. It demonstrates analyzing data distribution in an on-premises data warehouse before migrating to Azure Synapse.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
30 views34 pages

Internal and Architecture

The document discusses Azure Synapse architecture and best practices for data distribution, types, and table design in an Azure Synapse data warehouse. It covers topics like MPP, billing, data distribution techniques like hash, round-robin and replicate. It also discusses table types, partitioning, and provides best practices for designing fact and dimension tables. It demonstrates analyzing data distribution in an on-premises data warehouse before migrating to Azure Synapse.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 34

Eshant Garg

Data Engineer, Architect, Advisor


eshant.garg@gmail.com
Introduction

MPP or Massive Parallel Processing


Storage & Data Distribution (Hash, Round-robin, Replicate)
Data types and Table types (Columstore, Heap, Clustered B-tree index)
Partitioning and Distribution key
Applications in Dimensional modeling
Demo – Table Analysis before Migration to Cloud
Azure Synapse MPP Architecture

DWU Loading Ran


3 Tables Report
100 15 20
500 3 4

Source: Microsoft
Azure Storage and Distribution

SQL DW charges separately for storage consumption

A distribution is the basic unit of storage and processing for parallel


queries

Rows are stored across 60 distributions which are run in parallel

Each compute node manages one or more of the 60 distribution


Sharding Patterns
Replicated Tables

• Caches a full copy on each compute node.


• Used for small tables

CREATE TABLE [dbo].[BusinessHierarchies](


[BookId] [nvarchar](250) ,
[Division] [nvarchar](100) ,
[Cluster] [nvarchar](100) ,
[Desk] [nvarchar](100) ,
[Book] [nvarchar](100) ,
[Volcker] [nvarchar](100) ,
[Region] [nvarchar](100)
)
WITH
(
CLUSTERED COLUMNSTORE INDEX
, DISTRIBUTION = REPLICATE
)
;

Source: Microsoft
Round Robin tables
CREATE TABLE [dbo].[Dates](
[Date] [datetime2](3) ,
[DateKey] [decimal](38, 0) ,
..
..
[WeekDay] [nvarchar](100) ,
[Day Of Month] [decimal](38, 0)
)

WITH
(
• Generally use to load staging tables CLUSTERED COLUMNSTORE INDEX
, DISTRIBUTION = ROUND_ROBIN
• Distribute data evenly across the table without )
;
additional optimization
• Joins are slow, because it requires to reshuffle data
• Default distribution type

Source: Microsoft
Hash Distribution Tables

• Highest performance for large tables


• Each row belong to one particular distribution
• It is used mostly for larger tables

Source: Microsoft
Hash Distribution Tables

Record Product Store


1 Soccer New York
2 Soccer Los Angeles
3 Football Phoenix
Hash Distribution Tables
• Highest performance for large tables
• Each row belong to one particular
distribution
• It is used mostly for larger tables

CREATE TABLE [dbo].[EquityTimeSeriesData](


[Date] [varchar](30) ,
[BookId] [decimal](38, 0) ,
[P&L] [decimal](31, 7) ,
[VaRLower] [decimal](31, 7)
)
WITH
(
CLUSTERED COLUMNSTORE INDEX
, DISTRIBUTION = HASH([P&L])
)
;

Source: Microsoft
Avoid Data Skew
Even Distribution
Determines the method in which Azure SQL Data Warehouse spreads the data
across multiple nodes.

Azure SQL Data Warehouse uses up to 60 distributions when loading data into the
system.
Good Hash Key

Has more than


Distributes
60 distinct
Evenly
values

Is Not Used for


Updated Grouping

Used as Join
condition
What Data Distribution to Use?
Type Great fit for Watch out if…

Replicated Small-dimension tables in a • Many write transaction are on the table


star schema with less than (insert/update/delete)
2GB of storage after • You change DWU provisioning frequently
compression • You use only 2-3 columns, but your table has
many columns
• You index a replicated table

Round-robin (default) • Temporary/Staging table Performance is slow due to data movement


• No obvious joining key or
good candidate column.

hash • Fact tables The distribution key can’t be updated


• Large dimension tables
Data types

Use the smallest data type which will support your data

Avoid defining all character columns to a large default


length

Define columns as VARCHAR rather than NVARCHAR if


you don’t need Unicode
Data types

The goal is to not only save space but also move data as efficiently as possible.
Data types

Some complex data types (XML, geography, etc)


are not supported on Azure SQL Data
Warehouse yet.
Table types
Clustered • Updateable primary storage method
columnstore • Great for read-only

• Data is not in any particular order.


Heap • Use when data has no natural order.

• An index that is physically stored in the same


Clustered Index order as the data being indexed
High compression
Default table type
ratio

Clustered
columnstore

Ideally segments of No Secondary


1M rows Indexes
No index on the data Fast Load

Heap

Allows secondary
No compression
indexes
Sorted index on the data Fast singleton lookup

Clustered
B-Tree

Allows secondary
No compression
indexes
Table Partitioning
Table
Partitioning
Table partitions enable you to divide your data into
smaller groups of data
Improve the efficiency and performance of loading data
by use of partition deletion, switching and merging
Usually data is partitioned on a date column tied to when
the data is loaded into the database

Can also be used to improve query performance


Why Partitioning?
Partitions best practices

Creating a table Too many partitions can hurt


performance under some circumstances

Usually a successful partitioning scheme has 10 or a few


hundred partitions

Clustered column store tables, it is important to consider


how many rows belong to each partition

Before partitions are created, SQL Data warehouse


already divides each table into 60 distributed databases
A highly granular partitioning scheme can work
in SQL Server but hurt performance in Azure
SQL Data Warehouse.
Example

60 Distributions 365 Partitions 21900 Data Buckets

21900 Data Buckets Ideal Segment 21 900 000 000 Rows


Size (1M Rows)
Lower Granularity (week, month)
can perform better depending on
how much data you have.
Fact Tables

Large ones are better as Columnstores

Distributed through Hash key as much as


possible as long as it is even
Partitioned only if the table is large
enough to fill up each segment
Dimension Tables

Can be Hash distributed or Round-Robin if there is no clear candidate join key

Columnstore for large dimensions

Heap or Clustered Index for small dimensions

Add secondary indexes for alternate join columns

Partitioning not recommended


DEMO
Analyse data distribution at On-premises Datawarehouse before migrating to
Azure Synapse Data Pool.

• We will use Microsoft’s AdventureworksDW database as on-premises data warehouse.


• We will analyse one dimension and one fact table.
• Same process can be repeated to other tables of on-premises database.
Summary
MPP or Massive Parallel Processing
Billing = Compute + Storage
Data Distribution (Hash, Round-robin, Replicate)
Data types and Table types
Partitioning Data
Best practice – Fact and Dimension table design
Demo – Analyse Data Distribution

You might also like

pFad - Phonifier reborn

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.


Alternative Proxies:

Alternative Proxy

pFad Proxy

pFad v3 Proxy

pFad v4 Proxy