Distributed OLAP Databases: Intro To Database Systems Andy Pavlo
ADMINISTRIVIA
LAST CLASS
BIFURCATED ENVIRONMENT
Extract
Transform
Load
STAR SCHEMA
SALES_FACT(PRODUCT_FK, TIME_FK, LOCATION_FK, CUSTOMER_FK, PRICE, QUANTITY)
PRODUCT_DIM(CATEGORY_NAME, CATEGORY_DESC, PRODUCT_CODE, PRODUCT_NAME, PRODUCT_DESC)
CUSTOMER_DIM(ID, FIRST_NAME, LAST_NAME, EMAIL, ZIP_CODE)
LOCATION_DIM(COUNTRY, STATE_CODE, STATE_NAME, ZIP_CODE, CITY)
TIME_DIM(YEAR, DAY_OF_YEAR, MONTH_NUM, MONTH_NAME, DAY_OF_MONTH)
SNOWFLAKE SCHEMA
SALES_FACT(PRODUCT_FK, TIME_FK, LOCATION_FK, CUSTOMER_FK, PRICE, QUANTITY)
PRODUCT_DIM(CATEGORY_FK, PRODUCT_CODE, PRODUCT_NAME, PRODUCT_DESC)
CAT_LOOKUP(CATEGORY_ID, CATEGORY_NAME, CATEGORY_DESC)
CUSTOMER_DIM(ID, FIRST_NAME, LAST_NAME, EMAIL, ZIP_CODE)
LOCATION_DIM(COUNTRY, STATE_FK, ZIP_CODE, CITY)
STATE_LOOKUP(STATE_ID, STATE_CODE, STATE_NAME)
TIME_DIM(YEAR, DAY_OF_YEAR, MONTH_FK, DAY_OF_MONTH)
MONTH_LOOKUP(MONTH_NUM, MONTH_NAME, MONTH_SEASON)
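The difference between the two layouts can be sketched with a small runnable example. This is only an illustration under stated assumptions: SQLite (through Python's sqlite3 module) stands in for an OLAP DBMS, the table names star_product_dim and snow_product_dim are hypothetical names chosen so both variants fit in one database, only a few of the columns listed above are included, and the rows are made up. The same aggregate needs one join against the star layout and two joins against the snowflake layout.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales_fact (product_fk INTEGER, price REAL, quantity INTEGER);

-- Star: category attributes are stored directly in the product dimension.
CREATE TABLE star_product_dim (
    product_code INTEGER PRIMARY KEY, product_name TEXT, category_name TEXT);

-- Snowflake: category attributes are normalized into a lookup table.
CREATE TABLE cat_lookup (category_id INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE snow_product_dim (
    product_code INTEGER PRIMARY KEY, product_name TEXT, category_fk INTEGER);

-- Made-up sample rows (not from the slides).
INSERT INTO sales_fact VALUES (10, 10.0, 2), (11, 20.0, 1), (20, 5.0, 3);
INSERT INTO star_product_dim VALUES
    (10, 'Novel', 'Books'), (11, 'Atlas', 'Books'), (20, 'Robot', 'Toys');
INSERT INTO cat_lookup VALUES (1, 'Books'), (2, 'Toys');
INSERT INTO snow_product_dim VALUES
    (10, 'Novel', 1), (11, 'Atlas', 1), (20, 'Robot', 2);
""")

# Star schema: one join from the fact table to the dimension table.
star_rows = conn.execute("""
    SELECT p.category_name, SUM(f.price * f.quantity)
      FROM sales_fact AS f
      JOIN star_product_dim AS p ON f.product_fk = p.product_code
     GROUP BY p.category_name
     ORDER BY p.category_name
""").fetchall()

# Snowflake schema: an extra join is needed to reach the category name.
snow_rows = conn.execute("""
    SELECT c.category_name, SUM(f.price * f.quantity)
      FROM sales_fact AS f
      JOIN snow_product_dim AS p ON f.product_fk = p.product_code
      JOIN cat_lookup AS c ON p.category_fk = c.category_id
     GROUP BY c.category_name
     ORDER BY c.category_name
""").fetchall()

print(star_rows)
print(snow_rows)
```

Both queries return the same revenue per category; the snowflake variant stores each category name once at the cost of the additional join.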
STAR VS. SNOWFLAKE SCHEMA
PROBLEM SETUP
[Figure: an Application Server and data partitions; the surviving labels are P3 and P4.]
TODAY'S AGENDA
Execution Models
Query Planning
Distributed Join Algorithms
Cloud Systems
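PUSH QUERY TO DATA
SELECT * FROM R JOIN S
 ON R.id = S.id
[Figure: The Application Server sends the query to the Node holding P1 (IDs 1-100). The part of the join R⨝S that covers IDs [101,200] is sent to the Node holding P2 (IDs 101-200), which returns its result R⨝S.]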
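PULL DATA TO QUERY
SELECT * FROM R JOIN S
 ON R.id = S.id
[Figure: The Application Server sends the query to the Node responsible for P1 (IDs 1-100); the part of the join R⨝S that covers IDs [101,200] goes to the Node responsible for P2 (IDs 101-200). The Nodes pull the pages they need (Page ABC, Page XYZ) from shared Storage.]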
CMU 15-445/645 (Fall 2019)
[Figure: After pulling its pages from Storage, the Node for P2 (IDs 101-200) returns its result R⨝S.]
OBSERVATION
QUERY FAULT TOLERANCE
SELECT * FROM R JOIN S
 ON R.id = S.id
[Figure: An Application Server, two Nodes, and shared Storage. One Node computes R⨝S; the label "Result: R ⨝ S" appears next to Storage.]
QUERY PLANNING
QUERY PLAN FRAGMENTS
SELECT * FROM R JOIN S
 ON R.id = S.id
Union the output of each join to produce the final result.
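The idea of running the join once per partition and unioning the outputs can be sketched in a few lines. This is a minimal single-process illustration, not how any particular DBMS implements it: it assumes R and S are both partitioned on id so that each fragment can run on local data only, and the names join_fragment, run_query, and the sample rows are hypothetical.

```python
from typing import Dict, List, Tuple

Row = Tuple[int, str]            # (id, payload)
Joined = Tuple[int, str, str]    # (id, R payload, S payload)


def join_fragment(r_part: List[Row], s_part: List[Row]) -> List[Joined]:
    """Plan fragment: hash join of the R and S rows stored in one partition."""
    s_by_id: Dict[int, List[str]] = {}
    for s_id, s_val in s_part:
        s_by_id.setdefault(s_id, []).append(s_val)
    return [
        (r_id, r_val, s_val)
        for r_id, r_val in r_part
        for s_val in s_by_id.get(r_id, [])
    ]


def run_query(partitions: Dict[str, Dict[str, List[Row]]]) -> List[Joined]:
    """Run the same fragment on every partition and union the outputs."""
    result: List[Joined] = []
    for name in sorted(partitions):
        part = partitions[name]
        result.extend(join_fragment(part["R"], part["S"]))
    return result


# Assumed layout: both tables are range-partitioned on id.
partitions = {
    "P1": {"R": [(1, "r1"), (50, "r50")], "S": [(1, "s1"), (99, "s99")]},
    "P2": {"R": [(101, "r101"), (150, "r150")], "S": [(150, "s150")]},
}

result = run_query(partitions)
print(result)
```

In a real system each fragment would run on a different node and the union would happen at whichever node assembles the final result.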
OBSERVATION
SCENARIO #1
[Figure: S is replicated at both nodes; P1 and P2 each compute R⨝S.]
SCENARIO #2
[Figure: P1 and P2 each compute R⨝S.]
SCENARIO #3
[Figure: R is partitioned on Id (P1: Id 1-100, P2: Id 101-200); a copy of S is placed at each node.]
SCENARIO #4
[Figure: R and S are redistributed so that both are partitioned on Id (P1: Id 1-100, P2: Id 101-200); P1 and P2 then each compute R⨝S.]
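The last scenario can be read as a shuffle: rows of both tables are moved so that matching Ids end up on the same node, and each node then joins locally. The sketch below simulates that under stated assumptions. The Id ranges (1-100 on P1, 101-200 on P2) come from the figure; the starting placement of rows, the payload values, and the helper names target_node, shuffle, and local_join are hypothetical.

```python
from typing import Dict, List, Tuple

Row = Tuple[int, str]  # (Id, payload)


def target_node(row_id: int) -> str:
    """Range partitioning on Id, using the ranges shown in the figure."""
    return "P1" if row_id <= 100 else "P2"


def shuffle(table: Dict[str, List[Row]]) -> Dict[str, List[Row]]:
    """Send every row to the node responsible for its Id."""
    out: Dict[str, List[Row]] = {"P1": [], "P2": []}
    for rows in table.values():
        for row in rows:
            out[target_node(row[0])].append(row)
    return out


def local_join(r_rows: List[Row], s_rows: List[Row]) -> List[Tuple[int, str, str]]:
    """Join the R and S rows that now live on the same node."""
    s_by_id: Dict[int, List[str]] = {}
    for s_id, s_val in s_rows:
        s_by_id.setdefault(s_id, []).append(s_val)
    return [(r_id, r_val, s_val)
            for r_id, r_val in r_rows
            for s_val in s_by_id.get(r_id, [])]


# Assumed starting point: neither table is partitioned on Id.
r_before = {"P1": [(150, "a"), (20, "b")], "P2": [(30, "c"), (120, "d")]}
s_before = {"P1": [(120, "x"), (30, "y")], "P2": [(20, "z"), (199, "w")]}

r_after = shuffle(r_before)
s_after = shuffle(s_before)

joined = []
for node in ("P1", "P2"):
    joined.extend(local_join(r_after[node], s_after[node]))

print(sorted(joined))
```

After the shuffle every node holds the R and S rows for the same Id range, so no further data movement is needed to compute the join.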
SEMI-JOIN
Join operator where the result contains only columns from the left table.
Distributed DBMSs use semi-join to minimize the amount of data sent during joins.
→ This is like a projection pushdown.
SELECT R.id FROM R
 WHERE EXISTS (
   SELECT 1 FROM S
    WHERE R.id = S.id)
[Figure: R and S are stored on different nodes; only R.id is sent between them instead of a whole table.]
RELATIONAL ALGEBRA: SEMI-JOIN
Like a natural join, except that the result keeps only the attributes of the left table.
R(a_id,b_id,xxx)    S(a_id,b_id,yyy)
a_id b_id xxx       a_id b_id yyy
a1   101  X1        a3   103  Y1
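A rough sketch of why a semi-join reduces network traffic: the node holding R ships only the join key instead of whole tuples, and the node holding S answers with the keys that have a match. Everything here is simulated in one process; the function names, the row layout, and the sample data are assumptions for illustration, not taken from the slides.

```python
from typing import List, Set, Tuple

RRow = Tuple[int, str]   # (id, wide payload that should not cross the network)
SRow = Tuple[int, str]   # (id, payload)


def project_ids(r_rows: List[RRow]) -> Set[int]:
    """Runs on the node holding R: keep only the join key before sending."""
    return {r_id for r_id, _ in r_rows}


def matching_ids(r_ids: Set[int], s_rows: List[SRow]) -> Set[int]:
    """Runs on the node holding S: report which received ids have a match."""
    s_ids = {s_id for s_id, _ in s_rows}
    return r_ids & s_ids


def semi_join(r_rows: List[RRow], s_rows: List[SRow]) -> List[int]:
    """R semi-join S on id: one output per qualifying R row, R columns only."""
    matched = matching_ids(project_ids(r_rows), s_rows)
    return [r_id for r_id, _ in r_rows if r_id in matched]


r_rows = [(1, "x" * 100), (2, "y" * 100), (3, "z" * 100)]
s_rows = [(2, "s2"), (3, "s3"), (3, "s3-duplicate"), (4, "s4")]

result = semi_join(r_rows, s_rows)
values_sent_full = sum(len(row) for row in r_rows)   # shipping whole R tuples
values_sent_ids = len(project_ids(r_rows))           # shipping only R.id

print(result, values_sent_full, values_sent_ids)
```

Note that id 3 appears once in the result even though S contains two matching rows; unlike an inner join, a semi-join does not multiply R rows by the number of matches.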
CLOUD SYSTEMS
SERVERLESS DATABASES
[Figure: animation with an Application Server, a Node, and Storage; the labels Buffer Pool and Page Table appear with Storage in one frame, and the Node is absent in another.]
DISAGGREGATED COMPONENTS
System Catalogs
→ HCatalog, Google Data Catalog, Amazon Glue Data Catalog
Node Management
→ Kubernetes, Apache YARN, Cloud Vendor Tools
Query Optimizers
→ Greenplum Orca, Apache Calcite
UNIVERSAL FORMATS
CONCLUSION
NEXT CLASS