Unit 4 Databases, Cloud & Snowflake: Prof. Thushara Weerawardane
• Structured Data?
• Structured data is neat, has a known schema, and fits into fixed fields in a table.
• Unstructured Data?
• Unstructured data has no predefined schema or structure.
Key Database Categories
RDBMS ?
Data Warehouse ?
Data Lake ?
Data Warehouse
• A data warehouse sits on top of several databases and is used for business intelligence.
• The data warehouse consumes data from all these databases and creates a layer optimized for data analytics.
• The schema is applied on import (schema on write).
Why Data Lakes?
• Data lakes were created to address the challenges that “big data” introduced. Those challenges are often summarized as the “3 Vs” of big data: volume, variety and velocity.
Data Lake
• A data lake is a centralized repository where you can store all of your structured, semi-structured and unstructured data at any scale and at very low cost.
• A data lake can store raw data as is, without any structure (schema).
• There is no need to run ETL or transformation jobs on the data before storing it.
• You can store many types of data, such as images, text, files and videos.
• You can store ML model artifacts, real-time data and analytics outputs in data lakes.
• Processing can be done on export, so the schema is defined on read.
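The difference between schema on import (data warehouse) and schema on read (data lake) can be sketched in a few lines of Python. This is only an illustration: the record layout and the helper names `write_to_warehouse` and `read_from_lake` are assumptions made for this example, not part of any real product.

```python
import json

# Raw events stored "as is": no schema is enforced when they are written.
raw_lake = [
    '{"user": "a", "amount": "12.50", "note": "first order"}',
    '{"user": "b", "amount": "7"}',
    '{"user": "c"}',
]

def write_to_warehouse(record: dict) -> dict:
    """Schema on write: validate and type the record before storing it."""
    if "user" not in record or "amount" not in record:
        raise ValueError(f"record does not match schema: {record}")
    return {"user": str(record["user"]), "amount": float(record["amount"])}

def read_from_lake(lines, fields):
    """Schema on read: the reader chooses fields and types at query time."""
    for line in lines:
        record = json.loads(line)
        yield {name: cast(record[name]) if name in record else None
               for name, cast in fields.items()}

# Warehouse path: records that do not fit the schema are rejected on import.
warehouse, rejected = [], []
for line in raw_lake:
    try:
        warehouse.append(write_to_warehouse(json.loads(line)))
    except ValueError:
        rejected.append(line)

# Lake path: everything was stored; the schema is applied only now.
lake_view = list(read_from_lake(raw_lake, {"user": str, "amount": float}))
```

In this sketch the warehouse keeps only the two records that match the schema, while the lake keeps all three and fills missing fields with `None` when they are read.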
ETL vs ELT
Extract Transform Load (ETL)
[Figure: Raw Data → Insight]
Extract Transform Load (ETL) cont.
Extract
- From sources
- Structured data
- Unstructured data
- Logical and physical extraction
Transform
- Data cleaning/organizing
- Single system format
- Improving data quality
Load
- Data sent to warehouse
- Batch load
- Incremental loading
- Full loading
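The three ETL stages can be sketched as a small Python pipeline. This is a minimal illustration under stated assumptions: the source rows are hard-coded, an in-memory SQLite database stands in for the warehouse, and the function names `extract`, `transform` and `load` are chosen for this example only.

```python
import sqlite3

def extract():
    """Extract: pull raw rows from the sources (hard-coded here)."""
    return [
        {"name": " Alice ", "country": "lk", "sales": "100"},
        {"name": "Bob", "country": "LK", "sales": "n/a"},  # bad value
        {"name": "Chen", "country": "sg", "sales": "250"},
    ]

def transform(rows):
    """Transform: clean values, unify the format, drop rows that fail checks."""
    cleaned = []
    for row in rows:
        try:
            sales = float(row["sales"])
        except ValueError:
            continue  # data-quality rule: skip rows with unparsable sales
        cleaned.append((row["name"].strip(), row["country"].upper(), sales))
    return cleaned

def load(conn, rows):
    """Load: a full load that replaces the table content in one batch."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (name TEXT, country TEXT, sales REAL)"
    )
    conn.execute("DELETE FROM sales")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")  # stand-in for the warehouse
load(conn, transform(extract()))
totals = dict(conn.execute("SELECT country, SUM(sales) FROM sales GROUP BY country"))
```

An incremental load would insert only new or changed rows instead of deleting the table first. In ELT the order of the last two steps is swapped: raw rows are loaded first and transformed inside the target system.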
Extract Load Transform (ELT)
ACID
Atomicity:
A transaction completes fully or not at all
Consistency:
Value preservation
Isolation:
Data remains isolated
Durability:
Data change is permanent
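How a transaction protects data can be illustrated with Python's built-in `sqlite3` module. The sketch below is an assumption-laden example (the `accounts` table and the `transfer` helper are hypothetical): a transfer either applies both updates or neither, and a `CHECK` constraint keeps balances valid.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money in one transaction; any failure rolls back both updates."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # the CHECK constraint rejected a negative balance

ok = transfer(conn, "alice", "bob", 30)       # succeeds
failed = transfer(conn, "alice", "bob", 500)  # violates the constraint
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

After the failed transfer, the credit already applied to the destination account is undone as well, so the balances are exactly those left by the successful transfer.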
SQL vs NoSQL?
SQL NoSQL
• Key-value storage: stores every item as a key-value pair (DynamoDB, Redis, etc.)
Good Productivity
High Scalability
Quick access
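The contrast between a relational table and key-value storage can be sketched as follows. This is an illustration only: an in-memory SQLite table plays the SQL side, and a plain Python dictionary with hypothetical `put` and `get` helpers stands in for a key-value store.

```python
import json
import sqlite3

# Relational: fixed columns, queried with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'Alice', 'Colombo')")
sql_row = conn.execute(
    "SELECT name, city FROM users WHERE id = ?", (1,)
).fetchone()

# Key-value: each item is looked up by its key; the value is an opaque blob.
kv_store = {}

def put(key, value):
    """Store any JSON-serializable value under the given key."""
    kv_store[key] = json.dumps(value)

def get(key):
    """Return the stored value, or None if the key does not exist."""
    raw = kv_store.get(key)
    return None if raw is None else json.loads(raw)

put("user:1", {"name": "Alice", "city": "Colombo"})
put("user:2", {"name": "Bob", "tags": ["new"]})  # items need not share fields
kv_item = get("user:1")
```

Lookup by key is direct, but a question such as "all users in one city" would need a scan or a separate index in the key-value version, whereas the relational version answers it with a `WHERE` clause.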
Example – Relational and Non-relational Databases
Example AWS Services
Data Ingestion
Data Analysis
Cloud Databases
Data is stored in a virtual environment and processed on a cloud computing platform. Some examples are
ScienceSoft
phoenixNAP
Cloud Architectures
Examples of Data Lake Architectures
Example of a Traditional Data Lake Architecture
An ODS (operational data store) is often an intermediary or staging area for a data warehouse. The ODS differs in that its data is overwritten and changes frequently. In contrast, a data warehouse contains static data for archiving, storage, historical analysis, and reporting. A data mart serves the same purpose as a data warehouse but comprises only one subject area.
Example Data Lake Architecture
Example of Big Data Platform Architecture
Example for Modern Data Lake Pattern
Modern data architectures (simple)
Database Progression
RDBMS
(Structured/SQL)
Digital World
(Internet, decentralized, data silos)
Data Warehouse
(Data integration, Audit & Governance)
Big Data
(Distributed, batch & stream, data lakes)
Cloud Adoption
(Agility, Scalability, Cost effective, ease of use)
Cloud Efficiency
(Intelligent efficiencies, better economics)
Logical Data Zones
Modern Directions of Data Zones – Virtual Space
Data Lake on Cloud
- Different processing capabilities
- Single source of truth
- Schema on read
- Flexible and quicker ingestion and consumption
- Decouple storage and compute
- Scalability
- Reliability
- Data security
- Future ready
- Low-cost storage
- Varied data access pattern
Snowflake
Bulk Loading & Unloading Data
Adequate Data Protection & Security
Data Solution – Enterprise Data
Snowflake cont.
• No software, no hardware, no maintenance. Snowflake is provided as Software-as-a-Service (SaaS) that runs completely on cloud infrastructure.
• Snowflake uses a central data repository for persisted data that is accessible from all compute nodes in the data warehouse (shared disk).
• Similar to shared-nothing architectures, Snowflake processes queries using MPP (massively parallel processing) compute clusters, where each node in the cluster stores a portion of the entire data set locally (shared nothing).
• Snowflake therefore combines shared-disk and shared-nothing architectures (SDA/SNA) in a multi-cluster architecture.
• This approach offers the data management simplicity of a shared-disk architecture, but with the performance and scale-out benefits of a shared-nothing architecture.
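The idea of one central repository combined with parallel compute nodes that each work on their own portion of the data can be sketched in Python. This is a toy model and not how Snowflake is implemented: the list `central_storage`, the thread pool and the helper names are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Central repository: the single persisted copy of the data (shared-disk side).
central_storage = list(range(1, 101))  # the values 1..100

def partition(data, n_nodes):
    """Give each compute node its own slice to work on (shared-nothing side)."""
    return [data[i::n_nodes] for i in range(n_nodes)]

def node_query(local_slice):
    """Each node computes a partial result on its local slice only."""
    return sum(x for x in local_slice if x % 2 == 0)

def run_query(data, n_nodes=4):
    """Fan the query out to all nodes in parallel, then combine the partials."""
    slices = partition(data, n_nodes)
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        partials = list(pool.map(node_query, slices))
    return sum(partials)

total = run_query(central_storage)  # sum of the even values in the data set
```

Changing `n_nodes` changes how the work is split but not the result, which mirrors the point that compute can be scaled without moving or duplicating the stored data.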
Snowflake Architecture
• Snowflake’s unique architecture consists of three key layers:
• Cloud Services (the brain of the system)
• Query Processing (the muscles of the system)
• Database Storage
• They all access the same data source without any contention (unlimited scale).
Authorization
What data can the user rightfully access?
(The process through which the system grants a user or a process the ability to access data and carry out actions)
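An authorization check of this kind can be sketched as a simple role-based lookup. The roles, users, table name and the `is_authorized` helper below are hypothetical and exist only for this illustration; a real system would manage grants inside the database.

```python
# Hypothetical roles and privileges, for illustration only.
ROLE_PRIVILEGES = {
    "analyst": {("sales", "SELECT")},
    "engineer": {("sales", "SELECT"), ("sales", "INSERT")},
}
USER_ROLES = {"dilani": "analyst", "ravi": "engineer"}

def is_authorized(user, table, action):
    """Return True only if the user's role was granted this action on the table."""
    role = USER_ROLES.get(user)
    return (table, action) in ROLE_PRIVILEGES.get(role, set())

can_read = is_authorized("dilani", "sales", "SELECT")
can_write = is_authorized("dilani", "sales", "INSERT")
```

Unknown users and ungranted actions are denied by default, which is the usual safe choice for access control.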